[TiDB] Importing TXT files into the database efficiently
2022-07-28 21:09:00 【Digital China cloud base】
I. Preface
In projects, we often run into data import requirements, for example loading data from txt or csv files into a TiDB cluster. This was originally a simple task, until a bank project where the customer gave us a very short window to complete the data import, which quietly put me under some pressure. Only then did I start thinking about how, for businesses whose data arrives as files, to finish loading it into the database in a very short time. That is why I began studying how to import data into TiDB faster and more safely.
II. TiDB Cluster Preparation
Before importing data, the TiDB cluster itself of course needs some preparation~
Otherwise a write hotspot, or a transaction that is too large, can make the import slow or even fail, leaving the data only partially imported. That would be embarrassing...
1. Avoid write hotspots
In a massive-insert scenario, write hotspots are likely to appear, so we must prepare a response in advance.
1.1 For inserting data into a new table, add two parameters when creating the table structure.
Example:
CREATE TABLE `t1` (
  `id` int(11) NOT NULL
) shard_row_id_bits = 4 pre_split_regions = 4;
1.2 For appending data to an existing table, one that already has a schema and historical data, the regions must be split manually.
We can use the split-table-region statement to deal with the write hotspot problem.
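A minimal sketch of manual region splitting; the row-id range of 0..1000000 and the count of 16 regions are illustrative assumptions, so pick values that match your actual data distribution:

```sql
-- Pre-split the table's row-id range into 16 evenly spaced regions
-- (the range 0..1000000 is an assumption for illustration):
SPLIT TABLE t1 BETWEEN (0) AND (1000000) REGIONS 16;

-- A secondary index can be pre-split the same way
-- (the index name idx_k is hypothetical):
SPLIT TABLE t1 INDEX idx_k BETWEEN (0) AND (1000000) REGIONS 16;
```

Split only the indexes your table actually has; each SPLIT statement returns how many regions were created and how many scattered successfully.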

Note that after a table is created with the shard_row_id_bits and pre_split_regions parameters, you need to start importing data into the target table as soon as possible; otherwise the pre-split empty regions may be merged away, bringing the write hotspot problem back.
After solving the write hotspot problem, you also need to consider that inserting a large amount of data may cause TiDB to OOM.
2. Avoid OOM from large transactions
In a TiDB cluster, every row being inserted occupies memory on the tidb-server node. When a single transaction inserts too much data, the tidb-server node uses too much memory and an OOM can occur. To work around large transactions, several parameters need to be enabled.
Add to the tidb-server configuration:
enable-batch-dml: true
And set:
SET @@GLOBAL.tidb_dml_batch_size = 20000;
SET @@SESSION.tidb_batch_insert = 1;
These three parameters must be used together. They mean that within one insert transaction, TiDB splits the single large insert into small transactions of 20000 rows each and commits them in batches. This neatly avoids OOM on the tidb-server node, but it comes at a price: TiDB's atomicity and isolation guarantees no longer hold, so it is not generally recommended. Everyone has to choose based on their actual situation.
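A minimal usage sketch under those settings; the copy table `sbtest_copy` is a hypothetical name for illustration. A single statement that moves millions of rows is then committed in 20000-row chunks rather than as one huge transaction:

```sql
SET @@GLOBAL.tidb_dml_batch_size = 20000;
SET @@SESSION.tidb_batch_insert = 1;

-- One logical insert, committed in 20000-row batches
-- (sbtest_copy is a hypothetical target table):
INSERT INTO sbtest_copy SELECT * FROM sbtest;
```

If the statement is interrupted partway through, the already-committed batches stay committed, which is exactly the loss of atomicity described above.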
III. File Import Test
1. Choosing an import tool
1.1 Navicat
It has a very friendly interactive interface with a near-zero learning curve, and you can import data directly. Internally, Navicat executes the data in the file in batches, so it is not slow, but it only runs on Windows.
1.2 load data
Mysql General batch import method in , You can specify the hexadecimal separator ,TiDB Middle Division LOAD DATA...REPLACE INTO Beyond grammar ,LOAD DATA Statements should be fully compatible MySQL.
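A minimal load data sketch for the test table used below; the file path, separators, and column list are assumptions to adjust to your data:

```sql
LOAD DATA LOCAL INFILE '/data/sbtest.csv'  -- the path is an assumption
INTO TABLE sbtest
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(id, k, c, pad);
```

To parallelize, split the source file and run one LOAD DATA session per chunk.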
1.3 Lightning
TiDB Lightning is the highly recommended tool for bulk logical data import. It converts the data directly into key-value pairs and ingests them into TiKV, so in theory it is the fastest way to import data. But using Lightning also has some inconveniences: repeated imports require frequent changes to the configuration file; file names must take the form "schema.table"; txt file data cannot be imported; and if an import fails, you have to manually switch the TiKV cluster back to normal mode, among other things.
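A minimal tidb-lightning configuration sketch for local-backend mode; all paths and addresses below are assumptions for illustration:

```toml
[tikv-importer]
backend = "local"                  # fastest mode; ingests KV pairs into TiKV directly
sorted-kv-dir = "/data/sorted-kv"  # local scratch space for sorted KV data

[mydumper]
data-source-dir = "/data/csv"      # files must be named schema.table.csv

[tidb]
host = "127.0.0.1"
port = 4000
user = "root"
status-port = 10080
pd-addr = "127.0.0.1:2379"
```

The tidb mode compared in the tests below is selected by setting backend = "tidb" instead; it goes through SQL and needs no PD or status-port access.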
2. Data preparation
For this test I prepared four csv files of different data volumes, with file sizes from 100 MB to 11 GB.

The table structure is:
CREATE TABLE `sbtest` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`k` int(11) NOT NULL DEFAULT '0',
`c` char(120) NOT NULL DEFAULT '',
`pad` char(60) NOT NULL DEFAULT '',
PRIMARY KEY (`id`) /*T![clustered_index] NONCLUSTERED */,
KEY `k_1` (`k`)
) shard_row_id_bits = 4 pre_split_regions = 4 ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
3. TiDB cluster preparation

Each machine has 8 vCores, 32 GB of memory, and gigabit bandwidth. The configuration is limited; all tests were run in this cluster environment.
4. Test results
| Rows | Navicat | load data | Lightning (local) | Lightning (tidb) |
|---|---|---|---|---|
| 1,000,000 | 195.91 s | 42.23 s | 24.76 s | 103.14 s |
| 4,000,000 | 732.47 s | 180.99 s | 87.01 s | 399.44 s |
| 10,000,000 | 1811.8 s | 463.42 s | 190.71 s | 978.6 s |
| 60,000,000 | 11041.87 s | 2831.83 s | 1908.05 s | 6081.39 s |

5. Analysis of test results
1. Except for data imported with Lightning's local mode, every import method produced write hotspots to some degree.
2. Importing with Navicat consumed far more disk resources than the other methods, and it was also the slowest.
3. The load data method does not consume many resources; to improve its import efficiency, the only option is to parallelize the import manually.
4. Lightning's local mode imported data the fastest and consumed the least cluster resources.
5. Lightning's tidb mode consumed resources similarly to the load data method, but was less efficient than load data.
6. Throughout the imports, the memory used by the TiDB cluster stayed normal; no OOM occurred in any run.
7. A single file that is too large hurts the file-processing efficiency of Lightning's local mode.
8. Lightning's local mode places high demands on the Lightning machine's CPU and disk, and on the cluster network.
9. When Lightning reads file names, their case must exactly match the target table's, otherwise Lightning reports an error.
IV. Summary
Lightning has the most restrictions, but also the most complete feature set, and the import speed of its local mode leaves the other import tools far behind.
load data supports importing txt file data, and it lets you customize common options such as the field separator; its import speed is second only to Lightning's local mode. Building a script around load data that meets the business requirements is much easier than developing directly against Lightning.
For scenarios that need Lightning's features, and for incremental data imports, Lightning's tidb mode is recommended.
Navicat is recommended only for beginners importing small amounts of data.

Copyright notice: this article was compiled and written by the Digital China Cloud Base team. If you reproduce it, please credit the source.
Search for "Digital China Cloud Base" on WeChat official accounts and reply "TiDB" in the background to join the TiDB technology exchange group!