[TiDB] Importing TXT Files into the Database, Efficiently
2022-07-28 21:09:00 【Digital China cloud base】
I. Preface
In projects we often need to import data, for example loading txt or csv files into a TiDB cluster. This was always a simple task, until a bank project where the customer allowed only a very short window for the import, which quietly put some pressure on me. Only then did I start thinking seriously about businesses whose data arrives as files, and how to get that data into TiDB faster and more safely within a very short time. That is why I began studying how to speed up and secure data imports into TiDB.
II. TiDB Cluster Preparation
Before importing data, TiDB itself needs some preparation.
Otherwise a write hotspot, or an oversized transaction, can make the import slow or even fail, leaving the data only partially loaded. That would be embarrassing...
1. Avoid write hotspots
In massive-insert scenarios, write hotspots are likely, so we must prepare in advance.
1.1 For inserts into a brand-new table, add two parameters when creating the table structure.
Example:
CREATE TABLE `t1` (
  `id` int(11) NOT NULL
) shard_row_id_bits = 4 pre_split_regions = 4;
1.2 When appending data to a table that already exists and holds historical data, you need to split regions manually.
We can use SPLIT TABLE ... REGION statements to head off the write-hotspot problem.
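For example, a hedged sketch of pre-splitting (the table name, row-ID range, region count, and index name are illustrative; pick them from your actual data distribution):

```sql
-- Pre-split table t1 into 16 regions across the row-ID range 0..9999999.
SPLIT TABLE t1 BETWEEN (0) AND (9999999) REGIONS 16;

-- An index can be pre-split the same way (index name is an assumption):
SPLIT TABLE t1 INDEX k_1 BETWEEN (0) AND (9999999) REGIONS 16;
```

After splitting, you can check the distribution with `SHOW TABLE t1 REGIONS;`.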

Note that after creating a table with the shard_row_id_bits and pre_split_regions parameters, you should start importing data into the target table as soon as possible. Otherwise the empty pre-split regions may be merged away, and the write-hotspot problem comes back.
With write hotspots handled, you also need to consider that inserting a large amount of data in one transaction may push tidb-server into an OOM situation.
2. Avoid OOM from large transactions
In a TiDB cluster, inserted data occupies memory on the tidb-server node until the transaction commits. When a single transaction inserts too much data, tidb-server uses too much memory and can hit OOM. To mitigate large transactions, you need to enable several parameters:
Add:
enable-batch-dml: true
SET @@GLOBAL.tidb_dml_batch_size = 20000;
SET @@SESSION.tidb_batch_insert = 1;
These three parameters work together: within one INSERT transaction, TiDB splits the large insert into small transactions of 20,000 rows each and commits them in batches. This effectively avoids tidb-server OOM, but it comes at a cost: TiDB's atomicity and isolation guarantees no longer hold for the statement, so it is not generally recommended. Choose according to your actual situation.
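If you would rather not relax atomicity on the server side, the same idea can be applied in the client: split the rows into fixed-size batches and commit each batch as its own small transaction. A minimal sketch (the batch size, table name, and multi-row INSERT shape are illustrative assumptions, not TiDB requirements):

```python
from typing import Iterable, Iterator, List, Tuple

# One row of the sbtest-style table: (id, k, c, pad)
Row = Tuple[int, int, str, str]

def batched(rows: Iterable[Row], size: int = 20000) -> Iterator[List[Row]]:
    """Yield successive fixed-size batches from an iterable of rows."""
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

def insert_sql(batch: List[Row]) -> str:
    """Build one multi-row INSERT for a batch. Values are inlined here
    only for illustration; a real client should use placeholders."""
    values = ",".join(f"({i},{k},'{c}','{p}')" for i, k, c, p in batch)
    return f"INSERT INTO sbtest (id,k,c,pad) VALUES {values}"
```

Each generated statement would then be executed and committed separately, keeping every transaction small.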
III. File Import Test
1. Choosing an import tool
1.1 Navicat
It has an excellent interactive interface with almost zero learning cost, and can import data directly. Internally, Navicat executes the file's data in batches, so it is not slow, but it only runs on Windows.
1.2 load data
MySQL's common batch import method. You can specify separators, including in hexadecimal. Apart from the LOAD DATA ... REPLACE INTO syntax, TiDB's LOAD DATA statement is fully compatible with MySQL.
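A hedged example (the file path, table name, and separator choice are illustrative; the 0x09 hex literal is equivalent to a tab character):

```sql
LOAD DATA LOCAL INFILE '/data/sbtest.txt'
INTO TABLE sbtest
FIELDS TERMINATED BY 0x09
LINES TERMINATED BY '\n'
(id, k, c, pad);
```

Note that the client must be started with local-infile enabled for LOCAL to work.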
1.3 Lightning
TiDB Lightning is the highly recommended tool for bulk logical data import. In its local mode it converts data directly into key-value pairs and ingests them into TiKV, so in theory it is the fastest way to import data. But Lightning also has some inconveniences: repeated imports require editing the configuration file each time; data file names must follow the "dbname.tablename" pattern; txt files cannot be imported; and if an import fails, you have to manually switch the TiKV cluster back to normal mode, among other things.
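For orientation, a minimal tidb-lightning.toml sketch (all paths, addresses, and ports are illustrative assumptions; consult the Lightning documentation for your version):

```toml
[tikv-importer]
backend = "local"              # or "tidb" for the slower, online-safe mode
sorted-kv-dir = "/data/sorted-kv"

[mydumper]
data-source-dir = "/data/csv"  # files must be named dbname.tablename.csv

[tidb]
host = "127.0.0.1"
port = 4000
user = "root"
status-port = 10080
pd-addr = "127.0.0.1:2379"
```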
2. Data preparation
For this test I prepared four csv files of different sizes, ranging from 100 MB to 11 GB:

The table structure is:
CREATE TABLE `sbtest` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `k` int(11) NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`) /*T![clustered_index] NONCLUSTERED */,
  KEY `k_1` (`k`)
) shard_row_id_bits = 4 pre_split_regions = 4 ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
3. TiDB cluster setup

Each machine has 8 vCores, 32 GB of memory, and gigabit bandwidth. The configuration is limited; all tests were run in this cluster environment.
4. Test results
| Rows | Navicat | load data | Lightning (local) | Lightning (tidb) |
|---|---|---|---|---|
| 1,000,000 | 195.91 s | 42.23 s | 24.76 s | 103.14 s |
| 4,000,000 | 732.47 s | 180.99 s | 87.01 s | 399.44 s |
| 10,000,000 | 1811.8 s | 463.42 s | 190.71 s | 978.6 s |
| 60,000,000 | 11041.87 s | 2831.83 s | 1908.05 s | 6081.39 s |

5. Analysis of the test results
1. Except for Lightning's local mode, every import method produces write hotspots to some degree.
2. Navicat consumes far more disk resources than the other methods, and is also the slowest.
3. load data is light on resources; the only way to improve its throughput is to run imports in parallel manually.
4. Lightning's local mode imports data fastest and consumes the least cluster resources.
5. Lightning's tidb mode consumes resources similarly to load data, but is less efficient than load data.
6. During all imports, the TiDB cluster's memory usage stayed normal; no OOM occurred.
7. A single file that is too large hurts the file-processing efficiency of Lightning's local mode.
8. Lightning's local mode places high demands on the Lightning machine's CPU and disks, and on the cluster network.
9. When Lightning reads file names, their case must match the target table exactly, otherwise Lightning reports an error.
IV. Summary
Lightning has the most restrictions, but also the most complete feature set, and the import speed of its local mode leaves the other import tools far behind.
load data supports importing txt file data and lets you customize common parameters such as the field separator; its speed is second only to Lightning's local mode. Building a script around load data that fits the business requirements is much easier than developing directly against Lightning.
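As a hedged sketch of such a script (all paths, the table name, and the separator are illustrative assumptions), one might split a large file into chunks and emit one LOAD DATA statement per chunk; in a real run, each statement would be piped to mysql in the background to import chunks in parallel. The sketch below is a dry run that generates its own demo data:

```shell
#!/bin/sh
# Dry-run sketch: create a demo data file, split it into chunks, and
# print one LOAD DATA statement per chunk. In production you would
# replace `echo` with a backgrounded `mysql -e` call per chunk.
SRC=/tmp/sbtest_demo.txt
CHUNK_DIR=/tmp/sbtest_chunks
seq 1 10 > "$SRC"                        # stand-in for the real data file
mkdir -p "$CHUNK_DIR"
rm -f "$CHUNK_DIR"/chunk_*
split -l 3 -d "$SRC" "$CHUNK_DIR/chunk_" # 3 lines per chunk for the demo

for f in "$CHUNK_DIR"/chunk_*; do
  echo "LOAD DATA LOCAL INFILE '$f' INTO TABLE test.sbtest (id,k,c,pad);"
done
```

With real data, running the per-chunk mysql calls with `&` and a final `wait` gives the manual parallelism noted in the analysis above.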
For scenarios that need Lightning's features as well as incremental data import, Lightning's tidb mode is recommended.
Navicat is recommended only for beginners importing small amounts of data.

Copyright notice: this article was compiled and written by the Digital China Cloud Base team. If you reproduce it, please cite the source.
Search for the "Digital China Cloud Base" official account and reply "TiDB" to join the TiDB technology exchange group!