[TiDB] Importing TXT Files into the Database, Efficiently
2022-07-28 21:09:00 【Digital China cloud base】
I. Preface
In projects we often need to import data, for example loading txt or csv files into a TiDB cluster. This was always a simple task, until a bank project where the customer allowed only a very short window for the import, which quietly put some pressure on me. Only then did I start thinking seriously about businesses whose data arrives as files, and how to get that data into TiDB faster and more safely within a very short time. That is why I began studying how to speed up and secure data imports into TiDB.
II. TiDB Cluster Preparation
Before importing data, TiDB itself needs some preparation.
Otherwise a write hotspot, or an oversized transaction, can make the import slow or even fail, leaving the data only partially loaded. That would be embarrassing...
1. Avoid write hotspots
In massive-insert scenarios, write hotspots are likely, so we must prepare in advance.
1.1 For inserts into a brand-new table, add two parameters when creating the table structure.
Example:
CREATE TABLE `t1` (
  `id` int(11) NOT NULL
) shard_row_id_bits = 4 pre_split_regions = 4;
1.2 When appending data to a table that already exists and holds historical data, you need to split regions manually.
We can use SPLIT TABLE ... REGION statements to head off the write-hotspot problem.
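For example, a hedged sketch of pre-splitting (the table name, row-ID range, region count, and index name are illustrative; pick them from your actual data distribution):

```sql
-- Pre-split table t1 into 16 regions across the row-ID range 0..9999999.
SPLIT TABLE t1 BETWEEN (0) AND (9999999) REGIONS 16;

-- An index can be pre-split the same way (index name is an assumption):
SPLIT TABLE t1 INDEX k_1 BETWEEN (0) AND (9999999) REGIONS 16;
```

After splitting, you can check the distribution with `SHOW TABLE t1 REGIONS;`.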

Note that after creating a table with the shard_row_id_bits and pre_split_regions parameters, you should start importing data into the target table as soon as possible. Otherwise the empty pre-split regions may be merged away, and the write-hotspot problem comes back.
With write hotspots handled, you also need to consider that inserting a large amount of data in one transaction may push tidb-server into an OOM situation.
2. Avoid OOM from large transactions
In a TiDB cluster, inserted data occupies memory on the tidb-server node until the transaction commits. When a single transaction inserts too much data, tidb-server uses too much memory and can hit OOM. To mitigate large transactions, you need to enable several parameters:
Add:
enable-batch-dml: true
SET @@GLOBAL.tidb_dml_batch_size = 20000;
SET @@SESSION.tidb_batch_insert = 1;
These three parameters work together: within one INSERT transaction, TiDB splits the large insert into small transactions of 20,000 rows each and commits them in batches. This effectively avoids tidb-server OOM, but it comes at a cost: TiDB's atomicity and isolation guarantees no longer hold for the statement, so it is not generally recommended. Choose according to your actual situation.
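If you would rather not relax atomicity on the server side, the same idea can be applied in the client: split the rows into fixed-size batches and commit each batch as its own small transaction. A minimal sketch (the batch size, table name, and multi-row INSERT shape are illustrative assumptions, not TiDB requirements):

```python
from typing import Iterable, Iterator, List, Tuple

# One row of the sbtest-style table: (id, k, c, pad)
Row = Tuple[int, int, str, str]

def batched(rows: Iterable[Row], size: int = 20000) -> Iterator[List[Row]]:
    """Yield successive fixed-size batches from an iterable of rows."""
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

def insert_sql(batch: List[Row]) -> str:
    """Build one multi-row INSERT for a batch. Values are inlined here
    only for illustration; a real client should use placeholders."""
    values = ",".join(f"({i},{k},'{c}','{p}')" for i, k, c, p in batch)
    return f"INSERT INTO sbtest (id,k,c,pad) VALUES {values}"
```

Each generated statement would then be executed and committed separately, keeping every transaction small.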
III. File Import Test
1. Choosing an import tool
1.1 Navicat
It has an excellent interactive interface with almost zero learning cost, and can import data directly. Internally, Navicat executes the file's data in batches, so it is not slow, but it only runs on Windows.
1.2 load data
MySQL's common batch import method. You can specify separators, including in hexadecimal. Apart from the LOAD DATA ... REPLACE INTO syntax, TiDB's LOAD DATA statement is fully compatible with MySQL.
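A hedged example (the file path, table name, and separator choice are illustrative; the 0x09 hex literal is equivalent to a tab character):

```sql
LOAD DATA LOCAL INFILE '/data/sbtest.txt'
INTO TABLE sbtest
FIELDS TERMINATED BY 0x09
LINES TERMINATED BY '\n'
(id, k, c, pad);
```

Note that the client must be started with local-infile enabled for LOCAL to work.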
1.3 Lightning
TiDB Lightning is the highly recommended tool for bulk logical data import. In its local mode it converts data directly into key-value pairs and ingests them into TiKV, so in theory it is the fastest way to import data. But Lightning also has some inconveniences: repeated imports require editing the configuration file each time; data file names must follow the "dbname.tablename" pattern; txt files cannot be imported; and if an import fails, you have to manually switch the TiKV cluster back to normal mode, among other things.
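For orientation, a minimal tidb-lightning.toml sketch (all paths, addresses, and ports are illustrative assumptions; consult the Lightning documentation for your version):

```toml
[tikv-importer]
backend = "local"              # or "tidb" for the slower, online-safe mode
sorted-kv-dir = "/data/sorted-kv"

[mydumper]
data-source-dir = "/data/csv"  # files must be named dbname.tablename.csv

[tidb]
host = "127.0.0.1"
port = 4000
user = "root"
status-port = 10080
pd-addr = "127.0.0.1:2379"
```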
2. Data preparation
For this test I prepared four csv files of different sizes, ranging from 100 MB to 11 GB:

The table structure is:
CREATE TABLE `sbtest` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `k` int(11) NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`) /*T![clustered_index] NONCLUSTERED */,
  KEY `k_1` (`k`)
) shard_row_id_bits = 4 pre_split_regions = 4 ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
3. TiDB cluster setup

Each machine has 8 vCores, 32 GB of memory, and gigabit bandwidth. The configuration is limited; all tests were run in this cluster environment.
4. Test results
| Rows | Navicat | load data | Lightning (local) | Lightning (tidb) |
|---|---|---|---|---|
| 1,000,000 | 195.91 s | 42.23 s | 24.76 s | 103.14 s |
| 4,000,000 | 732.47 s | 180.99 s | 87.01 s | 399.44 s |
| 10,000,000 | 1811.8 s | 463.42 s | 190.71 s | 978.6 s |
| 60,000,000 | 11041.87 s | 2831.83 s | 1908.05 s | 6081.39 s |

5. Analysis of the test results
1. Except for Lightning's local mode, every import method produces write hotspots to some degree.
2. Navicat consumes far more disk resources than the other methods, and is also the slowest.
3. load data is light on resources; the only way to improve its throughput is to run imports in parallel manually.
4. Lightning's local mode imports data fastest and consumes the least cluster resources.
5. Lightning's tidb mode consumes resources similarly to load data, but is less efficient than load data.
6. During all imports, the TiDB cluster's memory usage stayed normal; no OOM occurred.
7. A single file that is too large hurts the file-processing efficiency of Lightning's local mode.
8. Lightning's local mode places high demands on the Lightning machine's CPU and disks, and on the cluster network.
9. When Lightning reads file names, their case must match the target table exactly, otherwise Lightning reports an error.
IV. Summary
Lightning has the most restrictions, but also the most complete feature set, and the import speed of its local mode leaves the other import tools far behind.
load data supports importing txt file data and lets you customize common parameters such as the field separator; its speed is second only to Lightning's local mode. Building a script around load data that fits the business requirements is much easier than developing directly against Lightning.
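As a hedged sketch of such a script (all paths, the table name, and the separator are illustrative assumptions), one might split a large file into chunks and emit one LOAD DATA statement per chunk; in a real run, each statement would be piped to mysql in the background to import chunks in parallel. The sketch below is a dry run that generates its own demo data:

```shell
#!/bin/sh
# Dry-run sketch: create a demo data file, split it into chunks, and
# print one LOAD DATA statement per chunk. In production you would
# replace `echo` with a backgrounded `mysql -e` call per chunk.
SRC=/tmp/sbtest_demo.txt
CHUNK_DIR=/tmp/sbtest_chunks
seq 1 10 > "$SRC"                        # stand-in for the real data file
mkdir -p "$CHUNK_DIR"
rm -f "$CHUNK_DIR"/chunk_*
split -l 3 -d "$SRC" "$CHUNK_DIR/chunk_" # 3 lines per chunk for the demo

for f in "$CHUNK_DIR"/chunk_*; do
  echo "LOAD DATA LOCAL INFILE '$f' INTO TABLE test.sbtest (id,k,c,pad);"
done
```

With real data, running the per-chunk mysql calls with `&` and a final `wait` gives the manual parallelism noted in the analysis above.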
For scenarios that need Lightning's features as well as incremental data import, Lightning's tidb mode is recommended.
Navicat is recommended only for beginners importing small amounts of data.

Copyright notice: this article was compiled and written by the Digital China Cloud Base team. If you reproduce it, please cite the source.
Search for the "Digital China Cloud Base" official account and reply "TiDB" to join the TiDB technology exchange group!