How does MySQL archive data?
2022-07-28 21:56:00 【JavaShark】
Archiving MySQL data usually involves two actions:
Transfer: migrate data from the business instance to the archive instance.
Delete: delete the migrated data from the business instance.
When this kind of requirement comes up, developers usually hand it over to the DBA and let the DBA deal with it.
So many developers are curious: how does the DBA actually perform the archiving? Will the table be locked if the archiving condition has no index? Is it safe? Could data end up deleted without having been archived successfully?
To answer these questions, this article introduces pt-archiver, a handy tool for archiving MySQL data.
1. What is pt-archiver
pt-archiver is one of the tools in Percona Toolkit.
Percona Toolkit is a MySQL toolkit provided by Percona.
It contains many practical MySQL management tools.
For example, the online schema change tool pt-online-schema-change and the master-slave consistency check tool pt-table-checksum that we use all the time.
It is no exaggeration to say that being proficient with Percona Toolkit is one of the essential skills of a MySQL DBA.
2. Installation
Percona Toolkit download address: https://www.percona.com/downloads/percona-toolkit/LATEST/

The official site provides ready-made packages for multiple systems.
What I use most often is the Linux - Generic binary package.
Let's take the Linux - Generic version as an example and see how to install it.
# cd /usr/local/
# wget https://downloads.percona.com/downloads/percona-toolkit/3.3.1/binary/tarball/percona-toolkit-3.3.1_x86_64.tar.gz --no-check-certificate
# tar xvf percona-toolkit-3.3.1_x86_64.tar.gz
# cd percona-toolkit-3.3.1
# yum install perl-ExtUtils-MakeMaker perl-DBD-MySQL perl-Digest-MD5
# perl Makefile.PL
# make
# make install
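After installation, you can quickly check that the tool is available; the version printed should match the tarball you downloaded:
# pt-archiver --version
# pt-archiver --help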
3. A simple example
First, let's look at a simple archiving demo.
The test data:
mysql> show create table employees.departments\G
*************************** 1. row ***************************
Table: departments
Create Table: CREATE TABLE `departments` (
`dept_no` char(4) NOT NULL,
`dept_name` varchar(40) NOT NULL,
PRIMARY KEY (`dept_no`),
UNIQUE KEY `dept_name` (`dept_name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)
mysql> select * from employees.departments;
+---------+--------------------+
| dept_no | dept_name          |
+---------+--------------------+
| d009    | Customer Service   |
| d005    | Development        |
| d002    | Finance            |
| d003    | Human Resources    |
| d001    | Marketing          |
| d004    | Production         |
| d006    | Quality Management |
| d008    | Research           |
| d007    | Sales              |
+---------+--------------------+
9 rows in set (0.00 sec)
Next, we archive the data of the employees.departments table from 192.168.244.10 to 192.168.244.128.
The command is as follows:
pt-archiver --source h=192.168.244.10,P=3306,u=pt_user,p=pt_pass,D=employees,t=departments --dest h=192.168.244.128,P=3306,u=pt_user,p=pt_pass,D=employees,t=departments --where "1=1"
Three parameters are specified on the command line.
--source: the DSN of the source (business) instance.
DSN is a common concept in Percona Toolkit; it can be understood as a set of abbreviated key=value pairs describing the instance to connect to.
The abbreviations used in this article and their meanings are:
h: host
P: port
u: user
p: password
D: default database
t: table
--dest: the DSN of the destination (archive) instance.
--where: the archiving condition. "1=1" means the whole table is archived; a more realistic, time-based condition is shown below.
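In practice the condition is usually based on time. A sketch (the create_time column and the cutoff value are hypothetical; substitute your own column and boundary):
--where "create_time < '2022-01-01'"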
4. How it works
Let's look at how pt-archiver works with the help of the General log output.
Source instance log
2022-03-06T10:58:20.612857+08:00 10 Query SELECT /*!40001 SQL_NO_CACHE */ `dept_no`,`dept_name` FROM `employees`.`departments` FORCE INDEX(`PRIMARY`) WHERE (1=1) ORDER BY `dept_no` LIMIT 1
2022-03-06T10:58:20.613451+08:00 10 Query DELETE FROM `employees`.`departments` WHERE (`dept_no` = 'd001')
2022-03-06T10:58:20.620327+08:00 10 Query commit
2022-03-06T10:58:20.628409+08:00 10 Query SELECT /*!40001 SQL_NO_CACHE */ `dept_no`,`dept_name` FROM `employees`.`departments` FORCE INDEX(`PRIMARY`) WHERE (1=1) AND ((`dept_no` >= 'd001')) ORDER BY `dept_no` LIMIT 1
2022-03-06T10:58:20.629279+08:00 10 Query DELETE FROM `employees`.`departments` WHERE (`dept_no` = 'd002')
2022-03-06T10:58:20.636154+08:00 10 Query commit
...
Target instance log
2022-03-06T10:58:20.613144+08:00 18 Query INSERT INTO `employees`.`departments`(`dept_no`,`dept_name`) VALUES ('d001','Marketing')
2022-03-06T10:58:20.613813+08:00 18 Query commit
2022-03-06T10:58:20.628843+08:00 18 Query INSERT INTO `employees`.`departments`(`dept_no`,`dept_name`) VALUES ('d002','Finance')
2022-03-06T10:58:20.629784+08:00 18 Query commit
...
Combining the logs of the source and target instances, we can see the following:
1) pt-archiver first queries one record from the source instance, then inserts that record into the target instance.
Only after the insert on the target succeeds does it delete the record from the source.
This guarantees that a record is deleted only after it has been archived successfully.
2) Looking closely at the timestamps of these operations, the order is:
Query the record on the source.
Insert the record on the target.
Delete the record on the source.
COMMIT on the target.
COMMIT on the source.
This implementation borrows from the two-phase commit algorithm used in distributed transactions.
3) The "1=1" from the --where parameter is passed through to the SELECT.
"1=1" archives the whole table; other conditions can be specified as well, such as the time-based conditions we commonly use.
4) Every query goes through the primary key index, so even if the archiving condition has no index, no full table scan is produced.
5) Every delete is done by primary key, which avoids the risk of locking the whole table when the archiving condition is not indexed.
5. Batch archiving
Archiving with the parameters from the demo above is very inefficient when the data volume is large; after all, COMMIT is an expensive operation.
So in production we usually archive in batches.
The command is as follows:
pt-archiver --source h=192.168.244.10,P=3306,u=pt_user,p=pt_pass,D=employees,t=departments --dest h=192.168.244.128,P=3306,u=pt_user,p=pt_pass,D=employees,t=departments --where "1=1" --bulk-delete --limit 1000 --commit-each --bulk-insert
Compared with the previous archiving command, this one specifies four extra parameters:
--bulk-delete: delete in batches.
--limit: number of records archived per batch.
--commit-each: COMMIT only once per batch of records.
--bulk-insert: import the archived data into the archive instance with LOAD DATA LOCAL INFILE.
Let's look at the General log produced by the command above.
Source instance
2022-03-06T12:13:56.117984+08:00 53 Query SELECT /*!40001 SQL_NO_CACHE */ `dept_no`,`dept_name` FROM `employees`.`departments` FORCE INDEX(`PRIMARY`) WHERE (1=1) ORDER BY `dept_no` LIMIT 1000
...
2022-03-06T12:13:56.125129+08:00 53 Query DELETE FROM `employees`.`departments` WHERE (((`dept_no` >= 'd001'))) AND (((`dept_no` <= 'd009'))) AND (1=1) LIMIT 1000
2022-03-06T12:13:56.130055+08:00 53 Query commit
Target instance
2022-03-06T12:13:56.124596+08:00 51 Query LOAD DATA LOCAL INFILE '/tmp/hitKctpQTipt-archiver' INTO TABLE `employees`.`departments`(`dept_no`,`dept_name`)
2022-03-06T12:13:56.125616+08:00 51 Query commit
Note:
1) To execute LOAD DATA LOCAL INFILE, the local_infile parameter on the target instance must be set to ON (see the example after this list).
2) If --bulk-insert is not specified and --commit-each is not specified either, the target instance is still inserted into and committed row by row, just like in the demo.
3) If --commit-each is not specified, then even though the 9 records in the table are deleted by a single DELETE statement, pt-archiver still executes COMMIT 9 times because 9 records are involved. The same applies to the target instance.
4) When archiving with --bulk-insert, be careful: if something goes wrong during the import, such as a primary key conflict, pt-archiver will not report an error.
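For note 1), a quick way to check and enable local_infile on the target instance (a sketch; changing the global variable requires a privilege such as SUPER or SYSTEM_VARIABLES_ADMIN):
mysql> SHOW GLOBAL VARIABLES LIKE 'local_infile';
mysql> SET GLOBAL local_infile = ON;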
6. Speed comparison of different archiving parameters
The following table compares the execution time of different parameter combinations when archiving 200,000 rows.

From the data in the table, we can draw the following conclusions:
1) The first approach is the slowest.
In that case, both the source instance and the archive instance operate and commit row by row.
2) Specifying only --bulk-delete --limit 1000 is still very slow.
In that case, the source instance deletes in batches, but the number of COMMITs does not decrease.
The archive instance still inserts and commits row by row.
3) --bulk-delete --limit 1000 --commit-each
On top of the second approach, both the source and target instances now commit in batches.
4) --limit 1000 and --limit 5000 give similar archiving performance.
5) Comparing --bulk-delete --limit 1000 --bulk-insert with --bulk-delete --limit 1000 --commit-each --bulk-insert, the former does not set --commit-each.
Although both are batch operations, the former still executes the COMMIT operation 1000 times.
From this we can see that even empty transactions are not free.
7. Other common usages
1. Deleting data
Deleting data is another common usage scenario for pt-archiver.
The command is as follows:
pt-archiver --source h=192.168.244.10,P=3306,u=pt_user,p=pt_pass,D=employees,t=departments --where "1=1" --bulk-delete --limit 1000 --commit-each --purge --primary-key-only
--purge on the command line means delete only, without archiving.
--primary-key-only is also specified, so the SELECT queries only the primary key instead of all columns.
Next, let's look at the General log related to this delete command.
To show pt-archiver's delete logic more intuitively, --limit was set to 3 in the actual test.
# Disable autocommit (open a transaction)
set autocommit=0;
# Inspect the table structure to get the primary key
SHOW CREATE TABLE `employees`.`departments`;
# Start deleting the first batch of data
# FORCE INDEX(`PRIMARY`) forces the use of the primary key
# --primary-key-only is specified, so only the primary key column is queried
# There is actually no need to fetch all qualifying primary key values; only the minimum and maximum are needed.
SELECT /*!40001 SQL_NO_CACHE */ `dept_no` FROM `employees`.`departments` FORCE INDEX(`PRIMARY`) WHERE (1=1) ORDER BY `dept_no` LIMIT 3;
# Delete by primary key range; the --where condition is also included to avoid deleting rows by mistake
DELETE FROM `employees`.`departments` WHERE (((`dept_no` >= 'd001'))) AND (((`dept_no` <= 'd003'))) AND (1=1) LIMIT 3;
# Commit
commit;
# Delete the second batch of data
SELECT /*!40001 SQL_NO_CACHE */ `dept_no` FROM `employees`.`departments` FORCE INDEX(`PRIMARY`) WHERE (1=1) AND ((`dept_no` >= 'd003')) ORDER BY `dept_no` LIMIT 3;
DELETE FROM `employees`.`departments` WHERE (((`dept_no` >= 'd004'))) AND (((`dept_no` <= 'd006'))) AND (1=1) LIMIT 3;
commit;
# Delete the third batch of data
SELECT /*!40001 SQL_NO_CACHE */ `dept_no` FROM `employees`.`departments` FORCE INDEX(`PRIMARY`) WHERE (1=1) AND ((`dept_no` >= 'd006')) ORDER BY `dept_no` LIMIT 3;
DELETE FROM `employees`.`departments` WHERE (((`dept_no` >= 'd007'))) AND (((`dept_no` <= 'd009'))) AND (1=1) LIMIT 3;
commit;
# Delete the last batch of data
SELECT /*!40001 SQL_NO_CACHE */ `dept_no` FROM `employees`.`departments` FORCE INDEX(`PRIMARY`) WHERE (1=1) AND ((`dept_no` >= 'd009')) ORDER BY `dept_no` LIMIT 3;
commit;
If we have similar deletion requirements in our business code, we might as well borrow pt-archiver's approach.
2. Archiving data to a file
Data can be archived to a database, and it can also be archived to a file.
The command is as follows:
pt-archiver --source h=192.168.244.10,P=3306,u=pt_user,p=pt_pass,D=employees,t=departments --where "1=1" --bulk-delete --limit 1000 --file '/tmp/%Y-%m-%d-%D.%t'
Here --file is specified instead of --dest.
The file name can use date formatting specifiers; the supported specifiers and their meanings are as follows:
%d Day of the month, numeric (01..31)
%H Hour (00..23)
%i Minutes, numeric (00..59)
%m Month, numeric (01..12)
%s Seconds (00..59)
%Y Year, numeric, four digits
%D Database name
%t Table name
The generated file is a delimited text file that can later be loaded back into the database with the LOAD DATA INFILE command, as sketched below.
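A sketch of loading such a file back (the file name is what the pattern above would produce for the test date; check the actual field and line terminators in your file before running, since the statement below relies on the defaults):
LOAD DATA LOCAL INFILE '/tmp/2022-03-06-employees.departments'
INTO TABLE `employees`.`departments` (`dept_no`,`dept_name`);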
8. How to avoid master-slave delay
Whether we are archiving or deleting, DELETE operations have to be executed on the source instance.
Many people worry that deleting too many records will cause master-slave delay.
In fact, pt-archiver can automatically throttle the archiving (deleting) operation based on master-slave delay.
If the slave's delay exceeds 1s (the value of --max-lag) or its replication status is abnormal, the archiving (deleting) operation is paused until the slave recovers.
By default, pt-archiver does not check slave delay.
To check it, the slave's address must be set explicitly with --check-slave-lag, for example:
pt-archiver --source h=192.168.244.10,P=3306,u=pt_user,p=pt_pass,D=employees,t=departments --where "1=1" --bulk-delete --limit 1000 --commit-each --primary-key-only --purge --check-slave-lag h=192.168.244.20,P=3306,u=pt_user,p=pt_pass
Here only the delay of 192.168.244.20 is checked.
If there are multiple slaves to check, --check-slave-lag must be specified multiple times, once per slave, as shown below.
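For example (a sketch; 192.168.244.21 stands in for a hypothetical second slave):
pt-archiver --source h=192.168.244.10,P=3306,u=pt_user,p=pt_pass,D=employees,t=departments --where "1=1" --bulk-delete --limit 1000 --commit-each --purge --check-slave-lag h=192.168.244.20,P=3306,u=pt_user,p=pt_pass --check-slave-lag h=192.168.244.21,P=3306,u=pt_user,p=pt_pass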
9. Common parameters
--analyze
Execute ANALYZE TABLE after the archiving operation.
It can be followed by an arbitrary string: if the string contains s, ANALYZE is executed on the source instance.
If the string contains d, ANALYZE is executed on the target instance.
If it contains both d and s, ANALYZE is executed on both the source and the target instance. For example:
--analyze ds
--optimize
Execute OPTIMIZE TABLE after the archiving operation.
Usage is the same as --analyze.
--charset
Specify the character set of the connection.
Before MySQL 8.0, the default is latin1.
In MySQL 8.0, the default is utf8mb4.
Note that this default has nothing to do with the server character set character_set_server.
If this value is set explicitly, pt-archiver executes SET NAMES 'charset_name' right after the connection is established.
--[no]check-charset
Check whether the character set of the source (target) connection is consistent with the character set of the table.
If they are inconsistent, the following error is reported:
Character set mismatch: --source DSN uses latin1, table uses gbk. You can disable this check by specifying --no-check-charset.
When this happens, do not follow the hint and specify --no-check-charset to skip the check; doing so can easily produce garbled data.
For the error above, set --charset to the table's character set, as shown below.
Note that this option does not compare whether the character sets of the source and target instances are consistent.
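A sketch of rerunning with the table's character set (assuming the table really is gbk, as the error message reports):
pt-archiver --source h=192.168.244.10,P=3306,u=pt_user,p=pt_pass,D=employees,t=departments --dest h=192.168.244.128,P=3306,u=pt_user,p=pt_pass,D=employees,t=departments --where "1=1" --charset gbk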
--[no]check-columns
Check whether the column names of the source table and the target table are consistent.
Note that only the column names are checked; column order and column data types are not.
--columns
Archive only the specified columns.
One common case involves auto-increment columns: if the auto-increment values of the source table and the target table overlap, the auto-increment column should not be archived, and --columns must be used to explicitly specify the columns to archive.
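For example, to skip a hypothetical auto-increment column id and archive only the remaining columns (a sketch; --columns takes a comma-separated list of column names, and order_no, amount, create_time are illustrative):
--columns order_no,amount,create_time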
--dry-run
Only print the SQL that would be executed, without actually executing it.
Often used before the real run, to verify that the SQL to be executed is what you expect.
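For example, a dry run of the batch archiving command from earlier (a sketch; it only prints statements and touches no data):
pt-archiver --source h=192.168.244.10,P=3306,u=pt_user,p=pt_pass,D=employees,t=departments --dest h=192.168.244.128,P=3306,u=pt_user,p=pt_pass,D=employees,t=departments --where "1=1" --bulk-delete --limit 1000 --commit-each --bulk-insert --dry-run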
--ignore
Archive data using INSERT IGNORE.
--no-delete
Do not delete the data from the source instance.
--replace
Archive data using REPLACE.
--[no]safe-auto-increment
When archiving a table with an auto-increment primary key, the row with the largest auto-increment value is not deleted by default.
This is mainly to work around the fact that, before MySQL 8.0, the auto-increment counter was not persisted.
Keep this in mind when archiving a whole table.
If that row also needs to be deleted, specify --no-safe-auto-increment.
--source
Specifies the connection information of the source instance.
Besides the common DSN options, it also supports the following:
a: specify the default database for the connection.
b: set SQL_LOG_BIN=0.
If it is specified on the source, DELETE operations are not written to the binlog.
If it is specified on the target, INSERT operations are not written to the binlog.
i: specify the index used by the archiving operation; the default is the primary key.
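For example (a sketch; I am assuming b takes a truthy value such as 1, and i=PRIMARY simply names the primary key index explicitly):
--source h=192.168.244.10,P=3306,u=pt_user,p=pt_pass,D=employees,t=departments,b=1,i=PRIMARY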
--progress
Print progress information; the value is the number of rows between progress lines.
For example, with --progress 10000, a progress line is printed every 10,000 archived (deleted) rows.
TIME                ELAPSED   COUNT
2022-03-06T18:24:19       0       0
2022-03-06T18:24:20       0   10000
2022-03-06T18:24:21       1   20000
The first column is the current time, the second column is the elapsed seconds, and the third column is the number of archived (deleted) rows.
10. Summary
Earlier, we compared the execution time of the archiving operation with different parameters.
Among them, --bulk-delete --limit 1000 --commit-each --bulk-insert is the fastest, and not specifying any batch parameters at all is the slowest.
But when using --bulk-insert, be careful: if something goes wrong during the import, pt-archiver will not report an error.
Common problems are primary key conflicts and data that does not match the data type of the target column.
If --bulk-insert is not used and data is archived with the default INSERT operation, most errors can be detected.
For example, a primary key conflict produces the following error:
DBD::mysql::st execute failed: Duplicate entry 'd001' for key 'PRIMARY' [for Statement "INSERT INTO `employees`.`departments`(`dept_no`,`dept_name`) VALUES (?,?)" with ParamValues: 0='d001', 1='Marketing'] at /usr/local/bin/pt-archiver line 6772.
Data that does not match the target column's data type produces the following error:
DBD::mysql::st execute failed: Incorrect integer value: 'Marketing' for column 'dept_name' at row 1 [for Statement "INSERT INTO `employees`.`departments`(`dept_no`,`dept_name`) VALUES (?,?)" with ParamValues: 0='d001', 1='Marketing'] at /usr/local/bin/pt-archiver line 6772.
Of course, the precondition for detecting mismatched data and types is that the archive instance's SQL_MODE is a strict mode.
If the instance being archived is MySQL 5.6, it is actually hard to enable strict SQL_MODE on the archive instance.
Because the default SQL_MODE of MySQL 5.6 is non-strict, a lot of invalid data is inevitably produced, such as 0000-00-00 00:00:00 in time fields.
Inserting such invalid data into an archive instance with strict mode enabled will simply fail.
From the perspective of data safety, the most recommended archiving approach is:
1) Archive first, but do not delete the data from the source instance.
2) Compare whether the data in the source instance and the archive instance are consistent.
3) If they are consistent, delete the archived data from the source instance.
Steps 1 and 3 can be done with pt-archiver, and step 2 with pt-table-sync.
Compared with archiving and deleting in a single pass, this approach is more cumbersome, but it is relatively safer. A sketch of the three steps is shown below.
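The sketch reuses the instances from the earlier examples; the --where condition is illustrative and the options should be adapted to your tables rather than copied verbatim:
# Step 1: archive without deleting
pt-archiver --source h=192.168.244.10,P=3306,u=pt_user,p=pt_pass,D=employees,t=departments --dest h=192.168.244.128,P=3306,u=pt_user,p=pt_pass,D=employees,t=departments --where "1=1" --limit 1000 --commit-each --no-delete
# Step 2: compare the source and the archive; empty output means the data is consistent
pt-table-sync --print h=192.168.244.10,P=3306,u=pt_user,p=pt_pass,D=employees,t=departments h=192.168.244.128,P=3306,u=pt_user,p=pt_pass,D=employees,t=departments
# Step 3: delete the already-archived rows from the source
pt-archiver --source h=192.168.244.10,P=3306,u=pt_user,p=pt_pass,D=employees,t=departments --where "1=1" --bulk-delete --limit 1000 --commit-each --purge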