Preface
The relay log is similar to the binary log: it is a set of files containing database change events, plus the related index and meta files; for the specifics, see the official documentation. After enabling the relay log for an upstream in DM, compared with not enabling it, there are the following advantages:
When the relay log is disabled, every subtask connects to the upstream database and pulls binlog data, which puts great pressure on the upstream database. When it is enabled, only one connection is created to pull binlog data to local storage, and each subtask reads the local relay log data.
The upstream database usually sets an expiration time for binlogs, or actively purges binlogs to reclaim space. When the relay log is disabled, if DM's synchronization progress falls behind and the binlog is purged, synchronization fails and a full migration has to be performed again. After the relay log is enabled, binlog data is pulled in real time and written locally regardless of the current synchronization progress, which effectively avoids this problem.
However, in DM versions <= v2.0.7, enabling the relay log has the following problems:
Data synchronization latency increases noticeably compared with not enabling the relay log. The table below shows the benchmark results of a single task; the average latency increases significantly. In the table, the items beginning with "." are latency percentiles.
CPU consumption increases after enabling relay. (Because of the higher latency, in some simple scenarios, such as only 1 task, resource usage is actually lower than without the relay log; but as the number of tasks grows, the CPU consumption with relay enabled increases.)
Because of the above problems, we made some performance optimizations to DM's relay log in the new version.
Current relay implementation
Before introducing the specific optimizations, let's briefly describe the current implementation of relay in DM. For the details, please refer to the DM source code reading series article (VI): the implementation of relay log; this article will not repeat them here.
Currently the relay module can be divided into two parts, the relay writer and the relay reader, structured as follows:

Relay writer
The relay writer does the following 3 things in turn:
Uses a binlog reader to read binlog events from the upstream MySQL/MariaDB;
Transforms the binlog events it has read with the binlog transformer;
Uses the binlog writer to store the transformed binlog events as relay log files in the local relay directory.
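As a very rough illustration of these three steps (not DM's actual code; every type and function name below is hypothetical), the writer can be pictured as a loop that chains reading, transforming, and writing:

```go
// Hypothetical sketch of the relay writer pipeline; the interfaces below are
// illustrative stand-ins, not DM's real types.
package relay

import "context"

// Event is a simplified stand-in for a binlog event.
type Event struct {
	Raw []byte
}

// binlogReader pulls binlog events from the upstream MySQL/MariaDB.
type binlogReader interface {
	ReadEvent(ctx context.Context) (Event, error)
}

// binlogTransformer converts an upstream event into the form stored locally.
type binlogTransformer interface {
	Transform(e Event) Event
}

// binlogWriter appends events to the relay log files in the relay directory.
type binlogWriter interface {
	WriteEvent(e Event) error
}

// runWriter chains the three steps described above: read -> transform -> write.
func runWriter(ctx context.Context, r binlogReader, t binlogTransformer, w binlogWriter) error {
	for {
		e, err := r.ReadEvent(ctx)
		if err != nil {
			return err
		}
		if err := w.WriteEvent(t.Transform(e)); err != nil {
			return err
		}
	}
}
```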
Relay reader
After relay is enabled, the Syncer obtains binlog events through the relay reader. The relay reader's main work is as follows:
Read the binlog files in the relay directory and send the events to the syncer;
When the end of a file is reached, periodically (currently every 100ms) check whether the size of the current binlog file and the content of the meta file have changed; if so, keep reading the current file (binlog file changed) or switch to a new file (meta file changed).
As the description above shows, the relay reader and the relay writer are independent of each other and interact only through the binlog, meta, and index files in the relay directory.
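To make the 100ms polling behavior concrete, here is a minimal, hypothetical sketch of such a read loop; the fileChecker interface stands in for DM's real file and meta checks:

```go
// Hypothetical sketch of the polling-based relay reader; not DM's real code.
package relay

import (
	"context"
	"time"
)

// checkInterval mirrors the 100ms periodic check described above.
const checkInterval = 100 * time.Millisecond

// fileChecker abstracts what the reader does against the relay directory.
type fileChecker interface {
	// ReadUntilEOF reads events from the current binlog file and sends them
	// to the syncer until the end of the file is reached.
	ReadUntilEOF(ctx context.Context) error
	// BinlogFileGrew reports whether the current binlog file got more data.
	BinlogFileGrew() bool
	// MetaChanged reports whether the meta file now points to a newer binlog file.
	MetaChanged() bool
	// SwitchToNextFile moves the reader to the next relay log file.
	SwitchToNextFile() error
}

// pollLoop shows the old behavior: block on a ticker at EOF instead of being
// notified by the writer.
func pollLoop(ctx context.Context, c fileChecker) error {
	ticker := time.NewTicker(checkInterval)
	defer ticker.Stop()

	for {
		if err := c.ReadUntilEOF(ctx); err != nil {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if c.BinlogFileGrew() {
				continue // keep reading the same file
			}
			if c.MetaChanged() {
				if err := c.SwitchToNextFile(); err != nil {
					return err
				}
			}
		}
	}
}
```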
Test environment description
Before introducing the optimizations, let's describe the environment they were tested in:
Upstream is MySQL, version 5.7.35-log;
Downstream is a single-instance TiDB, version 5.7.25-TiDB-v5.2.1;
DM uses 1 master and 1 worker;
The baseline version for the latency tests is the master branch as of 2021-10-14 (commit hash d2dc22d);
The baseline version for the CPU tests is the relay refactoring branch as of 2021-11-01 (commit hash 9f7ce1d6);
In the small-scale tests (<= 4 tasks), MySQL/TiDB/DM run on the same host for convenience; the host has 8 cores and 16 GB of memory;
In the large-scale tests (> 4 tasks), MySQL/TiDB/DM each run on their own host; each host has 8 cores and 16 GB of memory;
Migration latency is measured via a self-updating time column in the downstream; for details, see the third method in "Several ways to calculate the latency of DM synchronizing data to TiDB".
Latency optimization
From the "current relay implementation" section above, there are two points that may affect latency:
The current relay reader's periodic check itself has a certain impact on latency; in the worst case, a binlog event is delayed by at least 100ms before it can be synchronized to the downstream;
The relay writer writes to disk and the relay reader then reads from disk; does this write-then-read round trip have a large impact on latency?
Investigation showed that Linux (macOS has a similar mechanism) has a page cache: reading a recently written file does not go through the disk but reads the OS cache in memory, so in theory the impact is limited.
After investigating and testing the above issues, we came up with two schemes:
Cache the binlog recently written by the relay writer in memory; if the relay reader requests this range of binlog, it reads directly from memory;
The relay reader keeps reading from files, and the relay writer notifies the relay reader whenever it writes a new event.
Scheme 1 needs to switch between reading from memory and reading from files depending on the downstream write speed, so the implementation is relatively complex; moreover, because of the OS page cache, adding another cache layer in the application has limited impact on latency.
For scheme 2 we did some preliminary tests: after increasing the frequency of the relay reader's checks, the latency with relay enabled is basically the same as without relay. We also looked at MySQL's relay log and found that it is read from files as well, so we chose scheme 2.
The implementation is relatively simple: add a Listener to the relay writer; whenever a new binlog event is written, notify the Listener (by sending a message into a channel); then in the relay reader, replace the periodic check with waiting for messages on that channel.
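Below is a minimal sketch of what such a Listener could look like (hypothetical names, not DM's real types): the writer performs a non-blocking send on a buffered channel after each event, and the reader blocks on that channel instead of on a 100ms ticker.

```go
// Hypothetical sketch of the channel-based notification; not DM's real code.
package relay

import "context"

// Listener is notified by the relay writer after each new binlog event.
type Listener interface {
	OnEvent()
}

// chanListener coalesces notifications into a 1-slot buffered channel, so a
// slow reader can never block the writer.
type chanListener struct {
	ch chan struct{}
}

func newChanListener() *chanListener {
	return &chanListener{ch: make(chan struct{}, 1)}
}

// OnEvent is called by the writer; the non-blocking send drops the signal if
// one is already pending.
func (l *chanListener) OnEvent() {
	select {
	case l.ch <- struct{}{}:
	default:
	}
}

// waitForNewEvent replaces the periodic check on the reader side: it blocks
// until the writer reports a new event or the context is cancelled.
func (l *chanListener) waitForNewEvent(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-l.ch:
		return nil
	}
}
```

The non-blocking, coalesced send matters here: the reader only needs to know "there is something new", not how many events arrived, so the writer never waits for the reader.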
The figure below shows the latency results of the test with 4 tables and 4 tasks. As can be seen, the latency with relay enabled is now very close to that without relay:

CPU optimization
After the latency optimization, we also measured the CPU usage with the relay log enabled and found that it was still high. The figure below shows the test results with 4 tables and 4 tasks (note: unless otherwise specified, subsequent test results are from this scenario). As can be seen, enabling relay increases CPU consumption significantly, and the spikes become larger:

We took a CPU profile with Go's built-in pprof. As the figure below shows, the major consumers are the syncer, relay reader, and relay writer. After comparing with the code logic, we found:
The relay reader uses go-mysql's ParseFile interface, which reopens the file on every call and reads the first FORMAT_DESCRIPTION event, i.e. the position of the first blue mark in the figure below;
When optimizing latency, because the relay reader and writer were independent of each other, to simplify the implementation we only notified through the channel that a new binlog had been written; that newly written binlog may already have been consumed by the previous read, which leads to many invalid checks of the meta and index files.
For the above problems, we made the following optimizations:
Use go-mysql's ParseReader to eliminate the cost of repeatedly opening and re-reading files;
Refactor the relay module to integrate the relay writer and reader so that the two can communicate directly. After the relay reader receives a notification through the channel, it checks whether the file the relay writer is writing is the same file it is reading, i.e. whether the file is in active write state, and obtains the position the writer has written up to. With this information, invalid checks of the meta and index files can be avoided.
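A rough sketch of these two changes is shown below. It assumes go-mysql's replication package (BinlogParser, ParseReader, OnEventFunc) behaves as its documentation describes; the writerState structure and all other names are hypothetical and only illustrate the "active file" check.

```go
// Hypothetical sketch: reuse one parser and one open file handle, and use
// state shared by the integrated relay writer instead of re-checking the
// meta/index files. Not DM's real code.
package relay

import (
	"io"
	"os"
	"sync"

	"github.com/go-mysql-org/go-mysql/replication"
)

// writerState is the information the relay writer shares with the reader
// after the refactoring (hypothetical structure).
type writerState struct {
	mu         sync.Mutex
	activeFile string // relay log file currently being written
	writtenPos int64  // position the writer has flushed up to
}

func (s *writerState) snapshot() (file string, pos int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.activeFile, s.writtenPos
}

// fileReader keeps the parser and file handle open across notifications, so
// the FORMAT_DESCRIPTION event is only parsed once per file (unlike calling
// ParseFile on every notification).
type fileReader struct {
	parser  *replication.BinlogParser
	file    *os.File
	name    string
	readPos int64
}

// hasNewData answers a notification using the shared writer state: if the
// writer has moved on to another file, the reader must switch; otherwise
// compare positions to see whether there is anything new to read.
func (r *fileReader) hasNewData(w *writerState) (newData, switchFile bool) {
	active, written := w.snapshot()
	if active != r.name {
		return false, true
	}
	return written > r.readPos, false
}

// readUpTo continues parsing from the current offset of the open file, but
// only up to the position the writer has flushed, so a half-written event is
// never parsed. (Assumes ParseReader consumes events until the reader is
// exhausted.)
func (r *fileReader) readUpTo(writtenPos int64, onEvent replication.OnEventFunc) error {
	lr := io.LimitReader(r.file, writtenPos-r.readPos)
	if err := r.parser.ParseReader(lr, onEvent); err != nil {
		return err
	}
	r.readPos = writtenPos
	return nil
}
```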
As the figure below shows, CPU usage drops significantly after this optimization, but the spikes are still large:

Because the rate at which we generate write events with sysbench is fairly stable and DM has no specially scheduled code of its own, and Go is a compiled language with GC, we guessed that the spikes mainly come from GC. However, this is not obvious from the CPU profile, see the figure below:

Enabling the GC log with GODEBUG=gctrace=1 (see the figure below), we can find:
After enabling relay, the resident memory nearly doubles (239 MB -> 449 MB), and the total heap space also nearly doubles.
After investigation, this is caused by a memory leak in the tidb embedded in DM and has not been dealt with yet.
After enabling relay, the CPU spent on GC increases substantially, especially background GC time and idle GC time.
The following figure is the flame graph of the alloc_space part of the heap profile after the optimizations above:
Note: pprof's heap profile covers the whole run of the program so far, not a specific period of time, so allocations of resident memory can also be seen in the figure below.

Through the heap profile and a comparison with the code, we found the following points that can be optimized:
When go-mysql parses binlog events from a file, it allocates a new bytes.Buffer for every event and keeps growing it while reading. After optimization, a buffer pool is used, reducing the overhead of repeated allocation and growth (see the sketch below);
The local streamer used time.After for heartbeat events. This interface keeps the code concise, but the channel it creates is only released when the timer fires; in addition, reading events from the local streamer is a high-frequency call, and creating a timer and channel on every call is also expensive. After optimization, a single timer is reused;
Relay used a timeout context when reading events from upstream, which creates an extra channel on every read, and in the current scenario the timeout is unnecessary. After optimization, the timeout context is removed;
The relay reader and relay writer did not check the log level before writing debug logs, so Field objects were created on every call; although they end up unused, these calls are so frequent that they still bring some overhead. After optimization, a log level check is added before the debug logs on high-frequency paths.
Note: DM uses zap for logging. This logger performs well, and for calls that are not on hot paths, calling log.Debug directly is generally fine.
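The buffer pool, timer reuse, and level check are all standard Go patterns; the condensed sketch below shows one way each could look. The zap calls come from go.uber.org/zap; everything else is illustrative rather than DM's actual code.

```go
// Illustrative sketches of the three optimizations above; not DM's real code.
package relay

import (
	"bytes"
	"sync"
	"time"

	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// 1. Buffer pool: reuse bytes.Buffer instead of allocating one per event.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

func withBuffer(f func(buf *bytes.Buffer)) {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	f(buf)
	bufPool.Put(buf)
}

// 2. Timer reuse: instead of time.After (which allocates a new timer and
// channel on every call and only frees them when the timer fires), keep one
// timer and reset it before each wait.
func waitHeartbeat(t *time.Timer, interval time.Duration, events <-chan struct{}) bool {
	if !t.Stop() {
		select {
		case <-t.C: // drain a fired timer before reuse
		default:
		}
	}
	t.Reset(interval)
	select {
	case <-events:
		return true // got a real event
	case <-t.C:
		return false // timed out, emit a heartbeat instead
	}
}

// 3. Check the log level before building zap fields on hot paths, so Field
// objects are not constructed when debug logging is disabled.
func logEventDebug(l *zap.Logger, file string, pos uint32) {
	if l.Core().Enabled(zapcore.DebugLevel) {
		l.Debug("forward binlog event", zap.String("file", file), zap.Uint32("pos", pos))
	}
}
```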
The results after these optimizations are shown below. As can be seen, CPU usage is much lower and the spikes are also much smaller:

The figure below shows the test results in the scenario with 20 tables and 20 tasks:

Remaining problems & future work
After the above optimizations, the latency gap between enabling and not enabling relay is already small, and the CPU increase is also at a relatively low level, but there are still some points that can be optimized and are expected to be addressed gradually in subsequent versions:
go-mysql uses io.CopyN to read files. This function allocates a small object on each call; under high-frequency use this still has some effect on GC, but not much, so it was not changed this time;
Some optimizations that benefit both relay and non-relay modes were not done this time, such as the timeout context created when the streamer reads events;
Having multiple readers read the same file still has a lot of overhead. Possible directions for further optimization:
Reduce the number of reads, for example by having one reader read from the file and the other readers read from memory, or by adding a memory cache as envisaged before;
Merge tasks that share the same downstream to reduce the number of tasks.