CDC (change data capture technology), a powerful tool for real-time database synchronization
2022-07-07 08:07:00 【Merrill Lynch data tempodata】
In the data ETL process, we often schedule periodic jobs that synchronize business data to the data warehouse on a T+1 basis, analyze and process it there, and finally present it to end users in BI reports. This approach has poor real-time performance, however: users can usually see only yesterday's data, which hurts the timeliness of their decisions.
If users want to view reports in near real time, the scheduling frequency has to be raised to hourly or even every few minutes, which puts the whole data analysis system under heavy strain. Moreover, the process above only suits data that is continuously appended; when business data is modified or deleted, the only option is a full overwrite on every synchronization. When a synchronization scenario demands high timeliness and historical data can change, we can use change data capture to synchronize data in real time.
What is change data capture?
Change data capture (CDC) means identifying and capturing the changes made to data in a database (including inserts, updates, and deletes of data or tables), recording those changes completely in the order in which they occurred, and delivering them in real time to downstream processes or systems through message middleware. In this way, CDC provides efficient, low-latency data transfer to the data warehouse, so that information can be transformed promptly and delivered to analytical applications.
What are the advantages of CDC?
CDC is well suited to synchronizing all kinds of time-sensitive data. It offers the following benefits:
Data is loaded incrementally, or changes are streamed in real time, so there is no need for periodically scheduled batch load and update jobs.
Because CDC transfers data in real time, it enables zero-downtime database migration and supports real-time analytics, helping users make faster and more accurate decisions based on the latest data.
CDC minimizes the network traffic needed for data transfer, which makes it suitable for transfer across a WAN.
CDC keeps the data in multiple systems synchronized.
What are the use cases for CDC?
CDC has a wide range of use cases, including:
Data distribution: distributing data from one source to multiple downstream business systems, often used for business decoupling and microservice systems.
Data collection: ETL data integration for data warehouses and data lakes, which eliminates data silos and prepares data for later analysis.
Data synchronization: often used for data backup, disaster recovery, and similar purposes.
Common change data capture methods
Query-based CDC
With this method, you repeatedly query the source table to obtain the changed records, and the query relies on certain columns to decide which rows have changed. Timestamp columns and auto-increment columns are the most common: a creation-time column and a modification-time column mark inserted and changed records, and an auto-increment column also makes newly inserted records easy to identify.
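The polling idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses an in-memory SQLite database so that it is self-contained, and the table `person`, the column `updated_at` (a plain integer standing in for a modification timestamp), and the helper `fetch_changes` are all hypothetical names.

```python
import sqlite3

def fetch_changes(conn, last_sync):
    """Return rows whose updated_at is later than the last synchronized value."""
    cur = conn.execute(
        "SELECT id, name, age, updated_at FROM person "
        "WHERE updated_at > ? ORDER BY updated_at, id",
        (last_sync,),
    )
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    "id INTEGER PRIMARY KEY, name TEXT, age INTEGER, updated_at INTEGER)"
)
conn.execute("INSERT INTO person VALUES (1, 'Xiao Ming', 18, 100)")
conn.execute("INSERT INTO person VALUES (2, 'Li Hua', 32, 110)")

# First poll: every row is newer than the initial watermark 0.
first = fetch_changes(conn, 0)
watermark = max(row[3] for row in first)

# The source system then updates one row and deletes another.
conn.execute("UPDATE person SET age = 20, updated_at = 120 WHERE id = 1")
conn.execute("DELETE FROM person WHERE id = 2")

# Second poll: the update is picked up, but the deleted row simply
# no longer exists, so the query cannot report it.
second = fetch_changes(conn, watermark)
print(second)
```

The second poll returns only the updated row. The deleted row leaves no trace in the result, which is the main limitation of polling on a timestamp or auto-increment column.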
Trigger-based CDC
With this method, when the business system executes an INSERT, UPDATE, or DELETE statement, it fires a database trigger that captures the change to the record and saves it in a temporary table. The changed data is then extracted from that temporary table into the data warehouse.
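A minimal sketch of the trigger approach, again using in-memory SQLite so the example runs on its own. The staging table `person_changes` and the trigger names are hypothetical; a production database would use its own trigger syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);

-- Hypothetical staging table that the triggers write into.
CREATE TABLE person_changes (
    seq  INTEGER PRIMARY KEY AUTOINCREMENT,
    op   TEXT,
    id   INTEGER,
    name TEXT,
    age  INTEGER
);

CREATE TRIGGER person_ins AFTER INSERT ON person BEGIN
    INSERT INTO person_changes (op, id, name, age)
    VALUES ('INSERT', NEW.id, NEW.name, NEW.age);
END;

CREATE TRIGGER person_upd AFTER UPDATE ON person BEGIN
    INSERT INTO person_changes (op, id, name, age)
    VALUES ('UPDATE', NEW.id, NEW.name, NEW.age);
END;

CREATE TRIGGER person_del AFTER DELETE ON person BEGIN
    INSERT INTO person_changes (op, id, name, age)
    VALUES ('DELETE', OLD.id, OLD.name, OLD.age);
END;
""")

# Ordinary business statements; the triggers record each change.
conn.execute("INSERT INTO person VALUES (1, 'Xiao Ming', 18)")
conn.execute("UPDATE person SET age = 20 WHERE id = 1")
conn.execute("DELETE FROM person WHERE id = 1")

# A downstream job would read the staging table in sequence order.
changes = conn.execute(
    "SELECT op, id, name, age FROM person_changes ORDER BY seq"
).fetchall()
print(changes)
```

Every business statement now causes an extra write to the staging table, which is the source of the overhead and invasiveness associated with this method.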
Snapshot-based CDC
If neither triggers nor additional query columns are allowed, you can capture changes with snapshot tables or similar means. The idea is to compare the source table with a snapshot table to derive the changes; a snapshot comparison can detect inserted, updated, and deleted records.
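The comparison itself can be sketched as a dictionary diff. This assumes each snapshot fits in memory and is keyed by primary key; the function name `diff_snapshots` and the sample rows are illustrative only.

```python
def diff_snapshots(old, new):
    """Compare two snapshots, each a dict mapping primary key -> row."""
    inserted = [new[k] for k in new if k not in old]
    deleted = [old[k] for k in old if k not in new]
    # An update is reported as a (before, after) pair.
    updated = [(old[k], new[k]) for k in new if k in old and old[k] != new[k]]
    return inserted, updated, deleted

old_snapshot = {
    1: (1, "Xiao Ming", 18),
    2: (2, "Li Hua", 32),
}
new_snapshot = {
    1: (1, "Xiao Ming", 20),
    3: (3, "Lili", 8),
}

inserted, updated, deleted = diff_snapshots(old_snapshot, new_snapshot)
print(inserted, updated, deleted)
```

A diff like this only sees the net difference between two snapshots, and both snapshots have to be kept around, which is where the storage cost comes from.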
Log-based CDC
After a DML operation (INSERT, UPDATE, DELETE) completes on a table, the database records it in its log file in real time. By parsing the database operation log, inserts, updates, and deletes can all be captured and sent to downstream systems.
Of the four CDC implementations above, log-based CDC is the best choice: it is timely, non-invasive, and can capture all changes. If the database log files cannot be obtained and parsed, choose one of the other three methods.
Although the snapshot-based method can capture all change records, its obvious drawbacks are that it needs a lot of storage space for snapshot data and that its real-time performance is poor.
The trigger-based method is invasive because triggers must be added and the changed data is then written more than once. The query-based method requires adding time columns or auto-increment columns to the tables, which is highly invasive, and it cannot capture delete operations, so it is rarely used.
CDC changelog stream
We now have a preliminary understanding of CDC. Its core idea is to capture and identify data changes and send them to downstream systems. The form in which those changes are sent downstream is the CDC changelog stream.
A CDC program parses insert, update, and delete operations and transforms them into unified, standardized change messages that it passes to downstream systems. These message streams carry four message semantics: INSERT (+I), UPDATE_BEFORE (-U), UPDATE_AFTER (+U), and DELETE (-D):
INSERT (+I): a newly inserted row
UPDATE_BEFORE (-U): the content of a row before it was updated
UPDATE_AFTER (+U): the content of a row after it was updated
DELETE (-D): a deleted row
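As a rough sketch of that transformation, the function below maps one parsed operation to changelog messages. The input event shape (a dict with `op`, `before`, and `after`) is an assumption made for illustration, not the format of any particular CDC tool, and `to_changelog` is a hypothetical name.

```python
def to_changelog(event):
    """Convert one parsed operation into (row_kind, row) changelog messages."""
    op = event["op"]
    before = event.get("before")
    after = event.get("after")
    if op == "insert":
        return [("+I", after)]
    if op == "update":
        # An update is emitted as a pair: the old image, then the new image.
        return [("-U", before), ("+U", after)]
    if op == "delete":
        return [("-D", before)]
    raise ValueError(f"unknown operation: {op!r}")

# Hypothetical sample event.
update_event = {
    "op": "update",
    "before": {"id": 7, "status": "new"},
    "after": {"id": 7, "status": "paid"},
}
messages = to_changelog(update_event)
print(messages)
```

Emitting both the before and after image for an update lets a downstream consumer retract the old row and apply the new one without having to look up prior state.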
Let's take the changes to a personnel table as an example to explain the CDC changelog stream. The table has columns such as personnel ID (id), name (name), and age (age), and the following insert, update, and delete transactions are applied to it:
1. Insert a person into the personnel table with ID 1, name Xiao Ming, age 18.
2. Insert another person into the personnel table with ID 2, name Li Hua, age 32.
3. Change Xiao Ming's age to 20.
4. Delete Li Hua from the personnel table.
5. Finally, insert a person into the personnel table with ID 3, name Lili, age 8.
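The CDC changelog stream for the personnel table above is as follows:
message | id | name | age |
+I | 1 | Xiao Ming | 18 |
+I | 2 | Li Hua | 32 |
-U | 1 | Xiao Ming | 18 |
+U | 1 | Xiao Ming | 20 |
-D | 2 | Li Hua | 32 |
+I | 3 | Lili | 8 |
The final personnel table data is as follows:
id | name | age |
1 | Xiao Ming | 20 |
3 | Lili | 8 |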
Besides the representation used in the example above, the changelog stream can be expressed in other formats, as long as they accurately convey the four change message semantics.
A CDC changelog stream can record every data change to a whole table. That lets us replay the change stream, stop at any position, and restore the data of the CDC table to any point in time, which is more reliable and more space-efficient than scheduled backups.
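Replaying a changelog can be sketched as follows. The messages are the ones implied by the five personnel-table operations in the example; the tuple representation, the `replay` function, and its `upto` parameter are assumptions made for illustration.

```python
# Changelog implied by the five operations on the personnel table.
CHANGELOG = [
    ("+I", (1, "Xiao Ming", 18)),
    ("+I", (2, "Li Hua", 32)),
    ("-U", (1, "Xiao Ming", 18)),
    ("+U", (1, "Xiao Ming", 20)),
    ("-D", (2, "Li Hua", 32)),
    ("+I", (3, "Lili", 8)),
]

def replay(changelog, upto=None):
    """Rebuild the table from the first `upto` messages (all if None)."""
    table = {}
    for kind, row in changelog[:upto]:
        key = row[0]  # the id column acts as the primary key
        if kind in ("+I", "+U"):
            table[key] = row
        elif kind in ("-U", "-D"):
            table.pop(key, None)
        else:
            raise ValueError(f"unknown row kind: {kind!r}")
    return sorted(table.values())

final_table = replay(CHANGELOG)
after_two_inserts = replay(CHANGELOG, upto=2)
print(final_table)
print(after_two_inserts)
```

Replaying the whole stream yields the final table (Xiao Ming aged 20, and Lili), while stopping after the first two messages yields the table as it stood after the two initial inserts. Stopping between a -U and its matching +U gives a transient state in which the row is absent, so a real consumer would normally stop on transaction boundaries.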
How to do CDC in the Tempo DF data factory
Tempo data factory is a big data development platform with five core capabilities: massive data integration, real-time data processing, offline data processing, custom component extension, and integrated monitoring and operations. It lowers the cost of integrating multi-source heterogeneous data for enterprise users, enables full-pipeline data development, and lets data deliver more of its value.
On the Tempo data factory platform, users can quickly configure a real-time CDC flow for business data in a self-service way by dragging and dropping, carry out subsequent computation and processing, and finally write the data to the target. A complete CDC business data flow is shown in the figure below:
Drag the MySQL CDC node from the input nodes on the left onto the canvas on the right and double-click it to open the node configuration panel. Select the configured MySQL data source and the database tables that need CDC; the node automatically reads the column information of the tables. Finally, click the Apply button in the upper right corner, and the CDC input node for the MySQL source table is configured. The configuration is as follows:
Tempo data factory currently supports CDC for the following databases:
Database | Version |
Oracle | 12c, 19c, 21c |
MySQL | 5.7, 8.0.x |
PostgreSQL | 10, 11, 12, 13, 14 (plugins: decoderbufs, pgoutput) |
SQL Server (incubating) | 2017, 2019 |
Db2 (incubating) | 11.5 |
If you want to improve data timeliness and make data changes easier to handle in your actual business data analysis, try using CDC for real-time data synchronization. Tempo data factory lets you put it into practice faster.