CDC (change data capture technology), a powerful tool for real-time database synchronization
2022-07-07 08:07:00 【Merrill Lynch data tempodata】
In ETL pipelines, business data is usually synchronized to the data warehouse on a periodic T+1 schedule, analyzed and processed there, and finally presented to end users through BI reports. This approach has poor real-time performance: users can typically see only yesterday's data, which hurts the timeliness of their decisions.
If users want to view reports in near real time, the scheduling interval has to be shortened to hours or even minutes, which puts heavy pressure on the whole data analysis system. Moreover, this process only works well for data that is continuously appended; when business data is modified or deleted, the only option is a full overwrite on every run. When timeliness requirements are high and historical data can change, change data capture can be used to synchronize data in real time.
What is change data capture?
Change data capture (CDC) means identifying and capturing the changes made to data in a database (inserts, updates, and deletes of data or tables), recording those changes completely and in the order in which they occurred, and delivering them in real time to downstream processes or systems through message middleware. In this way, CDC provides efficient, low-latency data transfer to the data warehouse, so that information can be transformed promptly and delivered to analytics applications.
What are the advantages of CDC?
CDC is well suited to synchronizing all kinds of time-sensitive data, and it brings the following benefits:
Data is loaded incrementally, or changes are streamed in real time, so there is no need to schedule periodic batch load and update jobs.
Real-time synchronization makes zero-downtime database migration easier and supports real-time analytics, helping users make faster and more accurate decisions based on the latest data.
CDC minimizes the network traffic needed to transfer data, which makes it suitable for transfers across a WAN.
CDC can keep the data in multiple systems synchronized.
What are the use cases for CDC?
CDC applies to a wide range of scenarios, including:
Data distribution: distributing data from one source to multiple downstream business systems, often used for business decoupling and in microservice architectures.
Data collection: ETL data integration for data warehouses and data lakes, eliminating data silos and supporting subsequent analysis.
Data synchronization: often used for data backup, disaster recovery, and similar purposes.
Common change data capture methods
Query-based CDC
With this method, the source table is queried repeatedly to fetch the records that have changed. The query relies on certain columns to decide which rows changed, most commonly timestamp columns and auto-increment sequence columns: a creation-time column and a modification-time column mark inserted and modified records, and an auto-increment column also makes newly inserted records easy to identify.
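A minimal sketch of the polling idea, using Python's built-in sqlite3 module. The `person` table, the `updated_at` column (a plain integer tick instead of a real timestamp, to keep the demo deterministic), and the `poll_changes` helper are all assumptions made for illustration, not part of any particular product.

```python
import sqlite3

# Hypothetical source table: "updated_at" plays the role of the
# modification-time column that the polling query relies on.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, updated_at INTEGER)"
)

def poll_changes(conn, last_seen):
    """Return the rows modified after the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, age, updated_at FROM person "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    new_watermark = rows[-1][3] if rows else last_seen
    return rows, new_watermark

conn.execute("INSERT INTO person VALUES (1, 'Xiao Ming', 18, 1)")
conn.execute("INSERT INTO person VALUES (2, 'Li Hua', 32, 2)")
first_batch, watermark = poll_changes(conn, 0)

conn.execute("UPDATE person SET age = 20, updated_at = 3 WHERE id = 1")
conn.execute("DELETE FROM person WHERE id = 2")  # leaves no row behind to query
second_batch, watermark = poll_changes(conn, watermark)
```

In this sketch the second poll returns only the updated row for ID 1. The deleted row simply disappears from the table, so a query of this kind has nothing to report for it.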
Trigger-based CDC
With this method, when the business system executes INSERT, UPDATE, or DELETE statements, database triggers fire, capture the changes to the data records, and save them in a temporary table. The changed data is then extracted from the temporary table into the data warehouse.
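The trigger mechanism can be sketched with SQLite, again through Python's sqlite3 module. The `person` table, the `person_changes` table that stands in for the temporary table, and the trigger names are hypothetical; a production database would use its own trigger syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
-- Hypothetical change table that the triggers write captured changes into.
CREATE TABLE person_changes (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT, id INTEGER, name TEXT, age INTEGER
);
CREATE TRIGGER person_ins AFTER INSERT ON person BEGIN
    INSERT INTO person_changes (op, id, name, age)
    VALUES ('INSERT', NEW.id, NEW.name, NEW.age);
END;
CREATE TRIGGER person_upd AFTER UPDATE ON person BEGIN
    INSERT INTO person_changes (op, id, name, age)
    VALUES ('UPDATE', NEW.id, NEW.name, NEW.age);
END;
CREATE TRIGGER person_del AFTER DELETE ON person BEGIN
    INSERT INTO person_changes (op, id, name, age)
    VALUES ('DELETE', OLD.id, OLD.name, OLD.age);
END;
""")

# Ordinary business statements; the triggers record each change as a side effect.
conn.execute("INSERT INTO person VALUES (1, 'Xiao Ming', 18)")
conn.execute("UPDATE person SET age = 20 WHERE id = 1")
conn.execute("DELETE FROM person WHERE id = 1")

# The extraction step reads the captured changes in the order they occurred.
captured = conn.execute(
    "SELECT op, id, name, age FROM person_changes ORDER BY seq"
).fetchall()
```

Every business statement here causes a second write into `person_changes`, which is the extra write cost that comes with this approach.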
Snapshot-based CDC
If neither triggers nor additional query columns are allowed, change data can be captured with snapshot tables or similar means. The idea is to compare the source table with a snapshot table to obtain the changes; snapshots can detect inserted, updated, and deleted data records.
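The comparison step might look like the following sketch, which assumes each snapshot fits in memory as a dictionary keyed by primary key. The `diff_snapshots` function and the sample rows are illustrative only.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots keyed by primary key and classify the differences."""
    inserted = {k: current[k] for k in current.keys() - previous.keys()}
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    updated = {
        k: (previous[k], current[k])
        for k in previous.keys() & current.keys()
        if previous[k] != current[k]
    }
    return inserted, updated, deleted

# Snapshot taken earlier versus the source table as it looks now.
previous = {1: ("Xiao Ming", 18), 2: ("Li Hua", 32)}
current = {1: ("Xiao Ming", 20), 3: ("Lili", 8)}

inserted, updated, deleted = diff_snapshots(previous, current)
```

Both snapshots have to be kept around for the comparison, which is where the storage cost of this approach comes from.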
Log-based CDC
After a DML operation (insert, update, delete) completes on a table, the database records it in its log file in real time. By parsing the database operation log, inserted, updated, and deleted data changes can be captured and sent to downstream systems.
Of the four CDC approaches above, log-based CDC is the best: it is timely, non-invasive, and captures all changes. If the database log files cannot be obtained and parsed, one of the other three approaches can be used instead.
The snapshot-based approach can capture all changed records, but its obvious drawbacks are that it needs a lot of storage space to hold snapshot data and that its real-time performance is poor.
The trigger-based approach is invasive, because triggers have to be added and the changed data is written more than once. The query-based approach requires adding time columns or auto-increment columns to the tables, which is highly invasive, and it cannot capture delete operations, so it is rarely used.
CDC change log stream
We now have a preliminary understanding of CDC. Its core idea is to capture and identify data changes and send them to downstream systems. The form in which those changes are sent downstream is the CDC change log stream.
A CDC program parses insert, update, and delete operations and converts them into unified, standardized change messages that are passed to downstream systems. The message stream carries four message-state semantics: INSERT (+I), UPDATE_BEFORE (-U), UPDATE_AFTER (+U), and DELETE (-D):
INSERT (+I): a newly inserted row
UPDATE_BEFORE (-U): the row as it was before the update
UPDATE_AFTER (+U): the row as it is after the update
DELETE (-D): a deleted row
Let's take the changes to a personnel table as an example to explain the CDC change log stream. The table has columns for personnel ID (id), name (name), age (age), and so on, and the following insert, update, and delete transactions are performed on it:
1. Insert a record into the personnel table: ID 1, name Xiao Ming, age 18.
2. Insert another record into the personnel table: ID 2, name Li Hua, age 32.
3. Change Xiao Ming's age to 20.
4. Delete Li Hua from the personnel table.
5. Finally, insert a record into the personnel table: ID 3, name Lili, age 8.
The CDC change log stream for the personnel table above is as follows:
The final data in the personnel table is as follows:
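The original illustrations are not reproduced here, so the following sketch derives both results from the five steps listed above. The tuple layout `(kind, id, name, age)` is an assumption chosen for illustration; the four kinds are the +I, -U, +U, and -D semantics described earlier.

```python
# The five operations from the example, as (operation, id, name, age).
operations = [
    ("insert", 1, "Xiao Ming", 18),
    ("insert", 2, "Li Hua", 32),
    ("update", 1, "Xiao Ming", 20),
    ("delete", 2, None, None),
    ("insert", 3, "Lili", 8),
]

table = {}      # current state of the personnel table, keyed by id
changelog = []  # emitted change messages: (kind, id, name, age)

for op, pid, name, age in operations:
    if op == "insert":
        table[pid] = (name, age)
        changelog.append(("+I", pid, name, age))
    elif op == "update":
        changelog.append(("-U", pid, *table[pid]))  # row image before the update
        table[pid] = (name, age)
        changelog.append(("+U", pid, name, age))    # row image after the update
    elif op == "delete":
        changelog.append(("-D", pid, *table.pop(pid)))  # last image of the deleted row

for message in changelog:
    print(message)
print(table)
```

Under these assumptions the five operations produce six messages, because the single update is split into a -U message and a +U message, and the table ends up holding the rows for ID 1 (age 20) and ID 3.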
Besides the representation used in the example above, the change log stream can be expressed in other formats, as long as the four change-message semantics are described accurately.
Because the CDC change log stream records every change to the whole table, we can replay it, stop at any position, and restore the table to its state at any point in time. This is more reliable and more space-efficient than scheduled backups.
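Replaying the stream up to a chosen position can be sketched as follows. The `restore` function and the message layout `(kind, id, name, age)` are hypothetical, and the log is the one implied by the personnel example.

```python
def restore(changelog, position):
    """Rebuild the table by applying the first `position` change messages."""
    table = {}
    for kind, pid, name, age in changelog[:position]:
        if kind in ("+I", "+U"):
            table[pid] = (name, age)   # row appears with its new image
        elif kind in ("-U", "-D"):
            table.pop(pid, None)       # old image is retracted
    return table

changelog = [
    ("+I", 1, "Xiao Ming", 18),
    ("+I", 2, "Li Hua", 32),
    ("-U", 1, "Xiao Ming", 18),
    ("+U", 1, "Xiao Ming", 20),
    ("-D", 2, "Li Hua", 32),
    ("+I", 3, "Lili", 8),
]

after_two_inserts = restore(changelog, 2)
after_update = restore(changelog, 4)
final_state = restore(changelog, len(changelog))
```

One caveat in this sketch: stopping between a -U message and its matching +U message (position 3 here) yields a table in which the row is temporarily missing, so stop positions are best chosen on the boundary of a complete change.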
How to do CDC in the Tempo DF data factory
Tempo Data Factory is a big data development platform with five core capabilities: massive data integration, real-time data processing, offline data processing, custom component extension, and integrated monitoring and operations. It reduces the cost of integrating multi-source heterogeneous data for enterprise users, enables full-link data development, and lets data deliver more of its value.
On the Tempo Data Factory platform, users can quickly configure a real-time, self-service CDC flow for business data by dragging and dropping, carry out subsequent computation and processing, and finally write the data to the target. A complete CDC business data flow is shown in the picture below:
Drag the MySQL CDC node from the input nodes on the left onto the canvas on the right, and double-click it to open the node configuration panel. Select the configured MySQL data source and the database tables that need CDC; the node reads the tables' column information automatically. Finally, click the Apply button in the upper right corner, and the CDC input node for the MySQL source table is configured. The configuration is as follows:
Tempo Data Factory currently supports CDC for the following databases:
Database | Version |
Oracle | 12c, 19c, 21c |
MySQL | 5.7, 8.0.x |
PostgreSQL | 10, 11, 12, 13, 14 (plugins: decoderbufs, pgoutput) |
SQL Server (incubating) | 2017, 2019 |
Db2 (incubating) | 11.5 |
If you want to improve data timeliness and make data changes easier to handle in your actual business data analysis, you can try CDC for real-time data synchronization, and Tempo Data Factory lets you put it into practice faster.