The Practice of Flink CDC at Dajian Cloud Warehouse
2022-06-11 18:58:00 【Apache Flink】
This article is based on a talk given by Gong Zhongqiang, head of infrastructure at Dajian Cloud Warehouse and Flink CDC Maintainer, at the Flink CDC Meetup on May 21. The main contents include:
- Background of introducing Flink CDC
- Current internal business scenarios
- Future internal promotion and platform construction
- Community cooperation
1. Background of Introducing Flink CDC

The company introduced CDC technology mainly to meet the needs of the following four roles:
- Logistics scientists: need inventory, sales order, logistics bill, and other data for analysis.
- Developers: need to synchronize basic information from other business systems.
- Finance: hope financial data can reach the financial system in real time, instead of waiting until the end of the month.
- Management: need a big-data dashboard to view the company's business and operations at a glance.

CDC (Change Data Capture) is a technology for capturing data changes. Broadly speaking, any technique that can capture data changes can be called CDC, but the CDC we usually talk about is mainly oriented to database changes.
There are two main ways to implement CDC: query-based and log-based:
- Query-based: changes are captured by querying the database, so no special database configuration or account permissions are required. Its timeliness depends on the query frequency — real-time behavior can only be approached by querying more often, which inevitably puts heavy pressure on the DB. Moreover, because it is query-based, it cannot capture change records that occur between two queries, so data consistency cannot be guaranteed.
- Log-based: consumes the database's change log in real time, so latency is very low and the impact on the DB is small; data consistency is also guaranteed, because the database records every change in its log, and by consuming the log the full change history of the data can be known. Its drawback is implementation complexity: each database implements its change log differently — format, enablement, and required privileges all vary — so adapter development is needed per database.
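As a concrete illustration, a query-based CDC pipeline typically polls with a timestamp filter (the table and column names here are hypothetical, not an actual schema):

```sql
-- Query-based CDC: poll for rows changed since the last sync.
-- A row updated twice between two polls surfaces only once, and hard
-- deletes are missed entirely -- hence the consistency gap noted above.
SELECT order_id, status, update_time
FROM   sales_order
WHERE  update_time > :last_sync_time   -- watermark saved from the previous run
ORDER  BY update_time;
```

Raising the poll frequency narrows the blind window but multiplies the load on the DB, which is exactly the trade-off described above.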

Just as the Flink manifesto says, "real-time is the future." In today's context, real-time is an important problem to solve, so we compared the mainstream log-based CDC technologies, as shown in the figure above:
- Data sources: besides solid support for traditional relational databases, Flink CDC also supports document stores and popular NewSQL databases (TiDB, OceanBase); Debezium's database coverage is somewhat narrower, but its support for mainstream relational databases is very good; Canal and OGG each support only a single data source.
- Resumable synchronization: supported by all four technologies.
- Sync mode: except for Canal, which supports only incremental sync, the other technologies support full + incremental. Full + incremental means the switch from the full phase to the incremental phase is completed by the CDC technology itself, with no need to add a separate incremental job after the full job.
- Community activity: Flink CDC has a very active community with rich materials, and the official site provides detailed documentation and quick-start tutorials; the Debezium community is also quite active, but most materials are in English; Canal has a very large user base and relatively abundant materials, but only average community activity; OGG is part of Oracle's big-data suite, requires payment, and only official materials exist.
- Development difficulty: Flink CDC offers two development models, Flink SQL and the Flink DataStream API. With Flink SQL in particular, a data synchronization task can be developed with very simple SQL, making it easy to get started; with Debezium you must parse the captured change log and process it yourself, and the same goes for Canal.
- Runtime dependency: Flink CDC uses Flink as its engine; Debezium usually runs inside Kafka Connect; Canal and OGG both run standalone.
- Downstream richness: Flink CDC benefits from Flink's very active ecosystem and rich connectors, reaching common relational databases as well as big-data storage engines such as Iceberg, ClickHouse, and Hudi; Debezium pairs with Kafka and JDBC connectors, supporting MySQL, Oracle, and SQL Server downstream; with Canal you can only consume the data directly or ship it to an MQ for downstream consumption; OGG, being Oracle's proprietary suite, has poor downstream richness.
2. Current Internal Business Scenarios

- Before 2018, Dajian Cloud Warehouse synchronized data between systems through multiple data-source applications running on schedules.
- After 2020, with the rapid growth of cross-border business, full-table scans by these multi-data-source applications often affected online applications, and managing the execution order of the scheduled tasks became chaotic.
- Therefore, in 2021 we began surveying and selecting CDC technology, set up a small test scenario, and ran small-scale experiments.
- In 2022, we brought online the inventory-scenario synchronization function of the LDSS system, implemented on Flink CDC.
- In the future, we hope to build a data synchronization platform on top of Flink CDC, so that synchronization tasks can be developed, configured, tested, and released through a UI, and the whole lifecycle of synchronization tasks can be managed online.

LDSS inventory management involves four business roles:
- Warehousing: wants warehouse capacity and commodity-category allocation to be reasonable. Capacity needs some buffer in case a sudden inbound surge overflows the warehouse; on the category side, unreasonable allocation of seasonal goods creates hot spots, which brings great challenges to warehouse management.
- Platform customers: hope orders are processed promptly and goods are delivered quickly and accurately.
- Logistics: hopes to improve logistics efficiency, reduce logistics costs, and make efficient use of limited capacity.
- Decision making: hopes the LDSS system can give scientific suggestions on when and where to build a new warehouse.

The figure above shows the architecture of the LDSS inventory-management order-splitting scenario.
First, the multi-data-source synchronization application pulls data from the warehousing system, the platform system, and the internal ERP system, extracting the required data into the LDSS database to support the business functions of the LDSS order, inventory, and logistics modules.
Second, effective order-splitting decisions require commodity, order, and warehouse information. The scheduled multi-data-source synchronization task queries via JDBC, filters by timestamp, and synchronizes the changed data into the LDSS system, which then makes order-splitting decisions on this data to reach the optimal solution.

For the scheduled-task synchronization code, you first define the scheduled task: its class, execution method, and execution interval.
The left side of the figure above shows the scheduled task's definition; the right side shows its logic. First, the Oracle database is queried, then the results are upserted into the MySQL database — that completes the scheduled task. This stays close to raw JDBC querying: rows are inserted into the corresponding tables one by one, the development logic is very tedious, and bugs creep in easily.
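The upsert half of such a scheduled task boils down to statements like the following (table and column names are illustrative, not the actual LDSS schema):

```sql
-- Write each row fetched from Oracle into the LDSS MySQL database.
-- MySQL's ON DUPLICATE KEY UPDATE provides upsert semantics: insert the
-- row, or update it in place if the primary key already exists.
INSERT INTO ldss_inventory (sku_id, warehouse_id, qty, update_time)
VALUES (?, ?, ?, ?)
ON DUPLICATE KEY UPDATE
    qty         = VALUES(qty),
    update_time = VALUES(update_time);
```

Every synchronized table needs its own variant of this boilerplate, which is why the approach becomes tedious and error-prone.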
Therefore, we rebuilt it on Flink CDC.

The figure above shows the real-time synchronization scenario based on Flink CDC; the only change is replacing the previous multi-source synchronization application with Flink CDC.
First, SqlServer CDC, MySQL CDC, and Oracle CDC connect to and extract table data from the corresponding warehousing, platform, and ERP system databases, then write it into the LDSS MySQL database through the JDBC connector provided by Flink. The CDC connectors convert the heterogeneous sources into unified Flink internal types, which downstream operators then write out.
Compared with the previous architecture, this one is non-intrusive to business systems and relatively simple to implement.

We introduced MySQL CDC and SqlServer CDC to connect to the B2B platform's MySQL database and the warehouse system's SqlServer database respectively, then wrote the extracted data into the LDSS MySQL database through the JDBC connector.
With this transformation, thanks to the real-time capability Flink CDC provides, we no longer need to manage tedious scheduled tasks.

Implementing the synchronization code on Flink CDC takes three steps:
- Step 1: define the source table — the table to be synchronized;
- Step 2: define the sink table — the target table the data should be written to;
- Step 3: an INSERT INTO ... SELECT statement completes the CDC synchronization task.
This development mode is very simple and the logic is clear. Moreover, by relying on Flink CDC and the Flink framework, the synchronization task also gains failure retry, distributed execution, high availability, and consistent full-to-incremental switching.
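A minimal Flink SQL sketch of the three steps (connection parameters, credentials, and table names are placeholders, not our production configuration; connector options may differ slightly across Flink CDC versions):

```sql
-- Step 1: source table, backed by the SqlServer CDC connector.
CREATE TABLE wms_inventory (
    sku_id       BIGINT,
    warehouse_id INT,
    qty          INT,
    PRIMARY KEY (sku_id, warehouse_id) NOT ENFORCED
) WITH (
    'connector'     = 'sqlserver-cdc',
    'hostname'      = 'wms-db-host',
    'port'          = '1433',
    'username'      = '...',
    'password'      = '...',
    'database-name' = 'wms',
    'schema-name'   = 'dbo',
    'table-name'    = 'inventory'
);

-- Step 2: sink table, written through Flink's JDBC connector.
CREATE TABLE ldss_inventory (
    sku_id       BIGINT,
    warehouse_id INT,
    qty          INT,
    PRIMARY KEY (sku_id, warehouse_id) NOT ENFORCED
) WITH (
    'connector'  = 'jdbc',
    'url'        = 'jdbc:mysql://ldss-db-host:3306/ldss',
    'table-name' = 'inventory',
    'username'   = '...',
    'password'   = '...'
);

-- Step 3: one statement wires full + incremental synchronization together.
INSERT INTO ldss_inventory SELECT * FROM wms_inventory;
```

Because the sink declares a primary key, the JDBC connector writes in upsert mode, so updates and deletes captured by CDC are applied correctly downstream.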
3. Future Internal Promotion and Platform Construction

The figure above shows the platform architecture.
On the left are the sources provided by Flink CDC + Flink: data can be extracted from this rich set of sources and, through development on the data platform, written to the targets. The target side relies on Flink's strong ecosystem and supports data lakes, relational databases, MQs, and more.
Flink currently has two deployment modes: Flink on YARN, popular in China, and Flink on Kubernetes. The data platform in the middle manages the Flink clusters below and supports, above, online SQL development, task development, lineage management, task submission, online Notebook development, permissions and configuration, and task performance monitoring and alerting; it also manages data sources.

The demand for data synchronization within the company is particularly strong, so development efficiency needs to be improved through a platform to speed up delivery. Platformization also lets the company unify its data synchronization technologies, consolidate the synchronization stack, and reduce maintenance costs.
The goals of platformization are:
- manage meta information such as data sources and tables well;
- complete the whole task lifecycle on the platform;
- provide task performance observability and alerting;
- simplify development so that, after brief training, business developers can start writing synchronization tasks.
Platformization brings benefits in three respects:
- data synchronization tasks are consolidated and centrally managed;
- the platform manages and maintains the whole lifecycle of synchronization tasks;
- a dedicated team takes responsibility and can focus on cutting-edge data integration technologies.
With the platform in place, more business scenarios can be served quickly.

- Real-time data warehouse: we hope Flink CDC can support more real-time data warehouse scenarios, using Flink's computing power to build database materialized views. Moving the computation out of the DB into Flink and writing the results back to the database accelerates real-time scenarios such as platform reporting, statistics, and analysis.
- Real-time applications: Flink CDC captures changes at the DB layer, so it can be used to update search-engine content in real time and to push financial and accounting data to the financial system in real time. Today, most financial-system data is produced by business systems running scheduled tasks with many joins, aggregations, and groupings before being pushed over. With Flink CDC's data capture capability plus Flink's computing power, pushing this data to the accounting and financial systems in real time lets problems in the business be found promptly, reducing the company's losses.
- Cache: with Flink CDC, a real-time cache decoupled from the traditional application can be built, greatly improving the performance of online applications.
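For the materialized-view scenario, the idea is simply a continuous aggregation query written back to the database (the schema and names below are hypothetical):

```sql
-- The aggregate the reporting page used to compute inside the DB
-- now runs continuously in Flink; every change captured by CDC
-- updates the result, which is upserted back into MySQL.
INSERT INTO mv_warehouse_stock          -- JDBC upsert sink with a primary key
SELECT warehouse_id,
       COUNT(DISTINCT sku_id) AS sku_cnt,
       SUM(qty)               AS total_qty
FROM   wms_inventory                    -- CDC source table
GROUP BY warehouse_id;
```

Unlike a DB-side materialized view refreshed on a schedule, this result stays continuously up to date without loading the source database.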
With the help of the platform, we believe Flink CDC can better unleash its capabilities within the company.
4. Community Cooperation

We hope to cooperate with the community in diverse ways, to improve both our code quality and the company's open-source collaboration capability. The cooperation covers three aspects:
- First, open-source co-construction. We hope for more opportunities to exchange and share our in-house Flink CDC experience and adoption scenarios with peers, and we will run internal Flink CDC training so that everyone understands the technology and can use it to solve more business pain points in daily work.
The cooperation with the community has already borne some fruit: the company contributed the SqlServer CDC Connector to the community and co-developed the TiDB CDC Connector.
- Second, serving the community. We will cultivate our developers' open-source collaboration skills and contribute the features of the company's internal version back to the community — only after being polished by the community's wide user base can features become stable and well designed. We also hope to work closely with the community on schema evolution, performance tuning, and whole-database synchronization.
- Third, exploring directions. We believe Flink CDC will not rest on its current achievements and will keep moving toward further goals, so we hope to explore with the community the directions Flink CDC could take next.
Our most recent cooperation with the community is contributing the SqlServer CDC features implemented on the concurrent lock-free framework back to the community.

The figure above shows how SqlServer CDC works.
First, SqlServer records data changes in the transaction log. The capture process matches the log entries of tables with CDC enabled, converts the matched entries, and inserts them into the CDC-generated change tables. Finally, the SqlServer CDC connector calls the CDC query functions to fetch INSERT, UPDATE, DELETE, and DDL statements in real time, converting them into Flink's internal OpType and RowData for computation, lake ingestion, and so on.
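For reference, SqlServer's native CDC is enabled per database and per table with system stored procedures like these (the database and table names are illustrative):

```sql
-- Enable CDC at the database level (creates the cdc schema and capture job).
USE wms;
EXEC sys.sp_cdc_enable_db;

-- Enable CDC for one table; SqlServer then generates a change table
-- (cdc.dbo_inventory_CT) plus the cdc.fn_cdc_get_all_changes_* query
-- functions that the Flink SqlServer CDC connector reads from.
EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'inventory',
     @role_name     = NULL;   -- no gating role for change-data access
```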

After using the current version of SqlServer CDC, community users' main feedback was:
- Table locking during the snapshot: locking tables is intolerable to DBAs, who cannot accept locks on the database, and the locks also affect online applications.
- No checkpoint during the snapshot: being unable to checkpoint means that if the snapshot fails, the whole snapshot must be restarted from scratch — very unfriendly for large tables.
- Single-parallelism snapshot: for tables with tens of millions or hundreds of millions of rows, a single-parallelism snapshot needs more than ten or even dozens of hours, greatly constraining SqlServer CDC's applicable scenarios.

We practiced and improved on these problems, borrowing the concurrent lock-free algorithm from the community's 2.0 MySQL CDC, and optimized SqlServer CDC accordingly: a lock-free snapshot with snapshot consistency, checkpoint support during the snapshot, and concurrent snapshot reads that speed up the snapshot phase. For large-table synchronization, the benefit of concurrency is especially obvious.
However, since the 2.2 release the community has abstracted MySQL's concurrent lock-free design into a unified public framework, so SqlServer CDC must be re-adapted to this framework before it can be contributed back.
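The heart of the lock-free algorithm is splitting the snapshot into primary-key chunks that can be read concurrently and checkpointed individually; conceptually, the chunk reads are plain range queries (table name and boundaries are illustrative):

```sql
-- Each snapshot chunk is an independent range scan -- no table lock needed.
SELECT * FROM inventory WHERE sku_id >= 0     AND sku_id < 10000;  -- chunk 1
SELECT * FROM inventory WHERE sku_id >= 10000 AND sku_id < 20000;  -- chunk 2
-- ...
-- After a chunk is read, the change-log entries captured for that key range
-- during the scan are merged in, yielding a consistent snapshot per chunk.
```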
Q&A
Q: Do we need to enable SqlServer's own CDC feature?
A: Yes. SqlServer CDC is built on the CDC feature of the SqlServer database itself.
Q: How are materialized-view refreshes triggered?
A: The SQL that produces the materialized view runs in Flink via Flink CDC. The computation is triggered by changes to the source tables, and the result is synchronized to the materialized-view table — no scheduled task is involved.
Q: How is platformization being done?
A: Platformization draws on the many excellent open-source projects and platforms in the community, such as StreamX and DLink.
Q: Is there a bottleneck in SqlServer CDC's consumption of the transaction log?
A: SqlServer CDC does not consume the log directly. SqlServer's capture process determines which tables in the log have CDC enabled, pulls the change data of those tables from the log, and inserts it into change tables; the connector then obtains the changes through the CDC query functions generated when CDC is enabled.
Q: How is Flink CDC's high availability guaranteed when there are many synchronization tasks or heavy processing?
A: Flink's high availability relies on Flink features such as checkpoints. For many synchronization tasks or heavy processing, we recommend multiple downstream Flink clusters, classified by the latency requirements of the synchronization, with each task published to the corresponding cluster.
Q: Is Kafka needed in the middle?
A: It depends on whether the synchronization task or the data warehouse architecture needs the intermediate data landed in Kafka.
Q: A database contains multiple tables — can they be put into one task?
A: It depends on the development mode. With SQL, writing multiple tables requires multiple tasks; but Flink CDC also provides the more advanced DataStream mode, in which multiple tables can run in one task.
Q: Does Flink CDC support reading Oracle logs from a standby database?
A: Not yet.
Q: After synchronizing via CDC, how is data quality on both ends monitored and compared?
A: Currently only periodic sampling checks are possible; data quality has always been a thorny problem in the industry.
Q: Which scheduling system does Dajian Cloud Warehouse use, and how does it integrate with Flink CDC?
A: We use XXL-Job for distributed task scheduling; the CDC pipelines use no scheduled tasks.
Q: If tables are added to or removed from the capture list, does SqlServer CDC need a restart?
A: SqlServer CDC does not currently support adding tables dynamically.
Q: Do synchronization tasks affect system performance?
A: CDC-based synchronization tasks certainly affect system performance — the snapshot phase in particular impacts the database and thus the application system. The community will work on rate limiting and concurrent lock-free implementations for all connectors, all to broaden CDC's applicable scenarios and improve ease of use.
Q: How are savepoints handled in the full and incremental phases?
A: (For connectors not yet on the concurrent lock-free framework) triggering a savepoint during the full phase is not allowed; during the incremental phase, if you need to stop and redeploy, you can recover the task from a savepoint.
Q: CDC synchronizes data to Kafka, and what Kafka holds is the binlog — how are historical and real-time data kept?
A: Sync all CDC data to Kafka; how much is retained depends on Kafka's log cleanup policy, and you can keep all of it.
Q: Does CDC filter the binlog by operation type, and does that affect efficiency?
A: Even with filtering, the performance impact is small.
Q: During the MySQL snapshot initialization phase, when multiple programs read different tables, the programs report errors that the table lock cannot be acquired. Why?
A: First check whether MySQL CDC is running the old implementation; try the new concurrent lock-free implementation instead.
Q: How does MySQL switch between full and incremental phases for a table with hundreds of millions of rows?
A: We suggest reading Xue Jin's blogs on the 2.0 release, which explain very clearly how the lock-free consistent snapshot is implemented and how the full-to-incremental switch is completed.