
Practice of Flink CDC in Dajian cloud warehouse

2022-06-11 18:58:00 Apache Flink

This article is based on a talk given by Gong Zhongqiang, head of infrastructure at Dajian Cloud Warehouse and Flink CDC Maintainer, at the Flink CDC Meetup on May 21. The main contents include:

  1. Background of introducing Flink CDC
  2. Current internal business scenarios
  3. Future internal adoption and platform construction
  4. Community cooperation

Click to view the live playback & presentation PDF

1. Background of introducing Flink CDC

The company introduced CDC technology mainly to meet the needs of the following four roles:

  • Logistics scientists: need inventory, sales order, logistics bill and other data for analysis.
  • Developers: need to synchronize basic information from other business systems.
  • Finance: hopes financial data can be delivered to the financial system in real time, rather than waiting until the end of the month.
  • Management: needs a big-screen data dashboard to view the company's business and operations at a glance.

CDC (Change Data Capture) is the technology of capturing data changes. Broadly speaking, any technique that can capture data changes can be called CDC, but the CDC we usually talk about is mainly oriented toward database changes.

There are two main ways to implement CDC: query-based and log-based:

  • Query-based: data is queried and then inserted or updated into the target database, with no special database configuration or account permissions required. Its timeliness depends on the query frequency; real-time behavior can only be approximated by raising the query frequency, which inevitably puts heavy pressure on the DB. In addition, because it is query-based, it cannot capture change records that occur between two queries, so data consistency cannot be guaranteed.
  • Log-based: consumes the database's change log in real time, so timeliness is very high. It has little impact on the DB and can also guarantee data consistency, because the database records every data change in the change log; by consuming the log, the full change history of the data can be reconstructed. Its drawback is that implementation is relatively complex: change logs differ across databases in format, in how they are enabled, and in the special permissions required, so adaptation work has to be done for each database.
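As a concrete illustration of the query-based approach (the table, columns, and `:parameter` placeholders here are hypothetical), a polling job typically filters on a last-modified timestamp and upserts the result into the target:

```sql
-- Query-based CDC (polling): runs on a schedule, e.g. once a minute.
-- A row updated twice between two polls is seen only once, and hard
-- deletes are missed entirely -- hence the consistency limitation.
SELECT id, sku, qty, update_time
FROM   inventory
WHERE  update_time > :last_sync_time;   -- watermark kept by the sync job

-- Upsert the fetched rows into the target table (MySQL syntax):
INSERT INTO inventory_copy (id, sku, qty, update_time)
VALUES (:id, :sku, :qty, :update_time)
ON DUPLICATE KEY UPDATE
  sku = VALUES(sku), qty = VALUES(qty), update_time = VALUES(update_time);
```

Raising the polling frequency shrinks the blind window between two queries but multiplies the load on the source DB, which is exactly the trade-off described above.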

As the Flink manifesto puts it, "real-time is the future." In today's context, timeliness is an important problem to solve, so we compared the mainstream log-based CDC technologies, as shown in the figure above:

  • Data sources: besides good support for traditional relational databases, Flink CDC also supports document databases, NewSQL databases (TiDB, OceanBase) and other popular stores; Debezium's database coverage is somewhat narrower, but its support for mainstream relational databases is very good; Canal and OGG each support only a single data source.
  • Resumable transfer: all four technologies support it.
  • Synchronization mode: except for Canal, which supports only incremental synchronization, the other technologies support full + incremental synchronization. Full + incremental means the switch from the full snapshot to incremental reading is handled by the CDC technology itself, with no need to bolt an incremental job onto a full-load job to achieve full + incremental reading.
  • Community activity: Flink CDC has a very active community with rich materials, and the project provides detailed documentation and quick-start tutorials; the Debezium community is also quite active, but most materials are in English; Canal has a very large user base and relatively abundant materials, but only average community activity; OGG is part of Oracle's big data suite, requires payment, and has only official materials.
  • Development difficulty: Flink CDC offers two development models, Flink SQL and Flink DataStream. With Flink SQL in particular, a data synchronization task can be built with very simple SQL, so it is easy to get started; with Debezium, the collected change logs must be parsed and processed separately, and the same goes for Canal.
  • Runtime dependencies: Flink CDC uses Flink as its engine; Debezium usually runs inside Kafka Connect as its container; Canal and OGG both run standalone.
  • Downstream richness: Flink CDC benefits from Flink's very active ecosystem and can connect to a rich set of downstreams, supporting common relational databases as well as big data storage engines such as Iceberg, ClickHouse and Hudi; Debezium offers the Kafka JDBC connector, supporting MySQL, Oracle and SQL Server; with Canal, data can only be consumed directly or exported to an MQ for downstream consumption; OGG, being an official Oracle suite, has poor downstream richness.

2. Current internal business scenarios

  • Before 2018, data synchronization at Dajian Cloud Warehouse worked like this: multiple data-source applications synchronized data between systems on a schedule.
  • After 2020, with the rapid growth of the cross-border business, the multi-data-source applications often ran full-DB scans that affected online applications, and the execution-order management of the scheduled tasks became chaotic.
  • Therefore, in 2021 we began researching and selecting CDC technology, and set up small test scenarios for small-scale experiments.
  • In 2022, we brought online the LDSS system's inventory-scenario synchronization function based on Flink CDC.
  • In the future, we hope to build a data synchronization platform on top of Flink CDC, so that synchronization tasks can be developed, tested and launched through UI-based development and configuration, and their whole lifecycle can be managed online.

LDSS inventory management involves four business scenarios:

  • Warehouse department: the warehouse's storage capacity and the distribution of commodity categories need to be reasonable. For capacity, some buffer must be reserved in case a sudden inbound surge overflows the warehouse; for commodity categories, unreasonable allocation of seasonal commodity inventory creates hot spots, which pose great challenges for warehouse management.
  • Platform customers: hope orders are processed promptly and goods are delivered to customers quickly and accurately.
  • Logistics department: hopes to improve logistics efficiency, reduce logistics costs, and make efficient use of limited transport capacity.
  • Decision-making department: hopes the LDSS system can provide scientific suggestions on when and where to build new warehouses.

The figure above shows the architecture of the LDSS order-splitting scenario for inventory management.

First, a multi-data-source synchronization application pulls data from the warehouse systems, the platform systems and the internal ERP system, and extracts the required data into the LDSS system's database to support the business functions of the three LDSS modules: orders, inventory and logistics.

Second, product, order and warehouse information is needed to make effective order-splitting decisions. The multi-data-source scheduled synchronization task is based on JDBC queries, filters by time, and synchronizes the changed data into the LDSS system. The LDSS system then makes order-splitting decisions based on this data to arrive at the optimal solution.

In the scheduled-task synchronization code, you first need to define the scheduled task: its class, its execution method, and its execution interval.

The left side of the figure above shows the definition of the scheduled task; the right side shows the task's business logic. First, Oracle is queried, and the results are then upserted into MySQL, which completes the scheduled-task development. This is close to raw JDBC querying: data is inserted into the corresponding database tables one by one, the development logic is tedious, and bugs creep in easily.

Therefore, we rebuilt this based on Flink CDC.

The figure above shows the real-time synchronization scenario based on Flink CDC; the only change is replacing the previous multi-source synchronization application with Flink CDC.

First, the SqlServer CDC, MySQL CDC and Oracle CDC connectors connect to and extract table data from the corresponding warehouse, platform and ERP system databases, then write it into the LDSS system's MySQL database through the JDBC connector provided by Flink. The SqlServer CDC, MySQL CDC and Oracle CDC connectors convert the heterogeneous data sources into Flink's unified internal types, which are then written downstream.

Compared with the previous architecture, this one is non-intrusive to the business systems and relatively simple to implement.

We introduced MySQL CDC and SqlServer CDC to connect to the B2B platform's MySQL database and the warehouse system's SqlServer database respectively, then wrote the extracted data into the LDSS system's MySQL database through the JDBC connector.

With this transformation, thanks to the real-time capability Flink CDC provides, we no longer need to manage tedious scheduled tasks.

Implementing the synchronization code based on Flink CDC takes the following three steps:

  1. Step one: define the source table, i.e. the table to be synchronized;
  2. Step two: define the target table, i.e. the table the data should be written to;
  3. Step three: with an INSERT ... SELECT statement, the CDC synchronization task is complete.

This development mode is very simple and the logic is clear. Moreover, by relying on Flink CDC and the Flink framework, the synchronization task also gains failure retry, distribution, high availability, consistent full-to-incremental switching, and other capabilities.
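The three steps above can be sketched in Flink SQL as follows. Everything here is illustrative: the hostnames, credentials and the `stock` table are hypothetical; the connector options follow the Flink CDC and Flink JDBC connector documentation.

```sql
-- Step 1: define the source table with the SqlServer CDC connector.
CREATE TABLE stock_source (
  id   INT,
  sku  STRING,
  qty  INT,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector'     = 'sqlserver-cdc',
  'hostname'      = 'wms-db.internal',   -- hypothetical host
  'port'          = '1433',
  'username'      = 'flink_cdc',
  'password'      = '******',
  'database-name' = 'wms',
  'schema-name'   = 'dbo',
  'table-name'    = 'stock'
);

-- Step 2: define the target table with the JDBC connector (MySQL).
CREATE TABLE stock_sink (
  id   INT,
  sku  STRING,
  qty  INT,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector'  = 'jdbc',
  'url'        = 'jdbc:mysql://ldss-db.internal:3306/ldss',
  'table-name' = 'stock',
  'username'   = 'ldss',
  'password'   = '******'
);

-- Step 3: one INSERT ... SELECT completes the synchronization task.
INSERT INTO stock_sink SELECT * FROM stock_source;
```

Because the sink declares a primary key, the JDBC connector writes in upsert mode, so updates and deletes captured by CDC are applied to the target table rather than appended.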

3. Future internal adoption and platform construction

The figure above shows the platform architecture.

On the left are the sources provided by Flink CDC + Flink: data can be extracted from a rich set of sources and, through development on the data platform, written to the targets. The target side relies on Flink's strong ecosystem and supports data lakes, relational databases, MQs and more.

Flink currently has two deployment modes: Flink on Yarn, which is popular in China, and Flink on Kubernetes. The data platform in the middle manages the Flink clusters below it and supports, above it, SQL online development, task development, lineage management, task submission, online Notebook development, permissions and configuration, and task performance monitoring and alerting; it also manages the data sources.

The demand for data synchronization within the company is particularly strong, and development efficiency needs to be improved through the platform to accelerate delivery. After platformization, the company's data synchronization technology can also be unified, the synchronization stack converged, and maintenance costs reduced.

The goals of platformization are as follows:

  1. Manage meta-information such as data sources and tables well;
  2. Complete the whole lifecycle of a task on the platform;
  3. Provide task performance observability and alerting;
  4. Simplify development and shorten onboarding, so that business developers can start building synchronization tasks after simple training.

Platformization can bring benefits in the following three aspects:

  1. Converge data synchronization tasks for unified management;
  2. Manage and maintain the whole lifecycle of synchronization tasks on the platform;
  3. Put a dedicated team in charge, so that team can focus on cutting-edge data integration technologies.

With the platform in place, more business scenarios can be onboarded quickly.

  • Real-time data warehouse: we hope to use Flink CDC to support more real-time data warehouse business scenarios, leveraging Flink's powerful computing capability to build database materialized views. Freeing the computation from the DB, performing it in Flink and then writing the results back to the database accelerates real-time application scenarios such as reporting, statistics and analysis for platform applications.
  • Real-time applications: Flink CDC can capture changes at the DB layer, so it can be used to update search engine content in real time and to push finance and accounting data to the financial system in real time. Today, most data in the financial system requires the business systems to run scheduled tasks and go through many joins, aggregations and groupings before it can be computed and pushed over. With Flink CDC's powerful change-capture capability plus Flink's computing power, this data can instead be pushed to the accounting and financial systems in real time, so problems in the business are discovered promptly and the company's losses are reduced.
  • Cache: with Flink CDC, a real-time cache decoupled from the traditional application can be built, greatly improving the performance of online applications.
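To illustrate the materialized-view idea mentioned above (the table names, hosts and credentials here are hypothetical), a Flink SQL job can keep a pre-aggregated result continuously up to date as the source table changes:

```sql
-- Source: order lines captured in real time via MySQL CDC.
CREATE TABLE order_item (
  order_id BIGINT,
  sku      STRING,
  amount   DECIMAL(10, 2),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector'     = 'mysql-cdc',
  'hostname'      = 'b2b-db.internal',   -- hypothetical host
  'port'          = '3306',
  'username'      = 'flink_cdc',
  'password'      = '******',
  'database-name' = 'b2b',
  'table-name'    = 'order_item'
);

-- "Materialized view" table: aggregated results written back for reporting.
CREATE TABLE sku_sales_mv (
  sku          STRING,
  total_amount DECIMAL(18, 2),
  PRIMARY KEY (sku) NOT ENFORCED
) WITH (
  'connector'  = 'jdbc',
  'url'        = 'jdbc:mysql://report-db.internal:3306/report',
  'table-name' = 'sku_sales_mv',
  'username'   = 'report',
  'password'   = '******'
);

-- Every insert/update/delete on order_item incrementally updates the
-- aggregate, which is upserted into the reporting table by primary key.
INSERT INTO sku_sales_mv
SELECT sku, SUM(amount) AS total_amount
FROM order_item
GROUP BY sku;
```

Unlike a scheduled refresh, the aggregate here is recomputed incrementally on each change, so the reporting table lags the source only by the pipeline latency.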

With the help of the platform, we believe Flink CDC can unleash its capabilities even better within the company.

4. Community cooperation

We hope to cooperate with the community in diverse ways to improve our code quality and the company's open-source collaboration capability. Community cooperation will proceed mainly along three lines:

  • First, open-source co-construction. We hope to have more opportunities to exchange and share our Flink CDC experience and adoption scenarios with peers. We will also run internal Flink CDC training, so that through it everyone understands Flink CDC and can use it to solve more business pain points in day-to-day work.

    The cooperation between the company and the community has already borne some fruit: the company contributed the SqlServer CDC Connector to the community and co-developed the TiDB CDC Connector.

  • Second, serving the community. We will cultivate the open-source collaboration skills of our developers and contribute features from the company's internal version to the community; only after being polished by the community's many users can features become more stable and reasonable. In addition, we hope to cooperate closely with the community in directions such as schema evolution, performance tuning, and whole-database synchronization.
  • Third, exploring directions. We believe Flink CDC will not rest on its current achievements and will keep moving toward further goals, so we hope to explore more possible directions for Flink CDC together with the community.

The company's most recent cooperation with the community is contributing the SqlServer CDC features implemented on the concurrent lock-free framework to the community.

The figure above shows the principle of SqlServer CDC.

First, SqlServer records data changes in the transaction log. The capture process matches, within the log, the entries of tables that have CDC enabled, converts the matched entries, and inserts them into the change tables generated by CDC. Finally, the SqlServer CDC connector calls the CDC query functions to obtain inserts, updates, deletes and DDL statements in real time, converts them into Flink's internal OpType and RawData representations, and performs computation, lake ingestion and so on.
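On the SqlServer side, the change tables and query functions described above come from the database's built-in CDC feature, which must be enabled first. A minimal sketch using SqlServer's documented system procedures (the `wms` database and `dbo.stock` table are hypothetical):

```sql
-- Enable CDC at the database level (requires sysadmin).
USE wms;
EXEC sys.sp_cdc_enable_db;

-- Enable CDC for one table; SqlServer starts a capture job that reads
-- the transaction log and fills the change table cdc.dbo_stock_CT.
EXEC sys.sp_cdc_enable_table
  @source_schema = N'dbo',
  @source_name   = N'stock',
  @role_name     = NULL;

-- The generated query function that a consumer polls for changes
-- between two log sequence numbers (LSNs):
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_stock');
DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_stock(@from_lsn, @to_lsn, N'all');
```

The connector's real-time behavior comes from repeatedly advancing the LSN window over these query functions rather than from consuming the transaction log directly.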

After using the current version of SqlServer CDC, community users mainly reported the following problems:

  1. Tables are locked during the snapshot: the locking operation is intolerable for DBAs and online applications alike; DBAs cannot accept tables in the database being locked, and the locks also affect online applications.
  2. No checkpoint during the snapshot: not being able to checkpoint means that once the snapshot process fails, it can only be restarted from scratch, which is very unfriendly for large tables.
  3. The snapshot process supports only single parallelism: for large tables with tens or hundreds of millions of rows, single-parallelism synchronization takes a dozen or even dozens of hours, which greatly constrains the application scenarios of SqlServer CDC.

We worked on and improved the above problems. Following the concurrent lock-free algorithm of the community's 2.0 MySQL CDC, we optimized SqlServer CDC to achieve a lock-free snapshot process with consistent snapshots, to support checkpointing during the snapshot, and to support parallelism during the snapshot to speed it up. For large-table synchronization, the advantage of parallelism is especially obvious.

However, because the 2.2 community release abstracted MySQL's concurrent lock-free design into a unified public framework, SqlServer CDC needs to be re-adapted to this common framework before it can be contributed to the community.

Q&A

Q: Does SqlServer's own CDC feature need to be enabled?

A: Yes. The SqlServer CDC connector is implemented on top of the CDC feature of the SqlServer database itself.

Q: How are materialized-view refreshes triggered, given that they used to be scheduled tasks?

A: With Flink CDC, the SQL that generates the materialized view runs in Flink; computation is triggered by changes to the original table, and the results are then synchronized to the materialized-view table.

Q: How is platformization implemented?

A: Platformization can draw on the many excellent open-source projects and platforms in the community, such as StreamX and DLink.

Q: Is there a bottleneck in SqlServer CDC's consumption of the transaction log?

A: SqlServer CDC does not consume the log directly. The principle is that SqlServer's capture process matches which tables in the log have CDC enabled, pulls the change data of those tables from the log, and inserts it into the change tables; the connector then obtains the change data through the CDC query functions generated after CDC is enabled.

Q: How does Flink CDC ensure high availability, and what is the approach when there are too many synchronization tasks or they are processing-intensive?

A: Flink's high availability relies on Flink features such as checkpoints. For too many or too intensive synchronization tasks, we recommend deploying multiple downstream Flink clusters, then classifying tasks by their timeliness requirements and publishing each task to the corresponding cluster.

Q: Is Kafka used in the middle?

A: It depends on whether the synchronization task or the data warehouse architecture needs the intermediate data to land in Kafka.

Q: A database contains multiple tables; can they be put into one task to run?

A: It depends on the development approach. In SQL development mode, writing multiple tables requires multiple tasks. But Flink CDC also provides a more advanced development approach, DataStream, with which multiple tables can be put into a single task.

Q: Does Flink CDC support reading Oracle logs from a standby database?

A: Not yet.

Q: After synchronizing via CDC, how is data quality monitored and compared between the two ends?

A: At present, data quality can only be checked by periodic sampling; data quality has always been a thorny problem in the industry.

Q: What scheduling system does Dajian Cloud Warehouse use, and how does it integrate with Flink CDC?

A: We use XXL-Job for distributed task scheduling; CDC does not use scheduled tasks.

Q: If tables are added to or removed from the collection, does SqlServer CDC need to restart?

A: SqlServer CDC does not currently support adding tables dynamically.

Q: Do synchronization tasks affect system performance?

A: Synchronization tasks based on CDC will certainly affect system performance to some degree; the snapshot process in particular impacts the database, and in turn the application systems. In the future the community plans to add rate limiting and implement the concurrent lock-free framework for all connectors, all to broaden CDC's application scenarios and improve ease of use.

Q: How are savepoints handled across the full and incremental phases?

A: For connectors not yet implemented on the concurrent lock-free framework, a savepoint cannot be triggered during the full-snapshot phase. During the incremental phase, if you need to stop for a release, the task can be recovered from a savepoint.

Q: CDC synchronizes data to Kafka, and what lands in Kafka is the binlog; how are historical data and real-time data retained?

A: If all CDC-synchronized data is sent to Kafka, how much is retained depends on Kafka's log cleanup policy; it can all be retained.

Q: Does CDC filter the binlog by log operation type, and does that affect efficiency?

A: Even with filtering, the impact on performance is small.

Q: During MySQL CDC's initial snapshot phase, when multiple programs read different tables, they report an error that the table lock cannot be obtained. What is the reason?

A: First check whether MySQL CDC is using the old, lock-based implementation; you can try the new version's concurrent lock-free implementation.

Q: For MySQL tables with hundreds of millions of rows, how are the full and incremental phases linked up?

A: We suggest reading Xuejin's blogs on the 2.0 release; they explain very simply and clearly how the lock-free consistent snapshot is implemented and how the switch from full to incremental is completed.



Copyright notice
This article was created by [Apache Flink]; please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/162/202206111832567698.html