Data Lake: Data Integration Tool DataX
2022-07-30 04:09:00 【YoungerChina】
Series of topics: Data Lake Series Articles
1. What is DataX
DataX is the open-source version of the data integration component of Alibaba Cloud DataWorks. It is mainly used for offline data synchronization. DataX aims to provide stable and efficient synchronization between heterogeneous data sources (that is, different kinds of data stores), including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP.
To solve the synchronization problem for heterogeneous data sources, DataX turns the complex mesh of point-to-point synchronization links into a star topology, with DataX acting as the intermediate transport layer that connects all data sources.

To add a new data source, you only need to connect it to DataX; it can then synchronize seamlessly with every data source that is already connected.
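The benefit of the star layout can be shown with simple counting. The sketch below is only an illustration, assuming every pair of sources needs a dedicated one-way link in the mesh case and each source needs exactly one Reader and one Writer in the star case; the function names are hypothetical and not part of DataX.

```python
def mesh_links(n: int) -> int:
    """One-way sync links needed when every source connects directly to every other."""
    return n * (n - 1)


def star_links(n: int) -> int:
    """Plugins needed with a central hub: one Reader and one Writer per source."""
    return 2 * n


# Compare how the two layouts grow as more data sources are added.
for n in (3, 6, 10):
    print(n, mesh_links(n), star_links(n))
```

Under these assumptions the mesh grows quadratically while the star grows linearly, which is why adding one more source only requires one more Reader/Writer pair.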
2. DataX3.0 framework design
DataX adopts the Framework + Plugin architecture, which abstracts the reading and writing of data sources as Reader/Writer plugins and incorporates them into the entire synchronization framework.

Reader (Acquisition Module)
Responsible for collecting data from the data source and sending the data to the Framework.
Writer
Responsible for continuously fetching data from the Framework and writing the data to the destination.
Framework (Intermediary)
Responsible for connecting Reader and Writer as a data transmission channel for both, and handling core technical issues such as buffering, flow control, concurrency, and data conversion.
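The division of labor between the three roles can be sketched in a few lines of Python. This is not DataX code (DataX itself is written in Java); it is a minimal illustration in which a bounded queue stands in for the Framework's channel, and the names `reader`, `writer`, `run_sync`, and `buffer_size` are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the record stream


def reader(source, channel):
    """Reader: pull records from the source and hand them to the framework."""
    for record in source:
        channel.put(record)  # blocks when the buffer is full
    channel.put(_DONE)


def writer(channel, sink):
    """Writer: keep fetching records from the framework and write them out."""
    while True:
        record = channel.get()
        if record is _DONE:
            break
        sink.append(record)


def run_sync(source, sink, buffer_size=2):
    """Framework: connect Reader and Writer through a bounded channel."""
    # A bounded queue provides buffering and simple back-pressure.
    channel = queue.Queue(maxsize=buffer_size)
    threads = [
        threading.Thread(target=reader, args=(source, channel)),
        threading.Thread(target=writer, args=(channel, sink)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink


rows = [("1", "alice"), ("2", "bob"), ("3", "carol")]
result = run_sync(rows, [])
print(result)
```

Because the Reader and Writer only ever touch the channel, either side can be swapped for a different data source without changing the other, which is the point of the plugin design.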
3. DataX3.0 Core Architecture
A single data synchronization run in DataX is called a Job. When DataX receives a Job, it starts one process that carries out the whole synchronization. The DataX Job module is the central management node of a single job and is responsible for data cleaning, splitting the job into subtasks, and managing TaskGroups.

After the DataX Job starts, it splits itself into multiple smaller Tasks (subtasks) according to the splitting strategy of the source, so that they can run concurrently.
The DataX Job then calls the Scheduler module, which groups the split Tasks into TaskGroups (task groups) according to the configured concurrency.
Each Task is started by its TaskGroup. Once started, a Task always runs a fixed Reader --> Channel --> Writer thread pipeline to carry out the synchronization.
While the job runs, the Job monitors the TaskGroups. Once all TaskGroups have completed, the Job exits successfully (on an abnormal exit, the process exit code is non-zero).
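A Job is described to DataX as a JSON file, and the configured concurrency mentioned above lives in that file. The sketch below builds such a configuration as a Python dictionary and serializes it. The layout (`job.setting.speed.channel`, `job.content[].reader/writer`) and the `streamreader`/`streamwriter` plugins follow the example commonly shown in the DataX quick start; treat the parameter values as assumptions and check them against the plugin documentation for your DataX version.

```python
import json

# Assumed job layout: one reader/writer pair and a concurrency of 3 channels.
job_config = {
    "job": {
        "setting": {"speed": {"channel": 3}},
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        # Each generated record has a single string column.
                        "column": [{"type": "string", "value": "hello"}],
                        "sliceRecordCount": 10,
                    },
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {"print": True, "encoding": "UTF-8"},
                },
            }
        ],
    }
}

job_json = json.dumps(job_config, indent=2)
print(job_json)
```

Saved to a file such as `job.json` (a hypothetical name), a configuration like this is typically passed to the `datax.py` launcher in the DataX `bin` directory, and the exit code of that process reflects whether the Job succeeded.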
DataX scheduling process:
- First, the DataX Job module splits the job into several Tasks, for example one Task per table when the source is sharded across databases and tables;
- Then it calculates how many TaskGroups are needed from the concurrency (number of channels) configured by the user: TaskGroups = total channels / channels per TaskGroup;
- Finally, each TaskGroup runs its Tasks with the concurrency allocated to it.
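The scheduling arithmetic can be worked through with a small example. The sketch below assumes that the number of TaskGroups is the total configured concurrency divided by the number of channels each TaskGroup runs, with 5 channels per TaskGroup taken as an assumed default, and that Tasks are dealt out round-robin. The function name and the round-robin assignment are simplifications for illustration, not the actual DataX scheduler.

```python
import math


def plan_task_groups(task_count, total_channels, channels_per_group=5):
    """Return a list of TaskGroups, each given as a list of task ids."""
    # Number of groups needed to provide the requested total concurrency.
    group_count = max(1, math.ceil(total_channels / channels_per_group))
    groups = [[] for _ in range(group_count)]
    for task_id in range(task_count):
        # Round-robin assignment keeps the groups evenly loaded.
        groups[task_id % group_count].append(task_id)
    return groups


# 100 split tables become 100 Tasks; the user asks for 20 concurrent channels.
groups = plan_task_groups(task_count=100, total_channels=20)
print(len(groups), [len(g) for g in groups])
```

With these numbers the plan has 4 TaskGroups of 25 Tasks each, and each TaskGroup works through its 25 Tasks using its 5 channels.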
4. References
[01]https://www.jb51.net/article/241637.htm