Data Lake: Data Integration Tool DataX
2022-07-30 04:09:00 【YoungerChina】
Series of topics: Data Lake Series Articles
1. What is DataX
DataX is the open-source version of the Data Integration component of Alibaba Cloud DataWorks, and is mainly used for offline (batch) data synchronization. DataX aims to provide stable and efficient synchronization between heterogeneous data sources (that is, different kinds of data stores), including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, and FTP.
To solve the synchronization problem between heterogeneous data sources, DataX turns the complex mesh of point-to-point synchronization links into a star-shaped topology, with DataX itself acting as the intermediate transport that connects all data sources.

When you need to add a new data source, you only need to connect it to DataX, and it can then synchronize seamlessly with every data source that is already connected.
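To make the mesh-versus-star comparison concrete, here is a minimal sketch that counts the links needed for n data sources. It assumes that in the mesh case every pair of sources needs a dedicated link in each direction, and that in the star case each source needs one Reader and one Writer. The function names are illustrative and not part of DataX.

```python
def mesh_links(n: int) -> int:
    """Directed point-to-point sync links needed between n data sources."""
    return n * (n - 1)


def star_links(n: int) -> int:
    """Links needed when every source talks only to a central hub:
    one Reader plus one Writer per source."""
    return 2 * n


# Compare how both topologies grow as sources are added.
for n in (3, 5, 10):
    print(f"sources={n} mesh={mesh_links(n)} star={star_links(n)}")
```

Under these assumptions, adding an eleventh source to ten existing ones costs 20 new mesh links but only 2 new star links, which is the point of routing everything through DataX.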
2. DataX 3.0 framework design
DataX adopts the Framework + Plugin architecture, which abstracts the reading and writing of data sources as Reader/Writer plugins and incorporates them into the entire synchronization framework.

Reader (Acquisition Module)
Responsible for collecting data from the data source and sending it to the Framework.
Writer
Responsible for continuously fetching data from the Framework and writing the data to the destination.
Framework (Intermediary)
Responsible for connecting Reader and Writer as a data transmission channel for both, and handling core technical issues such as buffering, flow control, concurrency, and data conversion.
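The division of labor between the three roles can be sketched with two threads and a bounded queue. This is only an illustration of the pattern, not DataX's actual plugin API (DataX plugins are written in Java); the names `reader`, `writer`, `sync`, and `buffer_size` are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def reader(source, channel):
    """Reader: collect records from the source and hand them to the framework."""
    for record in source:
        channel.put(record)  # blocks while the buffer is full (back-pressure)
    channel.put(_DONE)


def writer(channel, destination):
    """Writer: keep fetching records from the framework and write them out."""
    while True:
        record = channel.get()
        if record is _DONE:
            break
        destination.append(record)


def sync(source, destination, buffer_size=2):
    """Framework: connect one reader and one writer through a bounded buffer."""
    channel = queue.Queue(maxsize=buffer_size)
    threads = [
        threading.Thread(target=reader, args=(source, channel)),
        threading.Thread(target=writer, args=(channel, destination)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


rows = [{"id": i} for i in range(5)]
target = []
sync(rows, target)
print(target == rows)
```

The reader and writer never reference each other; they only see the channel. That is what lets a new source or destination be added as a plugin without touching the others.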
3. DataX 3.0 core architecture
A single data synchronization run in DataX is called a Job. When DataX receives a Job, it starts one process to carry out the entire synchronization. The DataX Job module is the central management node of a single job and is responsible for data cleanup, subtask splitting, and TaskGroup management.

After the DataX Job starts, it splits the Job into multiple small Tasks (subtasks) according to the splitting strategy of the source, so that they can run concurrently.
The DataX Job then calls the Scheduler module, which regroups the split Tasks into TaskGroups (task groups) according to the configured concurrency.
Each Task is started by its TaskGroup. Once started, a Task always runs a fixed Reader --> Channel --> Writer thread chain to carry out the synchronization.
While the job is running, the Job monitors the TaskGroups. When all TaskGroups have finished, the Job exits successfully; if it exits abnormally, the process exit code is non-zero.
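The configured concurrency comes from the job's JSON configuration. The sketch below builds such a configuration as a Python dictionary and prints it as JSON. It is modeled on the stream-to-stream example in the DataX quick start (the `streamreader` and `streamwriter` plugins and the `speed.channel` setting); the column values and record count are made up for illustration, and parameter names should be checked against the plugin documentation for your DataX version.

```python
import json

# One reader/writer pair plus the job-level concurrency setting.
job_config = {
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "sliceRecordCount": 10,  # illustrative value
                        "column": [
                            {"type": "long", "value": "10"},
                            {"type": "string", "value": "hello DataX"},
                        ],
                    },
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {"encoding": "UTF-8", "print": True},
                },
            }
        ],
        # "channel" is the configured concurrency used by the scheduler.
        "setting": {"speed": {"channel": 5}},
    }
}

job_json = json.dumps(job_config, indent=2)
print(job_json)
```

Saved to a file (for example `job.json`, a hypothetical name), a configuration like this is typically passed to the `datax.py` launcher script in the `bin` directory of the DataX installation.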
DataX scheduling process:
- First, the DataX Job splits the job into several Tasks, based on how the source databases and tables are sharded;
- Then it calculates how many TaskGroups to allocate from the concurrency (number of channels) configured by the user: number of TaskGroups = configured channels / channels per TaskGroup (5 by default, according to the official DataX documentation);
- Finally, each TaskGroup runs its assigned Tasks at the concurrency allocated to it.
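The scheduling arithmetic can be sketched as follows. The sketch assumes, as in the official DataX documentation, that each TaskGroup runs 5 channels by default, so the number of TaskGroups is the configured channel count divided by 5, rounded up. It also assumes the Tasks are spread evenly across the groups, which simplifies DataX's real assignment logic. The function name and parameters are hypothetical.

```python
import math


def plan_task_groups(task_count, channels, channels_per_group=5):
    """Return the number of Tasks assigned to each TaskGroup."""
    if task_count <= 0 or channels <= 0 or channels_per_group <= 0:
        raise ValueError("all arguments must be positive")
    # Assumption: there is no point in using more channels than Tasks.
    channels = min(channels, task_count)
    group_count = math.ceil(channels / channels_per_group)
    # Spread Tasks as evenly as possible; the first `extra` groups get one more.
    base, extra = divmod(task_count, group_count)
    return [base + 1 if i < extra else base for i in range(group_count)]


# 100 sharded tables (one Task each) with a configured concurrency of 20.
print(plan_task_groups(100, 20))  # → [25, 25, 25, 25]
```

With 100 Tasks and 20 channels, this yields 4 TaskGroups, each running 25 Tasks at a concurrency of 5.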
4. References
[01]https://www.jb51.net/article/241637.htm