Data Lake: Data Integration Tool DataX
2022-07-30 04:09:00 【YoungerChina】
Series of topics: Data Lake Series Articles
1. What is DataX
DataX is the open-source version of Alibaba Cloud DataWorks Data Integration, and is mainly used for offline data synchronization. It is committed to stable and efficient synchronization between heterogeneous data sources (i.e., different storage systems), including relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, FTP, and more.
To solve the synchronization problem among heterogeneous data sources, DataX turns the complex mesh of point-to-point synchronization links into a star-shaped data link, with DataX itself acting as the intermediate transmission carrier that connects the various data sources.
When you need to integrate a new data source, you only need to connect it to DataX, and it can then synchronize seamlessly with every existing data source.
2. DataX3.0 framework design
DataX adopts a Framework + Plugin architecture: reading from and writing to data sources are abstracted as Reader/Writer plugins, which are incorporated into the overall synchronization framework.
- Reader (acquisition module): collects data from the data source and sends it to the Framework.
- Writer: continuously fetches data from the Framework and writes it to the destination.
- Framework (intermediary): connects Reader and Writer as the data transmission channel between them, and handles core technical concerns such as buffering, flow control, concurrency, and data conversion.
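A DataX job is described by a JSON file that names a Reader and a Writer plugin per the architecture above. The sketch below is a minimal example that reads from MySQL and prints to the console; the username, password, table, and JDBC URL are placeholders, and plugin parameter names follow the commonly documented `mysqlreader`/`streamwriter` conventions:

```json
{
  "job": {
    "setting": {
      "speed": { "channel": 3 }
    },
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "root",
            "password": "****",
            "column": ["id", "name"],
            "connection": [
              {
                "table": ["user"],
                "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/test"]
              }
            ]
          }
        },
        "writer": {
          "name": "streamwriter",
          "parameter": { "print": true }
        }
      }
    ]
  }
}
```

Swapping either plugin (for example, replacing `streamwriter` with an HDFS writer) changes only that half of the configuration, which is the point of the star-shaped plugin design.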
3. DataX3.0 Core Architecture
DataX calls a single complete data synchronization run a Job. When DataX receives a Job, it starts a process to carry out the entire synchronization. The DataX Job module is the central management node of a single job, responsible for data cleaning, sub-task splitting, TaskGroup management, and similar functions.
After the DataX Job starts, it splits the Job into multiple small Tasks (sub-tasks) according to the splitting strategy of each source, so that they can be executed concurrently.
The DataX Job then calls the Scheduler module, which reassembles the split Tasks into TaskGroups (task groups) according to the configured concurrency.
Each Task is started by its TaskGroup. Once started, a Task runs a fixed Reader --> Channel --> Writer thread pipeline to complete the synchronization.
After the DataX job starts, the Job monitors the TaskGroups; when all TaskGroups have finished, the Job exits with code 0 (the exit code is non-zero on abnormal exit).
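The Reader --> Channel --> Writer pipeline inside a single Task can be sketched with two threads and a bounded queue standing in for the Channel. This is an illustrative model, not DataX's actual internal API; the names and record handling are assumptions:

```python
import threading
import queue

# Minimal model of one DataX Task: a Reader thread pushes records into a
# bounded Channel (queue), and a Writer thread drains them to a destination.
SENTINEL = object()  # the Reader signals end-of-data with a sentinel

def reader(channel, source):
    for record in source:           # "collect" records from the source
        channel.put(record)
    channel.put(SENTINEL)

def writer(channel, destination):
    while True:
        record = channel.get()
        if record is SENTINEL:
            break
        destination.append(record)  # "write" to the destination

source = list(range(10))
destination = []
channel = queue.Queue(maxsize=4)    # bounded queue ~ buffering / flow control

r = threading.Thread(target=reader, args=(channel, source))
w = threading.Thread(target=writer, args=(channel, destination))
r.start(); w.start()
r.join(); w.join()
print(destination == source)        # True: all records arrived, in order
```

The bounded queue also illustrates why the Framework can do flow control: when the Writer falls behind, `channel.put` blocks and naturally throttles the Reader.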
DataX scheduling process:
- First, the DataX Job module splits the job into a number of Tasks according to sharding rules such as sub-databases and sub-tables;
- Then it calculates how many TaskGroups need to be allocated according to the user-configured concurrency: TaskGroup = Task / Channel;
- Finally, each TaskGroup runs its Tasks with the concurrency allocated to it.
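The allocation step above can be sketched as follows. This is a simplified model: the constant of 5 channels per TaskGroup matches the commonly cited DataX default, but the constant name and round-robin assignment are illustrative assumptions, not DataX's exact scheduler code:

```python
import math

CHANNELS_PER_TASK_GROUP = 5  # assumed DataX default: 5 channels per TaskGroup

def assign_task_groups(tasks, channel_number):
    """Split a list of Tasks into TaskGroups based on the configured channels."""
    task_group_number = math.ceil(channel_number / CHANNELS_PER_TASK_GROUP)
    groups = [[] for _ in range(task_group_number)]
    # Round-robin the Tasks across the groups so the load stays balanced.
    for i, task in enumerate(tasks):
        groups[i % task_group_number].append(task)
    return groups

tasks = [f"task-{i}" for i in range(20)]          # e.g. 20 split sub-tasks
groups = assign_task_groups(tasks, channel_number=12)
print(len(groups))     # ceil(12 / 5) = 3 TaskGroups
print(len(groups[0]))  # 20 tasks round-robined over 3 groups: 7 in the first
```

With 12 configured channels, the 20 Tasks end up in 3 TaskGroups, and each TaskGroup then runs its share of Tasks concurrently.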