DataX study notes
2022-07-26 09:05:00 【It is not easy to live in vain】
1. Brief introduction
DataX is an offline synchronization tool for heterogeneous data sources that is widely used within Alibaba Group. It provides stable and efficient data synchronization between heterogeneous data sources (that is, different databases and storage systems), including relational databases (MySQL, Oracle, etc.), HDFS, Hive, MaxCompute (formerly ODPS), HBase, and FTP.
1.1 Design concept

To solve the problem of synchronizing heterogeneous data sources, DataX turns a complex mesh of point-to-point synchronization links into a star-shaped topology, with DataX at the center as the transmission carrier that connects the data sources. To add a new data source, you only need to connect it to DataX, and it can then synchronize seamlessly with every data source that is already connected.
1.2 Framework design
As an offline data synchronization framework, DataX is built on a Framework + plugin architecture. Reading from and writing to data sources are abstracted as Reader and Writer plug-ins, which are integrated into the overall synchronization framework.
Reader: the data collection module. It collects data from the source and sends it to the Framework.
Writer: the data writing module. It continuously takes data from the Framework and writes it to the destination.
Framework: connects the Reader and the Writer and serves as the data transmission channel between them. It handles the core technical concerns such as buffering, flow control, concurrency, and data conversion.
The open source version of DataX 3.0 runs synchronization jobs in single-machine, multithreaded mode. For details, see: Alibaba Cloud open source offline synchronization tool DataX 3.0 introduction
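To make the Reader/Writer pairing concrete, here is a minimal sketch of a job description that wires one Reader plug-in to one Writer plug-in. It is modeled on the stream-to-stream example in the official user guide; the helper name build_stream_job is made up for this note, and the plug-in parameter names should be checked against the documentation of the plug-ins you actually use.

```python
import json


def build_stream_job(channel_count: int = 1, record_count: int = 10) -> dict:
    """Build a DataX-style job description as a plain dictionary.

    build_stream_job is a hypothetical helper for these notes, not a DataX API.
    """
    return {
        "job": {
            "content": [
                {
                    # Reader plug-in: produces records (here, generated in memory).
                    "reader": {
                        "name": "streamreader",
                        "parameter": {
                            "sliceRecordCount": record_count,
                            "column": [
                                {"type": "long", "value": "10"},
                                {"type": "string", "value": "hello DataX"},
                            ],
                        },
                    },
                    # Writer plug-in: consumes records (here, prints them).
                    "writer": {
                        "name": "streamwriter",
                        "parameter": {"encoding": "UTF-8", "print": True},
                    },
                }
            ],
            # Framework-level setting: how many channels run concurrently.
            "setting": {"speed": {"channel": channel_count}},
        }
    }


if __name__ == "__main__":
    # Print the job as JSON, the format DataX job files are written in.
    print(json.dumps(build_stream_job(channel_count=2), indent=2))
```

Swapping the data source or the destination only means replacing the "reader" or "writer" entry; the rest of the job description stays the same, which is the point of the plug-in design.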
1.3 Advantages
- Reliable data quality monitoring (so that data reaches the destination intact)
- Rich data conversion functions
- Precise speed control: DataX 3.0 provides three flow-control modes, by channel (concurrency), by record stream, and by byte stream, so you can tune the job speed and reach the best synchronization throughput that the database can tolerate.
- Strong synchronization performance: each Reader plug-in has one or more splitting strategies that divide a job into multiple Tasks for parallel execution. With the single-machine multithreaded execution model, DataX throughput grows linearly with concurrency.
- Robust fault tolerance (multi-level local and global retry)
- Minimal setup: usable right after download, with detailed log output.
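The record-stream flow control mentioned in the list above can be pictured with a small pacing sketch. This is not DataX's implementation; the class RecordThrottle and its parameters are invented for illustration, and it only shows the general idea of capping the number of records per second. The clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import time


class RecordThrottle:
    """Pace a stream of records to at most `records_per_second` (illustrative only)."""

    def __init__(self, records_per_second, clock=time.monotonic, sleep=time.sleep):
        if records_per_second <= 0:
            raise ValueError("records_per_second must be positive")
        self._rate = float(records_per_second)
        self._clock = clock
        self._sleep = sleep
        # Earliest moment at which the next batch is allowed to start.
        self._next_free = clock()

    def acquire(self, record_count=1):
        """Block until `record_count` records may be sent; return the wait in seconds."""
        now = self._clock()
        start = max(now, self._next_free)
        wait = start - now
        if wait > 0:
            self._sleep(wait)
        # Reserve the time these records cost at the configured rate.
        self._next_free = start + record_count / self._rate
        return wait
```

A writer loop would call acquire(len(batch)) before sending each batch. Counting bytes instead of records in the same class gives the byte-stream variant of the idea.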
1.4 System requirements
- Linux
- JDK (1.8 or later; 1.8 recommended)
- Python (Python 2.6.x recommended)
- Apache Maven 3.x (to compile DataX)
1.5 Build
Official steps: https://github.com/alibaba/DataX/blob/master/userGuid.md
2. Related concepts
Heterogeneous data sources
Heterogeneous data sources are data held in different database management systems. Because business systems and their data management systems are built and rolled out in stages, and because of technical, economic, and human factors, enterprises accumulate large volumes of business data stored in different ways as they grow. The data management systems in use also vary widely, from simple file databases to complex network databases. Together they make up the heterogeneous data sources of the enterprise.
The heterogeneity of enterprise data sources shows up in three main ways:
- System heterogeneity: the business application systems, database management systems, and even operating systems that the data sources depend on differ from one another.
- Schema heterogeneity: data sources differ in their storage model. Storage models mainly include the relational model, the object model, the object-relational model, and nested documents, with the relational model (relational databases) being the mainstream. Even within the same storage model, the model structures may differ. For example, the data types of different relational database management systems, such as DB2, Oracle, Sybase, Informix, SQL Server, and FoxPro, are not fully consistent.
- Source heterogeneity: the differences between internal data sources and external data sources.
3. DataX 3.0 core architecture
A single data synchronization run in DataX is called a Job. When DataX receives a Job, it starts a process to carry out the whole synchronization. The DataX Job module is the central management node of a single job and is responsible for data cleanup, sub-task splitting, TaskGroup management, and similar functions.

- After the DataX Job starts, it splits the Job into multiple smaller Tasks (sub-tasks) according to the splitting strategy of the source, so that they can run concurrently.
- The DataX Job then calls the Scheduler module, which regroups the split Tasks into TaskGroups (task groups) according to the configured concurrency.
- Each Task is started by its TaskGroup. Once started, a Task always runs a fixed Reader --> Channel --> Writer thread pipeline to carry out its synchronization work.
- While the job is running, the Job monitors the TaskGroups and waits for all of them to finish. When every TaskGroup has completed, the Job exits successfully (on an abnormal exit, the process exit value is non-zero).
Components:
- Job: the management node of a single job, responsible for data cleanup, sub-task splitting, and TaskGroup monitoring and management.
- Task: produced by splitting a Job; the smallest unit of work in DataX. Each Task synchronizes a portion of the data.
- Scheduler: groups Tasks into TaskGroups. The concurrency of a single TaskGroup is 5.
- TaskGroup: responsible for starting Tasks.
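The Reader --> Channel --> Writer flow of a Task, and the splitting of a Job into Tasks, can be sketched with plain threads and a bounded queue. This is a toy model under stated assumptions, not DataX code: the function names are invented, the "source" is a Python list, the "destination" is another list, and the splitting strategy is a simple round-robin.

```python
import queue
import threading

_END = object()  # sentinel that marks the end of a Task's data


def reader(rows, channel):
    """Reader side: push every source row into the channel."""
    for row in rows:
        channel.put(row)
    channel.put(_END)


def writer(channel, sink):
    """Writer side: drain the channel into the destination."""
    while True:
        row = channel.get()
        if row is _END:
            break
        sink.append(row)


def run_task(rows, sink, capacity=4):
    """One Task: a Reader thread and a Writer thread joined by a bounded Channel."""
    channel = queue.Queue(maxsize=capacity)  # bounded buffer gives back-pressure
    reader_thread = threading.Thread(target=reader, args=(rows, channel))
    writer_thread = threading.Thread(target=writer, args=(channel, sink))
    reader_thread.start()
    writer_thread.start()
    reader_thread.join()
    writer_thread.join()


def run_job(rows, task_count=3):
    """Job: split the rows into Tasks, run them concurrently, wait for all of them."""
    tasks = [rows[i::task_count] for i in range(task_count)]  # toy splitting strategy
    sinks = [[] for _ in tasks]
    threads = [
        threading.Thread(target=run_task, args=(task_rows, sink))
        for task_rows, sink in zip(tasks, sinks)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sinks
```

Calling run_job(list(range(10)), task_count=3) returns one list per Task. Each Task keeps the order of its own rows because it has a single Reader and a single Writer on a FIFO channel.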
DataX scheduling process:
- First, the DataX Job module splits the job into a number of Tasks according to the database and table sharding. Then, based on the concurrency configured by the user, it calculates how many TaskGroups to allocate.
- The calculation:
number of TaskGroups = configured concurrency (channels) / concurrency per TaskGroup (5). Each TaskGroup then runs its Tasks with the concurrency allocated to it. For example, 100 Tasks with a configured concurrency of 20 need 20 / 5 = 4 TaskGroups, each running 25 Tasks.
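A minimal sketch of this calculation, assuming the number of TaskGroups is the configured concurrency divided by the per-TaskGroup concurrency of 5 stated earlier, rounded up. The function name plan_task_groups, the round-robin assignment, and the rule that no more channels are used than there are Tasks are assumptions of this sketch rather than anything stated in these notes.

```python
import math


def plan_task_groups(task_count, channel_count, channels_per_group=5):
    """Return a list of TaskGroups, each a list of task ids (illustrative only)."""
    if task_count <= 0 or channel_count <= 0 or channels_per_group <= 0:
        raise ValueError("all arguments must be positive")
    # Assumption: more channels than Tasks would sit idle, so cap them.
    channels = min(channel_count, task_count)
    group_count = math.ceil(channels / channels_per_group)
    groups = [[] for _ in range(group_count)]
    # Assumption: Tasks are spread over the groups round-robin.
    for task_id in range(task_count):
        groups[task_id % group_count].append(task_id)
    return groups


if __name__ == "__main__":
    groups = plan_task_groups(task_count=100, channel_count=20)
    print(len(groups), [len(g) for g in groups])  # prints: 4 [25, 25, 25, 25]
```

With 100 Tasks and a concurrency of 20, the sketch yields 4 TaskGroups of 25 Tasks each, and each TaskGroup works through its Tasks 5 at a time.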