
DataX study notes

2022-07-26 09:05:00 【It is not easy to live in vain】


1. Introduction

DataX is an offline synchronization tool for heterogeneous data sources that is widely used within Alibaba Group. It aims to provide stable, efficient data synchronization between heterogeneous data sources (that is, different databases and storage systems), including relational databases (MySQL, Oracle, etc.), HDFS, Hive, MaxCompute (formerly ODPS), HBase, and FTP.

1.1 Design concept


To solve the problem of synchronizing heterogeneous data sources, DataX turns the complex mesh of point-to-point synchronization links into a star-shaped data link, with DataX acting as the intermediate transport hub that connects the data sources. To add a new data source, you only need to connect it to DataX, and it can then synchronize seamlessly with every data source already supported.
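The benefit of the star layout can be checked with a little arithmetic. The sketch below is illustrative only: it assumes that every data source must be both readable and writable, that a mesh needs one dedicated link per ordered pair of sources, and that the star needs one Reader plugin and one Writer plugin per source. The function names are made up for this example.

```python
def mesh_links(n: int) -> int:
    """Dedicated source-to-target links needed to connect n data sources directly."""
    return n * (n - 1)


def star_plugins(n: int) -> int:
    """Plugins needed when each source only talks to the hub: one reader + one writer."""
    return 2 * n


# Compare the two layouts for a few sizes.
for n in (3, 5, 10):
    print(f"{n} sources: mesh={mesh_links(n)} links, star={star_plugins(n)} plugins")
```

Under these assumptions, adding one more source to a mesh of n sources costs 2n new links, while the star costs two new plugins regardless of n.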

1.2 Framework design

DataX is an offline data synchronization framework built on a Framework + plugin architecture. Reading from and writing to data sources are abstracted as Reader and Writer plugins that plug into the synchronization framework.

Reader: the data collection module. It reads data from the data source and sends it to the Framework.
Writer: the data writing module. It continuously takes data from the Framework and writes it to the destination.
Framework: connects the Reader and the Writer and serves as the data transmission channel between them. It handles the core technical concerns: buffering, flow control, concurrency, and data conversion.

The open source version of DataX 3.0 runs synchronization jobs in standalone multithreaded mode. For details, see: Alibaba Cloud open source offline synchronization tool DataX 3.0 introduction
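In practice, a Reader and a Writer are paired in a JSON job file. The sketch below builds such a file as a Python dictionary, using the streamreader and streamwriter plugins in the style of the DataX quick start. The helper name build_job and the parameter values are illustrative assumptions; check each plugin's own documentation for the options it accepts.

```python
import json


def build_job(channel: int = 1) -> dict:
    """Return a minimal job description: one reader, one writer, and a speed setting."""
    return {
        "job": {
            # Concurrency for the whole job (number of channels).
            "setting": {"speed": {"channel": channel}},
            "content": [
                {
                    # Reader plugin: generates constant records in memory.
                    "reader": {
                        "name": "streamreader",
                        "parameter": {
                            "column": [{"type": "string", "value": "hello DataX"}],
                            "sliceRecordCount": 10,
                        },
                    },
                    # Writer plugin: prints the records it receives.
                    "writer": {
                        "name": "streamwriter",
                        "parameter": {"print": True, "encoding": "UTF-8"},
                    },
                }
            ],
        }
    }


job_json = json.dumps(build_job(), indent=2)
print(job_json)
```

Swapping the data source or the destination means changing only the "reader" or the "writer" entry; the rest of the job description stays the same.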

1.3 Advantages

Reliable data quality monitoring (so that data reaches the destination intact)
Rich data conversion functions
Precise speed control: DataX 3.0 provides three flow control modes, by channel (concurrency), by record stream, and by byte stream. You can tune the job speed freely so that the job reaches the best synchronization speed the database can sustain.
Strong synchronization performance: each Reader plugin has one or more splitting strategies that divide a job into multiple Tasks for parallel execution. With the standalone multithreaded execution model, DataX throughput can scale linearly with concurrency.
Robust fault tolerance (multi-level local and global retry)
Minimalist user experience: ready to use after download, with detailed log output.
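Record-stream and byte-stream limits rest on the same idea: if more units (records or bytes) have been sent than the limit allows for the elapsed time, the sender pauses. The function below is a minimal sketch of that idea only. It is not DataX's actual implementation, and the name throttle_delay and the "non-positive limit means unlimited" rule are assumptions made for this example.

```python
def throttle_delay(units_sent: int, elapsed_s: float, limit_per_s: float) -> float:
    """Seconds to pause so that the average rate stays at or below limit_per_s.

    units_sent may count records or bytes; the arithmetic is the same.
    """
    if limit_per_s <= 0:
        return 0.0  # assumption: a non-positive limit means "no limit"
    # Minimum time that sending units_sent units is allowed to take.
    min_elapsed = units_sent / limit_per_s
    return max(0.0, min_elapsed - elapsed_s)


# 1000 records sent in 0.5 s against a limit of 1000 records/s: pause for the rest of the second.
print(throttle_delay(1000, 0.5, 1000))
# The same 1000 records over 2 s are already under the limit: no pause needed.
print(throttle_delay(1000, 2.0, 1000))
```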

1.4 System requirements

Linux
JDK (1.8 or later; 1.8 recommended)
Python (Python 2.6.x recommended)
Apache Maven 3.x (to compile DataX)

1.5 Build

Official steps: https://github.com/alibaba/DataX/blob/master/userGuid.md

2. Related concepts

Heterogeneous data sources

Heterogeneous data sources are data held in different database management systems. As an enterprise builds out its information systems, the data management systems of individual business systems are built and deployed in phases, and technical, economic, and human factors also play a part. As a result, the enterprise accumulates a large amount of business data stored in different ways and managed by different systems, ranging from simple file databases to complex network databases. Together these make up the enterprise's heterogeneous data sources.

The heterogeneity of enterprise data sources shows up in three main ways:

System heterogeneity: differences between the business application systems, database management systems, and even operating systems that the data sources depend on.
Schema heterogeneity: the data sources differ in storage model. Storage models include the relational model, the object model, the object-relational model, and nested documents, with the relational model (relational databases) being the mainstream. Even under the same storage model, the schema structures may differ. For example, data types are not fully consistent across relational database management systems such as DB2, Oracle, Sybase, Informix, SQL Server, and FoxPro.
Source heterogeneity: the differences between internal data sources and external data sources.

3. DataX 3.0 core architecture

A single data synchronization job in DataX is called a Job. When DataX receives a Job, it starts a process to carry out the whole synchronization. The DataX Job module is the central management node of a single job and is responsible for data cleanup, subtask splitting, TaskGroup management, and other functions.


After the DataX Job starts, it splits the Job into multiple smaller Tasks (subtasks) according to the splitting strategy of the source, so that they can run concurrently.
Next, the DataX Job calls the Scheduler module, which regroups the split Tasks into TaskGroups (task groups) according to the configured concurrency.
Every Task is started by its TaskGroup. Once started, a Task always launches a fixed Reader --> Channel --> Writer thread pipeline to do the synchronization.
While the job runs, the Job monitors the TaskGroups and waits for all of them to finish. When they have all completed, the Job exits successfully; if something goes wrong, it exits abnormally with a non-zero exit code.
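The Reader --> Channel --> Writer pipeline inside a Task can be pictured as two threads joined by a bounded queue. The sketch below illustrates that pattern in Python; it is not DataX code (DataX itself is written in Java), and the names run_task, reader, and writer as well as the capacity value are assumptions made for this example.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the record stream


def reader(source, channel):
    """Read records from the source and push them into the channel."""
    for record in source:
        channel.put(record)  # blocks while the channel is full (back-pressure)
    channel.put(_DONE)


def writer(channel, sink):
    """Pull records from the channel and write them to the destination."""
    while True:
        record = channel.get()
        if record is _DONE:
            break
        sink.append(record)


def run_task(source, capacity=4):
    """Run one reader thread and one writer thread joined by a bounded channel."""
    channel = queue.Queue(maxsize=capacity)
    sink = []
    reader_thread = threading.Thread(target=reader, args=(source, channel))
    writer_thread = threading.Thread(target=writer, args=(channel, sink))
    reader_thread.start()
    writer_thread.start()
    reader_thread.join()
    writer_thread.join()
    return sink


print(run_task(range(10)))
```

The bounded queue is the point of the design: when the writer is slower than the reader, the reader blocks instead of buffering the whole data set in memory.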

Components:

Job: the management node of a single job, responsible for data cleanup, subtask splitting, and TaskGroup monitoring and management.
Task: produced by splitting a Job. It is the smallest unit of work in DataX, and each Task synchronizes a portion of the data.
Scheduler: groups Tasks into TaskGroups. The concurrency of a single TaskGroup is 5.
TaskGroup: responsible for starting Tasks.

DataX scheduling process:

First, the DataX Job module splits the job into a number of Tasks, for example one per shard when the source uses database and table sharding. It then uses the user-configured concurrency to calculate how many TaskGroups to allocate.
The calculation: number of TaskGroups = configured concurrency (channels) / concurrency per TaskGroup. Each TaskGroup then runs its share of the Tasks with its allocated concurrency.
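A worked example may help. It assumes that the number of TaskGroups is the configured concurrency divided by the per-TaskGroup concurrency of 5 stated above, rounded up, and that the concurrency is capped at the number of Tasks. The round-robin assignment of Tasks to groups and the name plan_task_groups are assumptions of this sketch; the real assignment strategy may differ.

```python
import math


def plan_task_groups(task_count: int, channels: int, channels_per_group: int = 5):
    """Return a list of TaskGroups, each a list of task ids assigned to that group."""
    if channels < 1 or channels_per_group < 1:
        raise ValueError("channels and channels_per_group must be at least 1")
    # More channels than tasks would sit idle, so cap the concurrency (assumption).
    needed = min(channels, task_count)
    group_count = math.ceil(needed / channels_per_group)
    groups = [[] for _ in range(group_count)]
    # Spread the tasks over the groups round-robin (assumption).
    for task_id in range(task_count):
        groups[task_id % group_count].append(task_id)
    return groups


# 100 Tasks (for example, 100 sharded tables) with a configured concurrency of 20.
groups = plan_task_groups(task_count=100, channels=20)
print(len(groups), [len(g) for g in groups])
```

With 100 Tasks and a concurrency of 20, this yields 20 / 5 = 4 TaskGroups of 25 Tasks each, and each group works through its 25 Tasks five at a time.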
