当前位置:网站首页>Visual task scheduling & drag and drop | scalph data integration based on Apache seatunnel

Visual task scheduling & drag and drop | scalph data integration based on Apache seatunnel

2022-07-04 22:05:00 InfoQ



This time  6 month  Meetup  What I bring to you is Scaleph  be based on  Apache SeaTunnel (Incubating)  Introduction to data integration , Hope you found something .

This speech mainly includes five parts :

  • About Scaleph
  • Scaleph framework & Function introduction
  • SeaTunnel Community contribution
  • System demonstration
  • Development plan



null
Wang Qi
Apache SeaTunnel Contributor
Search for recommended engineers , big data  Java  Development

01 Scaleph The origin of


I was first engaged in search recommendation , Be responsible for maintenance in the team Dump System , Mainly for our search engine
Provide the function of feeding data
, First of all, let me introduce the main problems in the maintenance process 5 Pain points :

null

Timeliness and stability


Search recommendation is the core online system of e-commerce platform , In particular, the timeliness and stability of data are highly required . Because search recommendations will receive
The whole e-commerce platform  C  End
Most of the traffic , So once the service fluctuates , It may cause service damage , The user experience is greatly reduced .

Business complex / Wide table design


Dump The system will integrate the e-commerce platform
goods 、 category 、 brand 、 The store 、 Product label 、 Real time data warehouse / Offline data and model data
It will go through a series of pretreatment , The final output is a wide table , In the process , The complexity and variability of business , Will invade Dump From the system , Therefore, the technical challenges to deal with are relatively higher .

Total quantity + Real time index


Run the full index once a day , The main purpose is to update
 T+1  frequency
Updated data . When the full index is finished , We will refresh the data that needs to be updated in real time through real-time index , For example, the price of goods 、 Information about inventory changes .

Data linkage update


We have many upstream data sources , Yes
Message queue 、 database 、 Big data related storage and  dubbo  Interface
, Because it is a wide table design , Take the commodity index as an example , The wide watch will be mainly commodity , If it is a store index , Mainly stores , Depending on the data , Upstream data changes are not necessarily product or store dimensions , Data will also produce certain linkage updates .

The bottom line of the data


Search recommendation service was also responsible for C Most of the traffic at the end , When the performance of other teams in the company cannot keep up , They usually pass the data Dump The system sends it to the search engine , Then our team returned to Web page , Avoid subsequent secondary request calls to them .

meanwhile , If other teams' business systems generate dirty data , Also needed Dump The system does data protection , Prevent data leakage to C End users have a bad impact , So when developing and maintaining , It's also very difficult .

02  Why introduce Flink?


As a domestic  Flink  Early users of , Alibaba has a long history and successful experience in the field of search recommendation , Develop and maintain in the search recommendation team  Dump  The professional experience of the system prompted me to pay attention to using Flink do
A/B Experimental report 、 Real time data flow
Other related work , It mainly uses Flink To achieve Dump The system provides for search Dump Platform work , Use Flink Doing data integration has 5 advantage :

null
  • Natural distributed support :Flink Support multiple deployment and operation modes , stand-alone 、yarn、Kubernetes;
  • Low latency 、 Massive throughput : It is widely used in many large factories ;
  • Ecological support :Flink There are many out of the box connector, Support csv、avro data format ,kafka、pulsar And many storage systems , Closely integrate with the big data ecosystem ;
  • Based on distributed lightweight asynchronous snapshot mechanism exactly-once semantics , For the failure of the task 、 restart 、 transfer 、 Upgrade, etc. to provide data consistency guarantee ;
  • metrics.Flink In addition to the  metrics  Outside ,metrics  The framework allows users to develop customized  metrics, Enrich monitoring indicators ;


03  Why choose SeaTunnel?


Later contact  SeaTunnel  When , Love it  SeaTunnel  Design concept of !
SeaTunnel  Is running on the  Flink  and Spark  above , Next generation integration framework for high-performance and distributed massive data .

The important thing is that it is out of the box , And it can realize seamless integration for the existing ecosystem , Because it runs on  Flink  and  Spark  above , It can easily access the existing  Flink  and  Spark  Infrastructure of .
On the other hand  SeaTunnel  There are also many production cases
, When entering  Apache  After the foundation incubates , The community is very active , Future period .
null

04  About Scaleph


The starting point of the project


Our initial idea was to  SeaTunnel  Provide  Web  page , Be able to build an open source system for data integration . At present, our main goal is  SeaTunnel 
Make an open source visual data development and management system
, Expect later  Scaleph  It can minimize the development threshold of real-time and offline data tasks , Provide developers with a one-stop data development platform .

Project highlights


In real production applications , During data integration , Arrange or  SQL  Development is the main form of data integration , We think
 Drag and Drop  Visual task choreography
It can minimize the burden of data integration for users ;
The other is to realize multi version management of jobs , Data source support ;
  • Flink The cluster supports multiple versions / Multiple deployment environments ;
  • real time / Periodic tasks are also supported .

null
The above is the architecture diagram of our system , Users mainly use  Web UI, Encapsulated by the job management function  SeaTunnel  operator , Users drag and drop configuration on the page , Automatic system generation SeaTunnel Configuration file for , Finally, through the resources uploaded by users in resource management  
jar
  The bags pass together  Flinkful  Library submitted to  Flink  In the cluster . Resources for resource management  
jar
  The purpose of the package is to support users to upload their own R & D related  
jar
  package , Make up SeaTunnel  Related defects , Or right SeaTunnel and Flink Enhance its functions !

We use it  quartz  Developed a scheduling task , When the task is submitted to  Flink  after , The task will go regularly  Flink  The cluster pulls the task information , Store in  MySQL  Inside , End users are  Web UI  You can see the task related operation information on the page .

Scaleph Function introduction ( Data development )


null

01  project management


Mainly when users create data synchronization tasks , Be able to carry out relevant management work according to different business dimensions .

02  Job management


By dragging and dropping, you can create SeaTunnel Data tasks , Then submit and run accordingly .

03  Resource management

SeaTunnel  In order to  Apache2.0  Open source certificates are open source , And  MySQL  Of  JDBC  The open source protocol of the driver package is incompatible ,SeaTunnel  Of  jdbc connector  Do not provide relevant  JDBC  driving-dependent . When users use  jdbc connector  when ,
It needs to be provided by itself  JDBC  Drive pack .
Here we provide the function of resource management , Users can upload the driver package by themselves , Then take it.  SeaTunnel  The tasks and  MySQL  Drivers are submitted to the cluster together to ensure the normal operation of tasks .

04  Cluster management


Mainly to provide Flink Input of cluster information , At present, it can support Standalone Session  Cluster entry , After the user enters , Submit SeaTunnel Cluster can be selected when working , The task will run in the cluster .

05  Data source management


Support users to enter some data source information in advance , In this way, it is not necessary to input the data source information for each task . meanwhile , You can also share data sources and restrict permissions , Prevent the disclosure of data source information in plaintext .

Scaleph Function introduction ( Operation and maintenance center )


The operation and maintenance center is a running log of real-time tasks and periodic tasks , Users see task related information when submitting tasks , We also provide
Link jump operation
, Users can click to jump to Flink Of Web UI The above to , adopt Flink Official Web UI page , You can see the specific execution information of the task .

null

Scaleph Function introduction ( Data standards )

01  Data element


Data governance is a big system , Everyone is more concerned
Metadata 、 Data consanguinity 、 Data assets
, But data standards are also an important part of data governance , We open source the standard system used internally by the company , Share the relevant knowledge of data standards .

null
In the development process of many data warehouses , Because it is multi person collaboration , Fields with the same meaning , In different model tables , Development will define different fields to express the same meaning and business . The data standard hopes to pass the data element , To unify the model field definitions of data warehouse developers .

02  Reference data


The data in the data warehouse is pulled from the business system through data integration tools , Inevitably, fields with the same meaning will have different definitions in different business systems , These fields with the same meaning and different definitions need to be maintained by warehouse personnel , And the maintenance process is mainly offline documents , Possible
Maintenance is out of date
The situation of .

At the same time, there will also be the problem that business knowledge cannot be directly mapped to data warehouse model information , Data standards allow users to  Web  Maintain these business knowledge in the page .

null
The above figure is a specific case . Here are two business systems defined , One is the system A, One is the system B, They have different gender enumeration values , meanwhile A/B The enumeration description of the system is also different , What to do with that ?

This is the time , We can set a set of unified standards through data warehouse developers , For example, the coding system must be
0,1,2,
The corresponding description is also well defined , Map through a reference data in the middle , Users can easily see .

03  Follow up ideas


Whether it can be in the process of data integration , Directly through data standards Transform  operation , Realize automatic maintenance and mapping of knowledge and models .

04 Scaleph Feature highlights


Visual development of data .
We believe that in the field of data synchronization , Visual drag and drop , It can help users quickly create data integration tasks , The user drags out two operators , Fill in the corresponding parameters to create a data integration task .

null
Flinkful It's for  Flink  One developed Java client .

null
Flink  As a popular computing engine , There are many ways for users to use , for instance
Command line interface 、HTTP  Interface
etc. , Through the command line interface, users can submit tasks 、 Create and cancel tasks ;HTTP  The interface is mainly used for  Web UI  Interface .

It's docking  Flink  The process of , We found that  Flink  As a run on  JVM  An application on the same runs on  JVM  Above  Scaleph  application , The integration of the two should pass  shell  Script , It's unreasonable . So we developed  Flinkful, open  Flink  stay  Java  Ecological openness , Let users through  Flinkful, It can be directly to  Flink  Cluster and task management .

We think  Flinkful  Yes  Flink  Infrastructure maintenance personnel are more meaningful , So from  Scaleph  Stripped out of the warehouse , Open source alone .

Plug-in system
. We want to define plug-ins , Provide system expansion interface , Users and  Scaleph  Developers can quickly enhance through these interfaces  Scaleph  Functions and features of . At present, we have defined two plug-ins , They are data source plug-ins and  SeaTunnel  plug-in unit , Through the data source plug-in, you can quickly expand  JDBC、ES、Kafka、Clinkhouse Data sources like that , Gather these data sources into  Scaleph  The system is uniformly configured and used .

null
at present  SeaTunnel  There are a lot of  connector  and  transform  plug-in unit , If you develop pages one by one , It's a time-consuming thing ,
We just want to use a simple 、 Declarative way , hold  SeaTunnel  Relevant parameters are defined
, Can quickly put  SeaTunnel  Relevant capabilities are completely transferred to  Scaleph  Project up .

Problem analysis


Flink-jdbc-connector  Functional enhancement


SeaTunnel  Many cases in official documents , It's all about  FakeSource and  ConsoleSink  Realized , And we are developing with  jdbc-connector  Dominant . In the integration process , We found that  flink-jdbc-connector  The plug-in  JdbcSink  Only support  Stream  mode , Then we realized it  Batch  Pattern .

JdbcSource  You need to provide  sql,
The program gets  sql  The column of 、 Table information , To generate  JdbcSource  Of  RowTypeInfo.
But the definition is complex  sql  Alias will appear when 、 Sub query and so on , Regular expressions are difficult to cover all scenarios . We use  Jdbc  Of  Connection  Get  sql  Of  ResultSet, from  ResultSet  Direct access to  sql  Column information for , To generate  JdbcSource  Of  RowTypeInfo.

 Seatunnel-core-flink.jar  Slimming


SeaTunnel  Is running on the  Flink  and  Spark  above , The two will be divided into two  jar  package ,seatunnel-core-flink.jar  Namely  Flink  Corresponding implementation . stay 2.1.1 In the version ,Seatunnel  Based on  flink  Realized  connector  All hit this  fat jar  In bag .

When you really use it , Data synchronization task , Only one of them may be used  1-2  Kind of  connector.Seatunnel  There will be a certain amount of additional network overhead when the task is submitted .

We want to achieve this effect :
There is a comparison  thin  Of  core jar package , Then add related  connector  Of  jar  package .
At the time of submission , With core-jar Package based , Plus related  connector  Of  jar  package . At the same time, the resources mentioned above  jar  Package upload , Such as  SeaTunnel  Of  jdbc-connector  lack of  JDBC  Drive pack , Carry resources  jar  Bao He  connector jar  The task submission of packages is handled in the same way .

Later, the community carried out  connector  When you split it up , We are also actively involved  issue  I shared relevant experience , When  Seatunnel 2.1.2  When it was released , Our system is also easily adapted  seatunnel-core-flink.jar  and  connector jar  Separate release form . At the same time, the user is not  Flink  The cluster is prepared in advance  JDBC  In the case of driving , You can also use the function of resource management , Upload driver package , Submitting  SeaTunnel  When the task , Submit with the driver package .

Flink jobId  Get questions

Flink  The core way of task submission is in the form of command line interface , Therefore, users need to pass  shell  Script to submit  Flink  Mission .Flink  After the task is submitted ,
The command-line client will send the corresponding task  id  Output to console log
, Users need to capture the log output to the console , Extract tasks from them  id.

Because our project and  Flink  All interactions are through  Flinkful  Library implementation ,Flinkful  You can put such a  jobId 
Send it back directly as the return value of the interface call
. So our implementation is compared with capture console log extraction  jobId  It's more elegant .

SeaTunnel  call  System.exit()  problem


SeaTunnel  When the task is carried out , First, the configuration file written by the user will be checked , If the check fails , Will call directly  System.exit(), And then this time  JVM  So I quit .SeaTunnel  Its submission method is  shell  Foot implementation , therefore  JVM  There is no problem exiting .
But when  Scaleph  System , When integrating it into our application , When calling this method , It will cause us  Scaleph  Such an application will directly hang up , One of our services is unavailable . therefore , We also submit this piece of code for the task , adopt  
SecurityManager
, Added a related permission limit , Then stipulate  SeaTunnel  Relevant task submission procedures , Ban called  System.exit()  Method .

05 SeaTunnel  Community contribution


Develop with me  Scaleph  A friend , Here are some of our submissions  pr, Like the above  jdbc-connector  Enhancements to . And that is  jdbc-connector  Of  upsert  Function realization .flink-jdbc-connector  Of  JdbcSink  A big drawback of is
Only support  insert  function , It can't be done  update, This will considerably limit this  connector  The function of .
We also developed  upsert  Semantic support , Support repeated synchronization of data .

01  System demonstration


This project can be carried out if there is enough time  Docker  The environment and  IDE  Environment demonstration , Here time is limited, so choose  Docker  Environment for everyone to demonstrate , Demo video ( Direct jump 23'18s):


02  Follow up development plan


At present, we will still put  SeaTunnel  dependent  connector  and  transform  plug-in unit , Move to our visual drag and drop page , It can make users feel completely  SeaTunnel  A powerful . The other is to follow  SeaTunnel-connector  Rich plug-ins , Also want to  connector  The corresponding data source types are enriched .
We also hope to do something for data development and data integration  DAG  Related scheduling , At the same time, we also hope to support in data development  SQL  Task development .

Apache SeaTunnel

null
//   Keep in contact  //
Little helper  : Seatunnel1  remarks infoq
Come on , Grow with the community !

Apache SeaTunnel(Incubating)
  It's a distribution 、 High performance 、 Easy to expand 、 For massive data ( offline & real time ) Data integration platform for synchronization and transformation .

Warehouse address : 
https://github.com/apache/incubator-seatunnel

website :
https://seatunnel.apache.org/

Proposal:
https://cwiki.apache.org/confluence/display/INCUBATOR/SeaTunnelProposal

Apache SeaTunnel(Incubating) 2.1.0  Download address :
https://seatunnel.apache.org/download
 
We sincerely welcome more people to join !

Can enter  Apache  The incubator ,SeaTunnel( primary  Waterdrop)  The new journey has just begun , But the development of the community needs more people to join . We believe that , stay
「Community Over Code」
( Community greater than code )、
「Open and Cooperation」
( Open collaboration )、
「Meritocracy」
( Elite management )、 as well as 「
Diversity and consensus decision making 」
etc.  The Apache Way  Under the guidance of , We will usher in a more diversified and inclusive community ecology , Technological progress brought by the spirit of co construction and open source !

We sincerely invite all partners who are interested in making local open source based on the world to join  SeaTunnel  Contributor family , Build open source together !

Submit questions and suggestions :
https://github.com/apache/incubator-seatunnel/issues

Contribution code :
https://github.com/apache/incubator-seatunnel/pulls

Subscribe to community development mailing lists  : 
[email protected]

Develop mailing lists :
[email protected]

Join in  Slack:
https://join.slack.com/t/apacheseatunnel/shared_invite/zt-123jmewxe-RjB_DW3M3gV~xL91pZ0oVQ

Focus on  Twitter: 
https://twitter.com/ASFSeaTunnel

原网站

版权声明
本文为[InfoQ]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/185/202207042132513540.html