当前位置：网站首页>Visual task scheduling & drag and drop | scalph data integration based on Apache seatunnel

Visual task scheduling & drag and drop | scalph data integration based on Apache seatunnel

2022-07-04 22:05:00 【InfoQ】

This time 6 month Meetup What I bring to you is Scaleph be based on Apache SeaTunnel (Incubating) Introduction to data integration , Hope you found something .

This speech mainly includes five parts ：

About Scaleph
Scaleph framework & Function introduction
SeaTunnel Community contribution
System demonstration
Development plan

null

Wang Qi

Apache SeaTunnel Contributor

Search for recommended engineers , big data Java Development

01 Scaleph The origin of

I was first engaged in search recommendation , Be responsible for maintenance in the team Dump System , Mainly for our search engine

Provide the function of feeding data

, First of all, let me introduce the main problems in the maintenance process 5 Pain points ：

Timeliness and stability

Search recommendation is the core online system of e-commerce platform , In particular, the timeliness and stability of data are highly required . Because search recommendations will receive

The whole e-commerce platform C End

Most of the traffic , So once the service fluctuates , It may cause service damage , The user experience is greatly reduced .

Business complex / Wide table design

Dump The system will integrate the e-commerce platform

goods 、 category 、 brand 、 The store 、 Product label 、 Real time data warehouse / Offline data and model data

It will go through a series of pretreatment , The final output is a wide table , In the process , The complexity and variability of business , Will invade Dump From the system , Therefore, the technical challenges to deal with are relatively higher .

Total quantity + Real time index

Run the full index once a day , The main purpose is to update

T+1 frequency

Updated data . When the full index is finished , We will refresh the data that needs to be updated in real time through real-time index , For example, the price of goods 、 Information about inventory changes .

Data linkage update

We have many upstream data sources , Yes

Message queue 、 database 、 Big data related storage and dubbo Interface

, Because it is a wide table design , Take the commodity index as an example , The wide watch will be mainly commodity , If it is a store index , Mainly stores , Depending on the data , Upstream data changes are not necessarily product or store dimensions , Data will also produce certain linkage updates .

The bottom line of the data

Search recommendation service was also responsible for C Most of the traffic at the end , When the performance of other teams in the company cannot keep up , They usually pass the data Dump The system sends it to the search engine , Then our team returned to Web page , Avoid subsequent secondary request calls to them .

meanwhile , If other teams' business systems generate dirty data , Also needed Dump The system does data protection , Prevent data leakage to C End users have a bad impact , So when developing and maintaining , It's also very difficult .

02 Why introduce Flink？

As a domestic Flink Early users of , Alibaba has a long history and successful experience in the field of search recommendation , Develop and maintain in the search recommendation team Dump The professional experience of the system prompted me to pay attention to using Flink do

A/B Experimental report 、 Real time data flow

Other related work , It mainly uses Flink To achieve Dump The system provides for search Dump Platform work , Use Flink Doing data integration has 5 advantage ：

Natural distributed support ：Flink Support multiple deployment and operation modes , stand-alone 、yarn、Kubernetes;
Low latency 、 Massive throughput ： It is widely used in many large factories ;
Ecological support ：Flink There are many out of the box connector, Support csv、avro data format ,kafka、pulsar And many storage systems , Closely integrate with the big data ecosystem ;
Based on distributed lightweight asynchronous snapshot mechanism exactly-once semantics , For the failure of the task 、 restart 、 transfer 、 Upgrade, etc. to provide data consistency guarantee ;
metrics.Flink In addition to the metrics Outside ,metrics The framework allows users to develop customized metrics, Enrich monitoring indicators ;

03 Why choose SeaTunnel？

Later contact SeaTunnel When , Love it SeaTunnel Design concept of ！

SeaTunnel Is running on the Flink and Spark above , Next generation integration framework for high-performance and distributed massive data .

The important thing is that it is out of the box , And it can realize seamless integration for the existing ecosystem , Because it runs on Flink and Spark above , It can easily access the existing Flink and Spark Infrastructure of .

On the other hand SeaTunnel There are also many production cases

, When entering Apache After the foundation incubates , The community is very active , Future period .

04 About Scaleph

The starting point of the project

Our initial idea was to SeaTunnel Provide Web page , Be able to build an open source system for data integration . At present, our main goal is SeaTunnel

Make an open source visual data development and management system

, Expect later Scaleph It can minimize the development threshold of real-time and offline data tasks , Provide developers with a one-stop data development platform .

Project highlights

In real production applications , During data integration , Arrange or SQL Development is the main form of data integration , We think

Drag and Drop Visual task choreography

It can minimize the burden of data integration for users ;

The other is to realize multi version management of jobs , Data source support ;

Flink The cluster supports multiple versions / Multiple deployment environments ;

real time / Periodic tasks are also supported .

The above is the architecture diagram of our system , Users mainly use Web UI, Encapsulated by the job management function SeaTunnel operator , Users drag and drop configuration on the page , Automatic system generation SeaTunnel Configuration file for , Finally, through the resources uploaded by users in resource management

jar

The bags pass together Flinkful Library submitted to Flink In the cluster . Resources for resource management

jar

The purpose of the package is to support users to upload their own R & D related

jar

package , Make up SeaTunnel Related defects , Or right SeaTunnel and Flink Enhance its functions ！

We use it quartz Developed a scheduling task , When the task is submitted to Flink after , The task will go regularly Flink The cluster pulls the task information , Store in MySQL Inside , End users are Web UI You can see the task related operation information on the page .

Scaleph Function introduction （ Data development ）

01 project management

Mainly when users create data synchronization tasks , Be able to carry out relevant management work according to different business dimensions .

02 Job management

By dragging and dropping, you can create SeaTunnel Data tasks , Then submit and run accordingly .

03 Resource management

SeaTunnel In order to Apache2.0 Open source certificates are open source , And MySQL Of JDBC The open source protocol of the driver package is incompatible ,SeaTunnel Of jdbc connector Do not provide relevant JDBC driving-dependent . When users use jdbc connector when ,

It needs to be provided by itself JDBC Drive pack .

Here we provide the function of resource management , Users can upload the driver package by themselves , Then take it. SeaTunnel The tasks and MySQL Drivers are submitted to the cluster together to ensure the normal operation of tasks .

04 Cluster management

Mainly to provide Flink Input of cluster information , At present, it can support Standalone Session Cluster entry , After the user enters , Submit SeaTunnel Cluster can be selected when working , The task will run in the cluster .

05 Data source management

Support users to enter some data source information in advance , In this way, it is not necessary to input the data source information for each task . meanwhile , You can also share data sources and restrict permissions , Prevent the disclosure of data source information in plaintext .

Scaleph Function introduction （ Operation and maintenance center ）

The operation and maintenance center is a running log of real-time tasks and periodic tasks , Users see task related information when submitting tasks , We also provide

Link jump operation

, Users can click to jump to Flink Of Web UI The above to , adopt Flink Official Web UI page , You can see the specific execution information of the task .

Scaleph Function introduction （ Data standards ）

01 Data element

Data governance is a big system , Everyone is more concerned

Metadata 、 Data consanguinity 、 Data assets

, But data standards are also an important part of data governance , We open source the standard system used internally by the company , Share the relevant knowledge of data standards .

In the development process of many data warehouses , Because it is multi person collaboration , Fields with the same meaning , In different model tables , Development will define different fields to express the same meaning and business . The data standard hopes to pass the data element , To unify the model field definitions of data warehouse developers .

02 Reference data

The data in the data warehouse is pulled from the business system through data integration tools , Inevitably, fields with the same meaning will have different definitions in different business systems , These fields with the same meaning and different definitions need to be maintained by warehouse personnel , And the maintenance process is mainly offline documents , Possible

Maintenance is out of date

The situation of .

At the same time, there will also be the problem that business knowledge cannot be directly mapped to data warehouse model information , Data standards allow users to Web Maintain these business knowledge in the page .

The above figure is a specific case . Here are two business systems defined , One is the system A, One is the system B, They have different gender enumeration values , meanwhile A/B The enumeration description of the system is also different , What to do with that ？

This is the time , We can set a set of unified standards through data warehouse developers , For example, the coding system must be

0,1,2,

The corresponding description is also well defined , Map through a reference data in the middle , Users can easily see .

03 Follow up ideas

Whether it can be in the process of data integration , Directly through data standards Transform operation , Realize automatic maintenance and mapping of knowledge and models .

04 Scaleph Feature highlights

Visual development of data .

We believe that in the field of data synchronization , Visual drag and drop , It can help users quickly create data integration tasks , The user drags out two operators , Fill in the corresponding parameters to create a data integration task .

Flinkful It's for Flink One developed Java client .

Flink As a popular computing engine , There are many ways for users to use , for instance

Command line interface 、HTTP Interface

etc. , Through the command line interface, users can submit tasks 、 Create and cancel tasks ;HTTP The interface is mainly used for Web UI Interface .

It's docking Flink The process of , We found that Flink As a run on JVM An application on the same runs on JVM Above Scaleph application , The integration of the two should pass shell Script , It's unreasonable . So we developed Flinkful, open Flink stay Java Ecological openness , Let users through Flinkful, It can be directly to Flink Cluster and task management .

We think Flinkful Yes Flink Infrastructure maintenance personnel are more meaningful , So from Scaleph Stripped out of the warehouse , Open source alone .

Plug-in system

. We want to define plug-ins , Provide system expansion interface , Users and Scaleph Developers can quickly enhance through these interfaces Scaleph Functions and features of . At present, we have defined two plug-ins , They are data source plug-ins and SeaTunnel plug-in unit , Through the data source plug-in, you can quickly expand JDBC、ES、Kafka、Clinkhouse Data sources like that , Gather these data sources into Scaleph The system is uniformly configured and used .

at present SeaTunnel There are a lot of connector and transform plug-in unit , If you develop pages one by one , It's a time-consuming thing ,

We just want to use a simple 、 Declarative way , hold SeaTunnel Relevant parameters are defined

, Can quickly put SeaTunnel Relevant capabilities are completely transferred to Scaleph Project up .

Problem analysis

Flink-jdbc-connector Functional enhancement

SeaTunnel Many cases in official documents , It's all about FakeSource and ConsoleSink Realized , And we are developing with jdbc-connector Dominant . In the integration process , We found that flink-jdbc-connector The plug-in JdbcSink Only support Stream mode , Then we realized it Batch Pattern .

JdbcSource You need to provide sql,

The program gets sql The column of 、 Table information , To generate JdbcSource Of RowTypeInfo.

But the definition is complex sql Alias will appear when 、 Sub query and so on , Regular expressions are difficult to cover all scenarios . We use Jdbc Of Connection Get sql Of ResultSet, from ResultSet Direct access to sql Column information for , To generate JdbcSource Of RowTypeInfo.

 Seatunnel-core-flink.jar Slimming

SeaTunnel Is running on the Flink and Spark above , The two will be divided into two jar package ,seatunnel-core-flink.jar Namely Flink Corresponding implementation . stay 2.1.1 In the version ,Seatunnel Based on flink Realized connector All hit this fat jar In bag .

When you really use it , Data synchronization task , Only one of them may be used 1-2 Kind of connector.Seatunnel There will be a certain amount of additional network overhead when the task is submitted .

We want to achieve this effect ：

There is a comparison thin Of core jar package , Then add related connector Of jar package .

At the time of submission , With core-jar Package based , Plus related connector Of jar package . At the same time, the resources mentioned above jar Package upload , Such as SeaTunnel Of jdbc-connector lack of JDBC Drive pack , Carry resources jar Bao He connector jar The task submission of packages is handled in the same way .

Later, the community carried out connector When you split it up , We are also actively involved issue I shared relevant experience , When Seatunnel 2.1.2 When it was released , Our system is also easily adapted seatunnel-core-flink.jar and connector jar Separate release form . At the same time, the user is not Flink The cluster is prepared in advance JDBC In the case of driving , You can also use the function of resource management , Upload driver package , Submitting SeaTunnel When the task , Submit with the driver package .

Flink jobId Get questions

Flink The core way of task submission is in the form of command line interface , Therefore, users need to pass shell Script to submit Flink Mission .Flink After the task is submitted ,

The command-line client will send the corresponding task id Output to console log

, Users need to capture the log output to the console , Extract tasks from them id.

Because our project and Flink All interactions are through Flinkful Library implementation ,Flinkful You can put such a jobId

Send it back directly as the return value of the interface call

. So our implementation is compared with capture console log extraction jobId It's more elegant .

SeaTunnel call System.exit() problem

SeaTunnel When the task is carried out , First, the configuration file written by the user will be checked , If the check fails , Will call directly System.exit(), And then this time JVM So I quit .SeaTunnel Its submission method is shell Foot implementation , therefore JVM There is no problem exiting .

But when Scaleph System , When integrating it into our application , When calling this method , It will cause us Scaleph Such an application will directly hang up , One of our services is unavailable . therefore , We also submit this piece of code for the task , adopt

SecurityManager

, Added a related permission limit , Then stipulate SeaTunnel Relevant task submission procedures , Ban called System.exit() Method .

05 SeaTunnel Community contribution

Develop with me Scaleph A friend , Here are some of our submissions pr, Like the above jdbc-connector Enhancements to . And that is jdbc-connector Of upsert Function realization .flink-jdbc-connector Of JdbcSink A big drawback of is

Only support insert function , It can't be done update, This will considerably limit this connector The function of .

We also developed upsert Semantic support , Support repeated synchronization of data .

01 System demonstration

This project can be carried out if there is enough time Docker The environment and IDE Environment demonstration , Here time is limited, so choose Docker Environment for everyone to demonstrate , Demo video （ Direct jump 23'18s）：

02 Follow up development plan

At present, we will still put SeaTunnel dependent connector and transform plug-in unit , Move to our visual drag and drop page , It can make users feel completely SeaTunnel A powerful . The other is to follow SeaTunnel-connector Rich plug-ins , Also want to connector The corresponding data source types are enriched .

We also hope to do something for data development and data integration DAG Related scheduling , At the same time, we also hope to support in data development SQL Task development .

Apache SeaTunnel

//  Keep in contact  //

Little helper : Seatunnel1 remarks infoq

Come on , Grow with the community ！

Apache SeaTunnel(Incubating)

It's a distribution 、 High performance 、 Easy to expand 、 For massive data （ offline & real time ） Data integration platform for synchronization and transformation .

Warehouse address ： 

https://github.com/apache/incubator-seatunnel

website ：

https://seatunnel.apache.org/

Proposal：

https://cwiki.apache.org/confluence/display/INCUBATOR/SeaTunnelProposal

Apache SeaTunnel(Incubating) 2.1.0 Download address ：

https://seatunnel.apache.org/download

We sincerely welcome more people to join ！

Can enter Apache The incubator ,SeaTunnel( primary Waterdrop) The new journey has just begun , But the development of the community needs more people to join . We believe that , stay

「Community Over Code」

（ Community greater than code ）、

「Open and Cooperation」

（ Open collaboration ）、

「Meritocracy」

（ Elite management ）、 as well as 「

Diversity and consensus decision making 」

etc. The Apache Way Under the guidance of , We will usher in a more diversified and inclusive community ecology , Technological progress brought by the spirit of co construction and open source ！

We sincerely invite all partners who are interested in making local open source based on the world to join SeaTunnel Contributor family , Build open source together !

Submit questions and suggestions ：

https://github.com/apache/incubator-seatunnel/issues

Contribution code ：

https://github.com/apache/incubator-seatunnel/pulls

Subscribe to community development mailing lists : 

[email protected]

Develop mailing lists ：

[email protected]

Join in Slack:

https://join.slack.com/t/apacheseatunnel/shared_invite/zt-123jmewxe-RjB_DW3M3gV~xL91pZ0oVQ

Focus on Twitter: 

https://twitter.com/ASFSeaTunnel

原网站

版权声明
本文为[InfoQ]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/185/202207042132513540.html

当前位置：网站首页>Visual task scheduling & drag and drop | scalph data integration based on Apache seatunnel

Visual task scheduling & drag and drop | scalph data integration based on Apache seatunnel