China Unicom Transformed the Apache DolphinScheduler Resource Center to Realize Cross-Cluster Calls and One-Stop Access to Data and Scripts in the Billing Environment
2022-07-26 07:20:00 [DolphinScheduler Community]

As of 2022, China Unicom's user base has reached 460 million, roughly 30% of China's population. With the popularization of 5G, operator IT systems generally face the impact of a massive user base, massive bills, diversified services, new networking modes, and a series of other changes.
At present, China Unicom processes more than 40 billion billing records every day. At this scale, improving service levels and providing customers with more targeted services has become the goal the Unicom brand pursues. China Unicom has built up technologies and applications for collecting, processing, desensitizing, and encrypting massive data, giving it a certain first-mover advantage in the industry; in the future it is bound to become an important driver of a digital economy enabled by big data.
At the Apache DolphinScheduler April Meetup, we invited Bai Xuesong from the China Unicom Software Research Institute to share "The Application of DolphinScheduler in China Unicom's Billing Environment".
The talk covered three parts:
Overall use of DolphinScheduler at China Unicom
Billing business topics at China Unicom
Next-step planning

Bai Xuesong, Big Data Engineer, China Unicom Software Research Institute
Graduated from China Agricultural University. He works on big data platform construction and application and on AI platform building; he contributed the Apache SeaTunnel (Incubating) plugin to Apache DolphinScheduler and contributed the Alluxio plugin to Apache SeaTunnel (Incubating).
01 Overall Usage
First, an overview of how China Unicom uses DolphinScheduler:
Our business currently runs on 4 clusters across 3 locations;
There are roughly 300 workflows in total;
On average, nearly 5,000 task instances run every day.
The DolphinScheduler task types we use include Spark, Flink, SeaTunnel (formerly Waterdrop), Presto within stored procedures, and a number of Shell scripts. The business covered includes auditing, revenue sharing, billing, and other services that need to be automated.

02 Business Topic Sharing
01 Cross-Cluster Active-Active Service Calls
As mentioned above, our business runs on 4 clusters across 3 locations, so data exchange and service calls between clusters are unavoidable. How to manage and schedule these cross-cluster data transmission tasks is an important problem: our data sits in production clusters that are very sensitive to network bandwidth, so transfers must be managed in an orderly way.
On the other hand, some of our business needs to be invoked across clusters; for example, after cluster A's data is in place, cluster B needs to start its statistical tasks. We chose Apache DolphinScheduler as the scheduler and controller to solve both problems.
First, the cross-cluster data transmission process. On both clusters A and B we use HDFS as the underlying storage, so cross-cluster exchange is HDFS-to-HDFS. According to the size and purpose of the data, we divide it into small-batch data, large-batch data, structured tables, configuration tables, and so on, and handle each differently (a hedged transfer sketch follows this list):
For small-batch data, we mount it onto the same Alluxio instance for sharing, which avoids version problems caused by delayed synchronization;
For large files such as schedules, we use a mix of DistCp and Spark;
For structured table data, we use SeaTunnel on Spark and limit its speed through the YARN queue;
For unstructured data, we use DistCp and limit its speed through its own bandwidth parameter.
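As a concrete illustration of the two throttling paths, here is a minimal sketch, assuming placeholder NameNode addresses, paths, and queue names rather than our production values:

```bash
#!/bin/bash
# Bandwidth-limited cross-cluster copy of unstructured data with DistCp.
# -bandwidth caps each map task at N MB/s and -m caps the number of maps,
# so the total transfer rate is roughly m * bandwidth.
hadoop distcp \
  -m 20 \
  -bandwidth 10 \
  -update \
  hdfs://clusterA-nn:8020/warehouse/cdr/dt=20220501 \
  hdfs://clusterB-nn:8020/warehouse/cdr/dt=20220501

# For structured tables moved with SeaTunnel on Spark, the throttle is the
# YARN queue instead: the Spark job is submitted to a capacity-limited
# queue, e.g. --conf spark.yarn.queue=transfer_limited (queue name is a
# placeholder).
```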
All of these transmission tasks run on the DolphinScheduler platform. The overall data flow is: detect that cluster A's data is in place, verify the integrity of cluster A's data, transfer the data from A to B, then audit the data on cluster B and send an in-place notification.
One point worth emphasizing: we rely heavily on DolphinScheduler's built-in complement (backfill) and rerun capabilities to repair failed tasks or incomplete data.

After getting cross-cluster data synchronization and access in place, we also use DolphinScheduler to make task calls across regions and clusters.
In location A we have two clusters, test cluster A1 and production cluster A2; in location B we have production cluster B1. On each cluster we take two machines with intranet IPs as interface machines, and on these 6 interface machines we build a single DolphinScheduler deployment as a virtual cluster, so the contents of all three clusters can be operated from one unified page.
Q: How do you go from test to production?
A: Tasks are developed and tested on A1; once they pass, we simply switch the worker node to production A2.
Q: What if production A2 has a problem and the data is not in place?
A: We can switch directly to production B1, achieving a manual active-active disaster-recovery switch (a hedged worker-group sketch follows).
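A hedged sketch of how this switch can be expressed with worker groups (configuration keys differ across DolphinScheduler versions, and the group names here are placeholders): each cluster's interface machines register their workers under a cluster-specific group, so moving a workflow from test to production, or from A2 to B1, is just a matter of re-selecting the worker group on its tasks.

```bash
# Hedged sketch (keys vary by DolphinScheduler version, names are
# placeholders): register each cluster's workers under a group named after
# the cluster they can reach.
# On the A1 (test) interface machines:
echo "worker.groups=A1_test" >> conf/worker.properties
# On the A2 (production) interface machines: worker.groups=A2_prod
# On the B1 (production) interface machines: worker.groups=B1_prod
```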

Finally, we have some large tasks whose deadlines can only be met by computing on two clusters at once. We split the data into two parts, place them on A2 and B1 respectively, run the task on both clusters at the same time, and finally send the results back to one cluster to be merged. These task flows are basically orchestrated through DolphinScheduler.
Note that in this process we used DolphinScheduler to solve several problems (a small sketch of the second item follows this list):
Cross-cluster task dependency verification across projects;
Controlling task environment variables at the node level.
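A minimal sketch of what node-level environment control can look like in a Shell task, with hypothetical paths and variable names: each task node sources a cluster-specific client environment, so the same workflow definition can be pointed at A2 or B1.

```bash
#!/bin/bash
# Illustrative only: select the Hadoop/Spark client environment for the
# cluster this task node should talk to, instead of relying on one global
# profile. TARGET_CLUSTER is set per task node (A2 or B1).
TARGET_CLUSTER=${TARGET_CLUSTER:-A2}
source /opt/env/${TARGET_CLUSTER}/hadoop_env.sh   # hypothetical path

# Example dependency check against the selected cluster.
hdfs dfs -test -e /warehouse/cdr/dt=${biz_date}/_SUCCESS \
  && echo "data in place on ${TARGET_CLUSTER}" \
  || exit 1
```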
02 AI Development and Synchronized Task Running
1. Unified Data Access
We now have a simple AI development platform that mainly provides users with Tensorflow and Spark ML computing environments. The business requires connecting the local files and models produced by user training with the cluster file system, with a unified access mode and deployment method. To solve this we used two tools, Alluxio-fuse and DolphinScheduler:
Alluxio-fuse bridges local and cluster storage;
DolphinScheduler shares local and cluster storage.
Because the AI platform cluster and the data cluster are two separate clusters, we store data in the data cluster and preprocess it with Spark SQL or Hive. The processed data is mounted into Alluxio and then mapped across clusters to local files through alluxio-fuse, so our Conda-based development environment can access it directly. This gives us a unified access mode: cluster data is read as if it were local data (a hedged mount sketch follows).
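Below is a hedged sketch of the two mount steps with placeholder addresses and paths; exact command syntax depends on the Alluxio version in use.

```bash
#!/bin/bash
# 1) Mount the preprocessed HDFS directory of the data cluster into the
#    Alluxio namespace (addresses and paths are placeholders).
alluxio fs mount /ai/train_data \
  hdfs://data-cluster-nn:8020/warehouse/ai/train_data

# 2) Expose the Alluxio namespace as a local POSIX path via alluxio-fuse,
#    so the Conda environment reads it like ordinary local files, e.g.
#    pandas.read_parquet("/mnt/alluxio/ai/train_data/part-00000.parquet").
alluxio-fuse mount /mnt/alluxio /
```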

2. One-Stop Access to Data and Scripts
With resources separated this way, big data preprocessing runs on the data cluster while model training and prediction run on the AI cluster. Here we used Alluxio-fuse to carry out a secondary development of DolphinScheduler's resource center: the resource center is connected to Alluxio, and alluxio-fuse then mounts both local files and cluster files. As a result, DolphinScheduler can access the local training and inference scripts and, at the same time, the training and inference data stored on HDFS, realizing one-stop access to data and scripts.
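As an illustration of the idea only, and as an assumption about how the storage side can be wired (property names differ between DolphinScheduler versions, and our change was a secondary development of the resource center rather than pure configuration): the resource center's storage can be pointed at Alluxio, so the same namespace that alluxio-fuse mounts locally also backs the resource center.

```bash
# Hedged illustration, not our exact change: point the resource center's
# storage at Alluxio in conf/common.properties so resource files and
# alluxio-fuse mounted files live in one namespace.
cat >> conf/common.properties <<'EOF'
resource.storage.type=HDFS
resource.upload.path=/dolphinscheduler/resources
# HDFS-compatible URI; alluxio:// requires the Alluxio client jar on the
# API-server and worker classpath.
fs.defaultFS=alluxio://alluxio-master:19998
EOF
```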

03 Business Query Logic Persistence
The third scenario: we use Presto and Hue to give users a real-time query interface in the front end. Some users write SQL through the front end and, after testing it, need certain processing logic and stored procedures to run on a schedule, so we need a path from front-end SQL to regularly scheduled background tasks.

Another problem is that native Presto has no resource isolation between tenants. We compared several schemes and, given our actual situation, finally chose Presto on Spark.
Because we are a multi-tenant platform, the initial solution offered to users was a front-end Hue interface with native Presto running directly on the physical cluster in the back end. This led to contention for resources between users: a single large query or heavy piece of processing logic could make other tenants' jobs wait for a long time.
We therefore compared Presto on YARN and Presto on Spark; after comparing their overall performance, we found that Presto on Spark uses resources more efficiently. You can of course choose the scheme that fits your own needs.

In practice we let native Presto and Presto on Spark coexist: SQL with small data volumes and simple processing logic runs directly on native Presto, while SQL with complex logic and long run times runs on Presto on Spark. Users write one set of SQL and can switch between the underlying engines (a rough routing sketch follows).
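A rough sketch of the dual-engine routing, assuming the Presto CLI for the lightweight path; the Presto on Spark launch artifacts and class name vary by Presto release, so that branch should be read as an outline only, and all host, queue, and file names are placeholders.

```bash
#!/bin/bash
# Illustrative routing wrapper: simple, small SQL goes to the shared
# native Presto cluster; heavy SQL is run with Presto on Spark so it is
# isolated inside its own YARN queue.
SQL_FILE=$1
ENGINE=${2:-native}   # chosen by the platform, e.g. from estimated scan size

if [ "${ENGINE}" = "native" ]; then
  presto --server presto-coordinator:8080 \
         --catalog hive --schema billing \
         --file "${SQL_FILE}"
else
  # Presto on Spark (artifact and class names depend on the release).
  spark-submit --master yarn --queue presto_heavy \
    --class com.facebook.presto.spark.launcher.PrestoSparkLauncher \
    presto-spark-launcher.jar \
    --package presto-spark-package.tar.gz \
    --catalog hive --schema billing \
    --file "${SQL_FILE}"
fi
```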
In addition, we opened up the path from Hue to DolphinScheduler for scheduled task scheduling. After the SQL has been developed and tuned in Hue, it is stored as a file on the server and put under Git version control.
We mount these local files onto alluxio-fuse so the SQL stays synchronized, and finally we use Hue to create tasks and schedules through DolphinScheduler's API, giving us control over the whole flow from SQL development to scheduled execution (a hedged API sketch follows).
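To make the Hue-to-DolphinScheduler handoff concrete, here is a hedged sketch of the API step; the endpoint paths, payload fields, and IDs below are illustrative placeholders that differ between DolphinScheduler versions, not the exact interface we call.

```bash
#!/bin/bash
# Illustrative only: after the SQL file is saved on the server and synced
# through the alluxio-fuse mount, a small service calls the DolphinScheduler
# REST API with a user token to create a workflow around the SQL task and
# attach a cron schedule.
DS_API=http://dolphinscheduler-api:12345/dolphinscheduler
TOKEN=xxxx   # access token created in the DolphinScheduler security center

# 1) create the workflow definition containing one SQL task
curl -s -X POST "${DS_API}/projects/billing/process/save" \
  -H "token: ${TOKEN}" \
  --data-urlencode "name=hue_sql_daily_report" \
  --data-urlencode "processDefinitionJson@/data/hue_sql/daily_report.json"

# 2) attach a crontab schedule to the new definition (id is a placeholder)
curl -s -X POST "${DS_API}/projects/billing/schedule/create" \
  -H "token: ${TOKEN}" \
  --data-urlencode "processDefinitionId=123" \
  --data-urlencode 'schedule={"crontab":"0 0 2 * * ? *"}'
```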

04 Unified Management of Data Lake Data
The last scenario is unified management of data lake data. On our self-developed data integration platform, we use hierarchical governance to uniformly manage and access the data in the data lake, with DolphinScheduler as the scheduling and monitoring engine for data ingestion into the lake.
On the data integration platform, the batch and real-time tasks for data integration, data ingestion, and data distribution are all scheduled by DolphinScheduler.
The underlying jobs run on Spark and Flink. For data query and data exploration that need immediate feedback, we embed Hue to access Spark and Presto; for data asset registration/synchronization and data audit, we query the data source file information directly and synchronize the underlying data information.

At present, the integration platform manages quality for roughly 460 data tables, providing unified management of data accuracy and timeliness.
03 Next-Step Plans and Requirements

01 Resource Center
At the resource center level, to make file sharing between users easier, we plan to provide resource authorization for all users and, based on the tenant a resource belongs to, assign tenant-level shared files, making the platform friendlier for multi-tenant use.
02 User Management
Next is user permissions. We will only provide tenant-level administrator accounts; subsequent user accounts are created by the tenant administrator, and user management within the tenant group is also controlled by the tenant administrator, which makes internal tenant management easier.
03 Task Nodes
Finally, plans related to task nodes, which are already in progress: one is to improve the SQL node so that users can select a SQL file from the resource center instead of manually copying SQL; the other is custom parsing of the JSON returned by the HTTP node, extracting fields for judgment so that complex return values are handled more gracefully.
04 Participation and Contribution
With the rapid rise of open source in China, the Apache DolphinScheduler community is booming. To build a better, easier-to-use scheduler, we sincerely welcome partners who love open source to join the community, contribute to the rise of Chinese open source, and help local open source go global.
There are many ways to participate in and contribute to the DolphinScheduler community, including:

Contributing your first PR (documentation or code). We hope it is something simple; the first PR is for getting familiar with the submission process and community collaboration, and for feeling the friendliness of the community.
The community has compiled a list of issues for newcomers: https://github.com/apache/dolphinscheduler/issues/5689
List of non-newcomer issues: https://github.com/apache/dolphinscheduler/issues?q=is%3Aopen+is%3Aissue+label%3A"volunteer+wanted"
How to participate and contribute: https://dolphinscheduler.apache.org/zh-cn/docs/development/contribute.html
Come on, the DolphinScheduler open source community needs your participation to contribute to the rise of Chinese open source. Even a single small tile counts; the power that comes together is enormous.
Participating in open source lets you work alongside experts from all walks of life and quickly improve your skills. If you want to contribute, we have a contributor seed incubation group; you can add the community assistant on WeChat (Leonard-ds) for hands-on guidance (there are suitable tasks for contributors of all levels; the key is the willingness to contribute).
When adding the assistant on WeChat, please note that you want to participate in contribution.
Come on, the open source community is looking forward to your participation.
05 Activity Recommendation
When data resources have become an essential element of production and even survival, how can data integration help enterprises implement full-lifecycle data services? On May 14, the data integration framework Apache SeaTunnel (Incubating) will invite technical experts and open source contributors from the one-stop data integration platform Apache InLong (Incubating) to the live studio to discuss practical experience with Apache SeaTunnel (Incubating) and Apache InLong (Incubating).
Affected by the epidemic, this event will again be held as an online live broadcast. Free registration is now open; scan the QR code below or click "Read the original" to register for free!
Live link: https://www.slidestalk.com/m/777