China Unicom Transformed the Apache DolphinScheduler Resource Center to Realize Cross-Cluster Calls and One-Stop Access to Data and Scripts in the Billing Environment
2022-07-26 07:20:00 [DolphinScheduler Community]

As of 2022, China Unicom's user base has reached 460 million, roughly 30% of China's population. With the popularization of 5G, operator IT systems generally face the impact of a massive user base, massive bills, diversified services, new networking modes, and a series of other changes.
At present, China Unicom processes more than 40 billion billing records every day. At this scale, improving service levels and providing customers with more targeted services has become the goal the Unicom brand pursues. China Unicom has built up technologies and applications for collecting, processing, desensitizing, and encrypting massive data, giving it a certain first-mover advantage in the industry; in the future it is bound to become an important driver of a digital economy enabled by big data.
At the Apache DolphinScheduler April Meetup, we invited Bai Xuesong from the China Unicom Software Research Institute to share "The Application of DolphinScheduler in China Unicom's Billing Environment".
The talk covered three parts:
Overall use of DolphinScheduler at China Unicom
Billing business topics at China Unicom
Next-step planning

Bai Xuesong, Big Data Engineer, China Unicom Software Research Institute
Graduated from China Agricultural University. He works on big data platform construction and application and on AI platform building; he contributed the Apache SeaTunnel (Incubating) plugin to Apache DolphinScheduler and contributed the Alluxio plugin to Apache SeaTunnel (Incubating).
01 Overall Usage
First, an overview of how China Unicom uses DolphinScheduler:
Our business currently runs on 4 clusters across 3 locations;
There are roughly 300 workflows in total;
On average, nearly 5,000 task instances run every day.
The DolphinScheduler task types we use include Spark, Flink, SeaTunnel (formerly Waterdrop), Presto within stored procedures, and a number of Shell scripts. The business covered includes auditing, revenue sharing, billing, and other services that need to be automated.

02 Business Topic Sharing
01 Cross-Cluster Active-Active Service Calls
As mentioned above, our business runs on 4 clusters across 3 locations, so data exchange and service calls between clusters are unavoidable. How to manage and schedule these cross-cluster data transmission tasks is an important problem: our data sits in production clusters that are very sensitive to network bandwidth, so transfers must be managed in an orderly way.
On the other hand, some of our business needs to be invoked across clusters; for example, after cluster A's data is in place, cluster B needs to start its statistical tasks. We chose Apache DolphinScheduler as the scheduler and controller to solve both problems.
First, the cross-cluster data transmission process. On both clusters A and B we use HDFS as the underlying storage, so cross-cluster exchange is HDFS-to-HDFS. According to the size and purpose of the data, we divide it into small-batch data, large-batch data, structured tables, configuration tables, and so on, and handle each differently (a hedged transfer sketch follows this list):
For small-batch data, we mount it onto the same Alluxio instance for sharing, which avoids version problems caused by delayed synchronization;
For large files such as schedules, we use a mix of DistCp and Spark;
For structured table data, we use SeaTunnel on Spark and limit its speed through the YARN queue;
For unstructured data, we use DistCp and limit its speed through its own bandwidth parameter.
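As a concrete illustration of the two throttling paths, here is a minimal sketch, assuming placeholder NameNode addresses, paths, and queue names rather than our production values:

```bash
#!/bin/bash
# Bandwidth-limited cross-cluster copy of unstructured data with DistCp.
# -bandwidth caps each map task at N MB/s and -m caps the number of maps,
# so the total transfer rate is roughly m * bandwidth.
hadoop distcp \
  -m 20 \
  -bandwidth 10 \
  -update \
  hdfs://clusterA-nn:8020/warehouse/cdr/dt=20220501 \
  hdfs://clusterB-nn:8020/warehouse/cdr/dt=20220501

# For structured tables moved with SeaTunnel on Spark, the throttle is the
# YARN queue instead: the Spark job is submitted to a capacity-limited
# queue, e.g. --conf spark.yarn.queue=transfer_limited (queue name is a
# placeholder).
```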
All of these transmission tasks run on the DolphinScheduler platform. The overall data flow is: detect that cluster A's data is in place, verify the integrity of cluster A's data, transfer the data from A to B, then audit the data on cluster B and send an in-place notification.
One point worth emphasizing: we rely heavily on DolphinScheduler's built-in complement (backfill) and rerun capabilities to repair failed tasks or incomplete data.

After getting cross-cluster data synchronization and access in place, we also use DolphinScheduler to make task calls across regions and clusters.
In location A we have two clusters, test cluster A1 and production cluster A2; in location B we have production cluster B1. On each cluster we take two machines with intranet IPs as interface machines, and on these 6 interface machines we build a single DolphinScheduler deployment as a virtual cluster, so the contents of all three clusters can be operated from one unified page.
Q: How do you go from test to production?
A: Tasks are developed and tested on A1; once they pass, we simply switch the worker node to production A2.
Q: What if production A2 has a problem and the data is not in place?
A: We can switch directly to production B1, achieving a manual active-active disaster-recovery switch (a hedged worker-group sketch follows).
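A hedged sketch of how this switch can be expressed with worker groups (configuration keys differ across DolphinScheduler versions, and the group names here are placeholders): each cluster's interface machines register their workers under a cluster-specific group, so moving a workflow from test to production, or from A2 to B1, is just a matter of re-selecting the worker group on its tasks.

```bash
# Hedged sketch (keys vary by DolphinScheduler version, names are
# placeholders): register each cluster's workers under a group named after
# the cluster they can reach.
# On the A1 (test) interface machines:
echo "worker.groups=A1_test" >> conf/worker.properties
# On the A2 (production) interface machines: worker.groups=A2_prod
# On the B1 (production) interface machines: worker.groups=B1_prod
```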

Finally, we have some large tasks whose deadlines can only be met by computing on two clusters at once. We split the data into two parts, place them on A2 and B1 respectively, run the task on both clusters at the same time, and finally send the results back to one cluster to be merged. These task flows are basically orchestrated through DolphinScheduler.
Note that in this process we used DolphinScheduler to solve several problems (a small sketch of the second item follows this list):
Cross-cluster task dependency verification across projects;
Controlling task environment variables at the node level.
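A minimal sketch of what node-level environment control can look like in a Shell task, with hypothetical paths and variable names: each task node sources a cluster-specific client environment, so the same workflow definition can be pointed at A2 or B1.

```bash
#!/bin/bash
# Illustrative only: select the Hadoop/Spark client environment for the
# cluster this task node should talk to, instead of relying on one global
# profile. TARGET_CLUSTER is set per task node (A2 or B1).
TARGET_CLUSTER=${TARGET_CLUSTER:-A2}
source /opt/env/${TARGET_CLUSTER}/hadoop_env.sh   # hypothetical path

# Example dependency check against the selected cluster.
hdfs dfs -test -e /warehouse/cdr/dt=${biz_date}/_SUCCESS \
  && echo "data in place on ${TARGET_CLUSTER}" \
  || exit 1
```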
02 AI Development and Synchronized Task Running
1. Unified Data Access
We now have a simple AI development platform that mainly provides users with Tensorflow and Spark ML computing environments. The business requires connecting the local files and models produced by user training with the cluster file system, with a unified access mode and deployment method. To solve this we used two tools, Alluxio-fuse and DolphinScheduler:
Alluxio-fuse bridges local and cluster storage;
DolphinScheduler shares local and cluster storage.
Because the AI platform cluster and the data cluster are two separate clusters, we store data in the data cluster and preprocess it with Spark SQL or Hive. The processed data is mounted into Alluxio and then mapped across clusters to local files through alluxio-fuse, so our Conda-based development environment can access it directly. This gives us a unified access mode: cluster data is read as if it were local data (a hedged mount sketch follows).
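Below is a hedged sketch of the two mount steps with placeholder addresses and paths; exact command syntax depends on the Alluxio version in use.

```bash
#!/bin/bash
# 1) Mount the preprocessed HDFS directory of the data cluster into the
#    Alluxio namespace (addresses and paths are placeholders).
alluxio fs mount /ai/train_data \
  hdfs://data-cluster-nn:8020/warehouse/ai/train_data

# 2) Expose the Alluxio namespace as a local POSIX path via alluxio-fuse,
#    so the Conda environment reads it like ordinary local files, e.g.
#    pandas.read_parquet("/mnt/alluxio/ai/train_data/part-00000.parquet").
alluxio-fuse mount /mnt/alluxio /
```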

2. One-Stop Access to Data and Scripts
With resources separated this way, big data preprocessing runs on the data cluster while model training and prediction run on the AI cluster. Here we used Alluxio-fuse to carry out a secondary development of DolphinScheduler's resource center: the resource center is connected to Alluxio, and alluxio-fuse then mounts both local files and cluster files. As a result, DolphinScheduler can access the local training and inference scripts and, at the same time, the training and inference data stored on HDFS, realizing one-stop access to data and scripts.
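As an illustration of the idea only, and as an assumption about how the storage side can be wired (property names differ between DolphinScheduler versions, and our change was a secondary development of the resource center rather than pure configuration): the resource center's storage can be pointed at Alluxio, so the same namespace that alluxio-fuse mounts locally also backs the resource center.

```bash
# Hedged illustration, not our exact change: point the resource center's
# storage at Alluxio in conf/common.properties so resource files and
# alluxio-fuse mounted files live in one namespace.
cat >> conf/common.properties <<'EOF'
resource.storage.type=HDFS
resource.upload.path=/dolphinscheduler/resources
# HDFS-compatible URI; alluxio:// requires the Alluxio client jar on the
# API-server and worker classpath.
fs.defaultFS=alluxio://alluxio-master:19998
EOF
```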

03 Business Query Logic Persistence
The third scenario: we use Presto and Hue to give users a real-time query interface in the front end. Some users write SQL through the front end and, after testing it, need certain processing logic and stored procedures to run on a schedule, so we need a path from front-end SQL to regularly scheduled background tasks.

Another problem is that native Presto has no resource isolation between tenants. We compared several schemes and, given our actual situation, finally chose Presto on Spark.
Because we are a multi-tenant platform, the initial solution offered to users was a front-end Hue interface with native Presto running directly on the physical cluster in the back end. This led to contention for resources between users: a single large query or heavy piece of processing logic could make other tenants' jobs wait for a long time.
We therefore compared Presto on YARN and Presto on Spark; after comparing their overall performance, we found that Presto on Spark uses resources more efficiently. You can of course choose the scheme that fits your own needs.

In practice we let native Presto and Presto on Spark coexist: SQL with small data volumes and simple processing logic runs directly on native Presto, while SQL with complex logic and long run times runs on Presto on Spark. Users write one set of SQL and can switch between the underlying engines (a rough routing sketch follows).
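A rough sketch of the dual-engine routing, assuming the Presto CLI for the lightweight path; the Presto on Spark launch artifacts and class name vary by Presto release, so that branch should be read as an outline only, and all host, queue, and file names are placeholders.

```bash
#!/bin/bash
# Illustrative routing wrapper: simple, small SQL goes to the shared
# native Presto cluster; heavy SQL is run with Presto on Spark so it is
# isolated inside its own YARN queue.
SQL_FILE=$1
ENGINE=${2:-native}   # chosen by the platform, e.g. from estimated scan size

if [ "${ENGINE}" = "native" ]; then
  presto --server presto-coordinator:8080 \
         --catalog hive --schema billing \
         --file "${SQL_FILE}"
else
  # Presto on Spark (artifact and class names depend on the release).
  spark-submit --master yarn --queue presto_heavy \
    --class com.facebook.presto.spark.launcher.PrestoSparkLauncher \
    presto-spark-launcher.jar \
    --package presto-spark-package.tar.gz \
    --catalog hive --schema billing \
    --file "${SQL_FILE}"
fi
```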
In addition, we opened up the path from Hue to DolphinScheduler for scheduled task scheduling. After the SQL has been developed and tuned in Hue, it is stored as a file on the server and put under Git version control.
We mount these local files onto alluxio-fuse so the SQL stays synchronized, and finally we use Hue to create tasks and schedules through DolphinScheduler's API, giving us control over the whole flow from SQL development to scheduled execution (a hedged API sketch follows).
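To make the Hue-to-DolphinScheduler handoff concrete, here is a hedged sketch of the API step; the endpoint paths, payload fields, and IDs below are illustrative placeholders that differ between DolphinScheduler versions, not the exact interface we call.

```bash
#!/bin/bash
# Illustrative only: after the SQL file is saved on the server and synced
# through the alluxio-fuse mount, a small service calls the DolphinScheduler
# REST API with a user token to create a workflow around the SQL task and
# attach a cron schedule.
DS_API=http://dolphinscheduler-api:12345/dolphinscheduler
TOKEN=xxxx   # access token created in the DolphinScheduler security center

# 1) create the workflow definition containing one SQL task
curl -s -X POST "${DS_API}/projects/billing/process/save" \
  -H "token: ${TOKEN}" \
  --data-urlencode "name=hue_sql_daily_report" \
  --data-urlencode "processDefinitionJson@/data/hue_sql/daily_report.json"

# 2) attach a crontab schedule to the new definition (id is a placeholder)
curl -s -X POST "${DS_API}/projects/billing/schedule/create" \
  -H "token: ${TOKEN}" \
  --data-urlencode "processDefinitionId=123" \
  --data-urlencode 'schedule={"crontab":"0 0 2 * * ? *"}'
```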

04 Unified Management of Data Lake Data
The last scenario is unified management of data lake data. On our self-developed data integration platform, we use hierarchical governance to uniformly manage and access the data in the data lake, with DolphinScheduler as the scheduling and monitoring engine for data ingestion into the lake.
On the data integration platform, the batch and real-time tasks for data integration, data ingestion, and data distribution are all scheduled by DolphinScheduler.
The underlying jobs run on Spark and Flink. For data query and data exploration that need immediate feedback, we embed Hue to access Spark and Presto; for data asset registration/synchronization and data audit, we query the data source file information directly and synchronize the underlying data information.

At present, the integration platform manages quality for roughly 460 data tables, providing unified management of data accuracy and timeliness.
03 Next-Step Plans and Requirements

01 Resource Center
At the resource center level, to make file sharing between users easier, we plan to provide resource authorization for all users and, based on the tenant a resource belongs to, assign tenant-level shared files, making the platform friendlier for multi-tenant use.
02 User Management
Next is user permissions. We will only provide tenant-level administrator accounts; subsequent user accounts are created by the tenant administrator, and user management within the tenant group is also controlled by the tenant administrator, which makes internal tenant management easier.
03 Task Nodes
Finally, plans related to task nodes, which are already in progress: one is to improve the SQL node so that users can select a SQL file from the resource center instead of manually copying SQL; the other is custom parsing of the JSON returned by the HTTP node, extracting fields for judgment so that complex return values are handled more gracefully.
04 Participation and Contribution
With the rapid rise of open source in China, the Apache DolphinScheduler community is booming. To build a better, easier-to-use scheduler, we sincerely welcome partners who love open source to join the community, contribute to the rise of Chinese open source, and help local open source go global.
There are many ways to participate in and contribute to the DolphinScheduler community, including:

Contributing your first PR (documentation or code). We hope it is something simple; the first PR is for getting familiar with the submission process and community collaboration, and for feeling the friendliness of the community.
The community has compiled a list of issues for newcomers: https://github.com/apache/dolphinscheduler/issues/5689
List of non-newcomer issues: https://github.com/apache/dolphinscheduler/issues?q=is%3Aopen+is%3Aissue+label%3A"volunteer+wanted"
How to participate and contribute: https://dolphinscheduler.apache.org/zh-cn/docs/development/contribute.html
Come on, the DolphinScheduler open source community needs your participation to contribute to the rise of Chinese open source. Even a single small tile counts; the power that comes together is enormous.
Participating in open source lets you work alongside experts from all walks of life and quickly improve your skills. If you want to contribute, we have a contributor seed incubation group; you can add the community assistant on WeChat (Leonard-ds) for hands-on guidance (there are suitable tasks for contributors of all levels; the key is the willingness to contribute).
When adding the assistant on WeChat, please note that you want to participate in contribution.
Come on, the open source community is looking forward to your participation.
05 Activity Recommendation
When data resources have become an essential element of production and even survival, how can data integration help enterprises implement full-lifecycle data services? On May 14, the data integration framework Apache SeaTunnel (Incubating) will invite technical experts and open source contributors from the one-stop data integration platform Apache InLong (Incubating) to the live studio to discuss practical experience with Apache SeaTunnel (Incubating) and Apache InLong (Incubating).
Affected by the epidemic, this event will again be held as an online live broadcast. Free registration is now open; scan the QR code below or click "Read the original" to register for free!
Live link: https://www.slidestalk.com/m/777