
Evolution and Practice of the China Unicom Real-Time Computing Platform

2022-06-12 03:21:00 ApacheFlink

Abstract: This article is compiled from a talk on platform construction given by Mu Chunjin, big data technology expert at China Unicom, at Flink Forward Asia 2021. The main contents include:

  1. Background of the real-time computing platform
  2. Evolution and practice of the real-time computing platform
  3. Flink-based cluster governance
  4. Future plans

FFA 2021 live replay & talk PDF download

1. Background of the Real-Time Computing Platform


First, some background on the real-time computing platform. The business systems of the telecom industry are very complex, so there are many data sources. The real-time computing platform currently ingests more than 30 data sources, and although these 30-plus sources are only a small share of all our data types, the data volume has still reached the trillion-record level, with about 600 TB of new data every day, and both the number and the size of the data sources we ingest keep growing. The platform's users come from China Unicom's 31 provincial companies and the group's subsidiaries, and around holidays a large number of users subscribe to rules. To get data, a user subscribes on the platform; we encapsulate the data sources into standardized scenarios. At present we offer 26 standardized scenarios, supporting more than 5,000 rule subscriptions.


Subscription scenarios fall into three categories.

  • User behavior scenarios (14 in total): for example, location scenarios, i.e. how long a user stays after entering an area; terminal login, i.e. which network the user is attached to, 4G or 5G; plus roaming, voice, product ordering, and other scenarios;
  • Usage scenarios (6 in total): for example, how much traffic a user consumes, the account balance, whether the account is in arrears or suspended, and so on;
  • Customer touchpoint scenarios (around 10): for example, handling business, recharging and payment, and new home broadband installation and activation.


The latency requirements on the real-time computing platform are very strict. From generation to entry into our system, data is delayed by roughly 5 to 20 seconds, and normal processing in the system may add another 3 to 10 seconds. The maximum delay we allow is 5 minutes, so end-to-end latency monitoring of the platform is essential.

Data that matches a user-defined scenario must be distributed at least once. This is a strict requirement: nothing may be missed, and Flink meets it well. Data accuracy must reach 95%. A copy of the data the platform distributes in real time is saved to HDFS, and every day we extract some subscription data and offline data, regenerate the results under the same rules, and compare the data quality. If the difference is too large, we look for the cause to guarantee the quality of subsequent data.


What I am sharing here is an in-depth Flink practice from an enterprise in the telecom industry: how we use Flink to better support our requirements.

A generic platform cannot support our special scenarios. Take the location scenarios as an example: we draw multiple electronic fences on a map, and when a user enters a fence, stays in it for a period of time, and also matches subscriber-defined user attributes such as gender, age, occupation, and income, that data is distributed.
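To make the enter-and-dwell check concrete, here is a minimal sketch using Flink keyed state and timers. The event and alert types, the 10-minute threshold, and the choice of processing-time timers are illustrative assumptions; the talk does not disclose the platform's actual implementation.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Illustrative POJOs; the platform's real schema is not shown in the talk.
class LocationEvent { String userId; String fenceId; boolean insideFence; }
class DwellAlert    { String key; DwellAlert(String key) { this.key = key; } }

/** Emits an alert when a user has stayed inside a fence for DWELL_MS. */
public class DwellDetector extends KeyedProcessFunction<String, LocationEvent, DwellAlert> {

    private static final long DWELL_MS = 10 * 60 * 1000L; // assumed 10-minute dwell threshold
    private transient ValueState<Long> dwellTimer;        // pending timer, held by the state backend

    @Override
    public void open(Configuration parameters) {
        dwellTimer = getRuntimeContext().getState(
                new ValueStateDescriptor<>("dwell-timer", Long.class));
    }

    @Override
    public void processElement(LocationEvent e, Context ctx, Collector<DwellAlert> out)
            throws Exception {
        Long pending = dwellTimer.value();
        if (e.insideFence && pending == null) {
            // First sighting inside the fence: start the dwell countdown.
            long fireAt = ctx.timerService().currentProcessingTime() + DWELL_MS;
            ctx.timerService().registerProcessingTimeTimer(fireAt);
            dwellTimer.update(fireAt);
        } else if (!e.insideFence && pending != null) {
            // Left the fence before the dwell elapsed: cancel.
            ctx.timerService().deleteProcessingTimeTimer(pending);
            dwellTimer.clear();
        }
    }

    @Override
    public void onTimer(long ts, OnTimerContext ctx, Collector<DwellAlert> out) throws Exception {
        // Still inside after DWELL_MS: the dwell condition holds. Demographic
        // filters (gender, age, occupation, income) would be checked downstream.
        out.collect(new DwellAlert(ctx.getCurrentKey()));
        dwellTimer.clear();
    }
}
```

The stream would be keyed by user and fence, for example `events.keyBy(e -> e.userId + "|" + e.fenceId).process(new DwellDetector())`.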

Our platform can be seen as a collection of real-time scenarios: it cleans data in real time and encapsulates it into complex scenarios that support combinations of multiple conditions, intensively providing standard real-time capabilities while staying close to the business, with a simple, easy-to-use, low-threshold access model. The business side calls a standardized interface and passes gateway authentication before it can subscribe to a scenario. After a successful subscription, the Kafka connection information and the data schema are returned to the subscriber. The filtering criteria of the subscription are then matched against the data stream, and matched data is tagged and distributed through Kafka. That is how our real-time computing platform interacts with downstream systems.
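The talk does not include code, but the distribution step can be sketched with the Kafka sink's per-record topic selector (standard connector API since Flink 1.14, which may differ from the version the platform actually runs). The `MatchedRecord` type and all names below are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

// A record that matched a subscription, tagged with that rule's target topic.
class MatchedRecord {
    String targetTopic;  // e.g. "rule_000123" (illustrative naming)
    String payload;      // JSON following the schema returned at subscription time
}

public class RuleFanOut {
    public static KafkaSink<MatchedRecord> buildSink() {
        return KafkaSink.<MatchedRecord>builder()
                .setBootstrapServers("kafka:9092")  // placeholder address
                .setRecordSerializer(KafkaRecordSerializationSchema.<MatchedRecord>builder()
                        .setTopicSelector(r -> r.targetTopic)  // route to the rule's topic
                        .setValueSerializationSchema(
                                r -> r.payload.getBytes(StandardCharsets.UTF_8))
                        .build())
                // Matches the platform's "distribute at least once" guarantee.
                .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
                .build();
    }
}
```

A `matched.sinkTo(RuleFanOut.buildSink())` call then delivers each matched record to its subscription's topic.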

2. Evolution and Practice of the Real-Time Computing Platform


Before 2020, our platform was implemented with Kafka + Spark Streaming on a third-party platform purchased from a vendor. We ran into many problems and bottlenecks, and it was hard to meet our day-to-day needs. Many enterprises, including ours, are undergoing a digital transformation, the proportion of self-developed systems keeps rising, and demand keeps driving us, so a self-developed, flexibly customizable, controllable system became imperative. In 2020 we therefore started working with Flink and implemented a Flink-based real-time computing platform. In the process we also experienced the appeal of open source, and going forward we will participate more in the community.


Our previous platform had many problems. First, it was a third-party black box: we were using a vendor's platform and depended too heavily on external systems, and under high concurrency the load on those external systems became very high, while personalized requirements could not be flexibly customized. The Kafka load was also extremely high, because each rule subscription mapped to multiple topics, so as rules grew the number of topics and partitions grew linearly, and latency was relatively high. Each subscription scenario mapped to multiple real-time streams, each occupying memory and CPU, so too many scenarios drove up resource consumption and overloaded the cluster. Capacity was small and the number of supported subscriptions was limited; during the Spring Festival, for example, subscriptions surged and we frequently had to firefight, unable to meet the growing demand. In addition, monitoring granularity was insufficient: monitoring could not be flexibly customized, end-to-end monitoring was impossible, and much of the troubleshooting was done by hand.


Given these problems, we built a fully self-developed real-time computing platform on Flink, with deep customization tailored to the characteristics of each scenario to maximize resource efficiency. Using Flink, we reduced external dependencies, which lowered program complexity and improved program performance. Flexible customization optimized resource usage, saving substantial resources under the same workload. To keep system latency low, we also built end-to-end monitoring, adding monitoring for data backlog, delay, and broken transmission.

The architecture of the platform is relatively simple. It runs in Flink-on-YARN mode, its only external dependency is HBase, and data enters through Kafka and is sent out through Kafka.


The Flink cluster is built independently on 550 dedicated servers and is not mixed with offline computing, because it demands high stability: it needs to process about 1.5 trillion records and close to 600 TB of new data every day.


The main reason we customize deeply for each scenario is the volume of data: the same scenario has many subscriptions, and each subscription has different conditions. When a record is read from Kafka, it must be matched against multiple rules, and after matching it is distributed to each matching rule's topic. So no matter how many subscriptions there are, the data is read from Kafka only once, which reduces the consumption pressure on Kafka.

A phone attaches to a base station when making a call or going online, and data from the same base station is compressed within a certain time window or up to a fixed number of messages, for example a three-second window, or a trigger once 1,000 messages have arrived; this reduces the number of messages received downstream by orders of magnitude. Then comes fence matching, where the pressure on the system scales with the number of base stations rather than the number of messages. We also make full use of Flink state: fence entries and dwells are stored in state, and the RocksDB state backend reduces external dependencies and simplifies the system. In addition, we implemented hundred-million-scale tag association without depending on external systems. Only after data compression, fence matching, entry-and-dwell detection, and tag association does the actual rule matching begin. (A sketch of the compression trigger follows.)
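A custom window trigger can implement the "three seconds or 1,000 messages, whichever comes first" behavior described above. This is a minimal illustration under those assumptions, not the platform's actual code:

```java
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

/** Fires when the processing-time window ends OR the pane reaches maxCount, whichever comes first. */
public class CountOrTimeTrigger<T> extends Trigger<T, TimeWindow> {

    private final long maxCount; // e.g. 1000, as in the talk
    private final ReducingStateDescriptor<Long> countDesc =
            new ReducingStateDescriptor<>("count", (ReduceFunction<Long>) Long::sum, Long.class);

    public CountOrTimeTrigger(long maxCount) { this.maxCount = maxCount; }

    @Override
    public TriggerResult onElement(T element, long ts, TimeWindow window, TriggerContext ctx)
            throws Exception {
        ctx.registerProcessingTimeTimer(window.maxTimestamp()); // the time bound
        ReducingState<Long> count = ctx.getPartitionedState(countDesc);
        count.add(1L);
        if (count.get() >= maxCount) {                          // count bound reached early
            count.clear();
            return TriggerResult.FIRE_AND_PURGE;
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.FIRE_AND_PURGE;                    // window closed by time
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;                          // processing time only in this sketch
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
        ctx.deleteProcessingTimeTimer(window.maxTimestamp());
        ctx.getPartitionedState(countDesc).clear();
    }
}
```

It would be attached to a per-base-station window, e.g. `stream.keyBy(e -> e.cellId).window(TumblingProcessingTimeWindows.of(Time.seconds(3))).trigger(new CountOrTimeTrigger<>(1000))`, where `cellId` is an assumed field name.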

After a user subscribes to a scenario, the subscription rule is synchronized to the real-time computing platform via Flink CDC, which keeps latency low. Because entries and dwells are stored in state, the RocksDB backend holds a relatively large amount of data, and we troubleshoot problems by analyzing that state data, for example to check whether a user is currently inside a fence.
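A common way to realize such rule synchronization is to combine a Flink CDC source with the broadcast state pattern, so that every parallel matcher instance sees rule changes within seconds. Below is a minimal sketch assuming the rules live in a MySQL table named `platform.subscription_rule` (hypothetical) and the flink-cdc 2.x connector API:

```java
import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RuleSyncJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Capture inserts/updates on the (hypothetical) rule table as a change stream.
        MySqlSource<String> rulesCdc = MySqlSource.<String>builder()
                .hostname("rules-db")               // placeholder connection details
                .port(3306)
                .databaseList("platform")
                .tableList("platform.subscription_rule")
                .username("reader")
                .password("***")
                .deserializer(new JsonDebeziumDeserializationSchema())
                .build();

        // Rule id -> rule definition (as JSON), held in broadcast state on every subtask.
        MapStateDescriptor<String, String> ruleStateDesc =
                new MapStateDescriptor<>("rules", String.class, String.class);

        BroadcastStream<String> ruleChanges = env
                .fromSource(rulesCdc, WatermarkStrategy.noWatermarks(), "rule-cdc")
                .broadcast(ruleStateDesc);

        // The event stream would then be connected to the broadcast stream:
        //   events.connect(ruleChanges).process(new RuleMatcher())
        // where RuleMatcher (a BroadcastProcessFunction) updates broadcast state in
        // processBroadcastElement() and evaluates each event against the current
        // rule set in processElement().
        env.execute("rule-sync");
    }
}
```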


We customized a hash algorithm to implement hundred-million-scale tag association without relying on any external system.

Under high concurrency, if every user record had to query an external system for its tags, the pressure on that system would be enormous, especially at China Unicom's data volumes, and the cost of building and relying on such an external system is also high. These tags are offline tags and the data is relatively stable, updated at day or month granularity. So we apply a custom hash to each user: given a phone number, the hash assigns it to the task_0 instance, whose index is 0, while an offline job uses the same hash to assign that phone number's record in the tag data to the file numbered 0, i.e. 0.tag.

In its open() method, the task_0 instance obtains its own subtask index, namely index=0, concatenates the tag file name 0.tag, and loads that file into its own memory. When the task_0 instance later receives a phone number, it fetches that number's tag data from local memory. This avoids redundant memory waste, improves system performance, and removes external dependencies.

When tags are updated, the open() method likewise loads the new tags automatically and refreshes them into memory.
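A minimal sketch of this pattern is shown below. The `hashCode`-modulo function stands in for the platform's custom (undisclosed) hash, and the HDFS path and file format are assumptions; the reload-on-update logic described above is omitted for brevity:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Stand-in for the platform's custom hash: routes a phone number to a subtask. */
class PhoneShardPartitioner implements Partitioner<String> {
    @Override
    public int partition(String phone, int numPartitions) {
        // The offline job that writes the tag shards must use the same function.
        return Math.floorMod(phone.hashCode(), numPartitions);
    }
}

/** Each subtask loads only its own tag shard ("<index>.tag") into local memory. */
public class TagEnricher extends RichMapFunction<String, String> {
    private transient Map<String, String> tags;

    @Override
    public void open(Configuration parameters) throws Exception {
        int index = getRuntimeContext().getIndexOfThisSubtask(); // e.g. 0 -> "0.tag"
        Path shard = new Path("hdfs:///tags/" + index + ".tag"); // hypothetical layout
        tags = new HashMap<>();
        FileSystem fs = FileSystem.get(shard.toUri(), new org.apache.hadoop.conf.Configuration());
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(shard), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] kv = line.split("\t", 2);               // "phone \t tags" (assumed format)
                if (kv.length == 2) tags.put(kv[0], kv[1]);
            }
        }
    }

    @Override
    public String map(String phone) {
        // Pure local-memory lookup: no external system on the hot path.
        return phone + "|" + tags.getOrDefault(phone, "");
    }
}
```

The two pieces are wired together with `calls.partitionCustom(new PhoneShardPartitioner(), phone -> phone).map(new TagEnricher())`, so each record lands on exactly the subtask that holds its tag shard.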


The figure above shows our end-to-end latency monitoring. Because our business has strict latency requirements, we stamp event times on the data, for example the times a message enters and leaves Kafka (the events here are messages). For operator latency monitoring, we compute the delay from the stamped time and the current time; we do not compute it for every arriving message, but by sampling.

Backlog and broken transmission are also monitored, by collecting Kafka offsets and comparing them over time. There is also data-delay monitoring: computing the delay from the event time and the current time lets us monitor the data delay of upstream systems as well.
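The sampled latency measurement described above can be sketched as a pass-through ProcessFunction that exposes a gauge metric. The metric name and the 1-in-1,000 sampling rate are illustrative, since the talk does not state the actual rate:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Gauge;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

/** Pass-through operator that samples records and exposes event-time lag as a gauge. */
public class LatencyProbe<T> extends ProcessFunction<T, T> {

    private static final int SAMPLE_EVERY = 1000; // illustrative sampling rate
    private transient long lastLagMs;
    private transient long seen;

    @Override
    public void open(Configuration parameters) {
        getRuntimeContext().getMetricGroup()
                .gauge("e2eLatencyMs", (Gauge<Long>) () -> lastLagMs);
    }

    @Override
    public void processElement(T value, Context ctx, Collector<T> out) {
        Long eventTs = ctx.timestamp(); // the time mark stamped when the message was produced
        if (eventTs != null && ++seen % SAMPLE_EVERY == 0) {
            lastLagMs = System.currentTimeMillis() - eventTs; // sampled, not per message
        }
        out.collect(value);
    }
}
```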


The figure above shows the end-to-end latency monitoring and backpressure monitoring charts. You can see that end-to-end latency normally sits between 2 and 6 seconds, which matches our expectations, since the location conditions are quite complex. We also monitor backpressure, watching each operator's input channel utilization to locate where backpressure arises. In the second chart, for example, there is severe backpressure that persists for a while; at that point we need to pin down the specific operator and investigate the cause, to keep system latency low.


The figure above shows how we collect the partition offsets of every topic in each Kafka cluster, together with each consumer's consumption position, to locate transmission breaks and backlogs.

First, a source fetches the topic list and the consumer-group list and distributes them downstream, so the downstream operators can collect the offset values in a distributed manner; this also exploits Flink's strengths. The results are finally written to ClickHouse for analysis.
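The offset comparison itself can be sketched with Kafka's AdminClient. In the real job this logic runs inside distributed Flink operators and the results go to ClickHouse, but a standalone version, with a hypothetical consumer-group id, looks like this:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class KafkaLagProbe {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");              // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // 1. Where the consumer group currently is (committed offsets).
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                    .listConsumerGroupOffsets("rule-subscriber-group") // hypothetical group id
                    .partitionsToOffsetAndMetadata().get();

            // 2. Where each partition's log currently ends.
            Map<TopicPartition, OffsetSpec> query = new HashMap<>();
            committed.keySet().forEach(tp -> query.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                    admin.listOffsets(query).all().get();

            // 3. Backlog = end offset - committed offset. A committed offset that
            //    stops advancing across samples indicates a transmission break.
            committed.forEach((tp, om) -> System.out.printf(
                    "%s lag=%d%n", tp, ends.get(tp).offset() - om.offset()));
        }
    }
}
```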


Daily Flink monitoring mainly covers the following categories:

  • Flink job monitoring, with alarms connected to Unicom's unified SkyEye alerting platform;
  • job running status and checkpoint duration;
  • operator latency, backpressure, traffic, and record counts;
  • TaskManager CPU and memory usage, JVM GC, and other such metrics.

3. Flink-Based Cluster Governance


We also built our cluster governance platform on Flink. The background for building it: our clusters total more than 10,000 nodes, with up to 900 nodes in a single cluster and more than 40 clusters in all; the total single-copy data volume has reached 100 PB; 600,000 jobs run every day; and the NameNode of a single cluster holds up to 150 million files.


With the rapid growth of the company's business, data requirements are becoming more complex, the computing power needed keeps increasing, cluster sizes keep growing, and there are more and more data products, all of which poses major challenges for the Hadoop clusters:

  • The huge number of files puts heavy pressure on the NameNode and affects the stability of the storage system.
  • There are a large number of small files, so reading the same amount of data means scanning more files, driving up NameNode RPC traffic.
  • There are many empty files, which likewise means scanning more files and more RPC calls.
  • The average file size is small; at the macro level this is another symptom of pervasive small files.
  • Files are generated continuously in production, so job output files need tuning.
  • There is a lot of cold data and no cleanup mechanism, wasting storage resources.
  • Resource load is high, yet expansion is expensive and cannot go on indefinitely.
  • Long-running jobs affect product delivery.
  • Jobs consume excessive resources, over-occupying CPU cores and memory.
  • Data skew in jobs leads to very long execution times.


To address these challenges, we built a Flink-based cluster governance architecture. We collect resource-queue information, analyze the NameNode's metadata file (the fsimage), and collect job information from the compute engines, and on top of that we build HDFS portraits, job portraits, data lineage, redundant-computation analysis, RPC portraits, and resource portraits for the clusters.
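As an illustration of the fsimage analysis, the sketch below tallies small and empty files per top-level directory from a dump produced by HDFS's offline image viewer (`hdfs oiv -p Delimited`). The column positions and the 10 MB threshold are assumptions to be adjusted to the actual dump layout:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

/** Scans a delimited fsimage dump and tallies small/empty files per top-level directory. */
public class SmallFileScan {

    private static final long SMALL_BYTES = 10L * 1024 * 1024; // illustrative 10 MB threshold

    public static void main(String[] args) throws IOException {
        // Produced beforehand with: hdfs oiv -p Delimited -i <fsimage> -o fsimage.tsv
        Map<String, long[]> byDir = new HashMap<>(); // dir -> {files, small, empty}
        try (Stream<String> lines = Files.lines(Paths.get("fsimage.tsv"))) {
            lines.skip(1).forEach(line -> {          // skip the header row
                String[] f = line.split("\t");
                // Assumed Delimited layout: path at index 0, FileSize at index 6,
                // Permission at index 9; directory permissions start with 'd'.
                if (f.length < 10 || f[9].startsWith("d")) return;
                long size = Long.parseLong(f[6]);
                long[] agg = byDir.computeIfAbsent(topLevel(f[0]), k -> new long[3]);
                agg[0]++;
                if (size == 0) agg[2]++;
                else if (size < SMALL_BYTES) agg[1]++;
            });
        }
        byDir.forEach((dir, a) -> System.out.printf(
                "%s files=%d small=%d empty=%d%n", dir, a[0], a[1], a[2]));
    }

    private static String topLevel(String path) {
        int second = path.indexOf('/', 1);
        return second > 0 ? path.substring(0, second) : path;
    }
}
```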


Resource portrait: we analyze multiple resource queues across multiple clusters simultaneously, collecting their IO, metrics, and so on at minute-level granularity, so the resource usage trends of the whole cluster and of individual queues can be viewed in real time.

Storage portrait: we analyze the distributed storage of multiple clusters globally and along multiple dimensions in a non-intrusive way: for example, where the file counts are concentrated, where the small files are, and where the empty files are. For the distribution of cold data, we drill down to the partition directories of each database and table.

Job portrait: for the jobs of different compute engines across the whole product line on multiple clusters, we collect information in real time and, from the time dimension, the queue dimension, and the job-submission source, and across elapsed time, resource consumption, data skew, large throughput, high RPC, and other aspects, we find the problem jobs and filter out those that need optimization.

Data lineage: by analyzing the 100,000-plus SQL statements in the production environment, we draw a non-intrusive, complete, highly accurate data lineage graph. For any period it provides table-level and account-level call frequency, the dependencies between data tables, process changes along production lines, the impact scope of processing failures, and insight into garbage tables.

In addition, we built portraits for user operation auditing and for metadata.


The figure above is the cluster-governance storage dashboard. Besides macro indicators such as the total number of files, the number of empty files and empty folders, the number of cold directories, the amount of cold data, and the proportion of small files, we also analyze the cold data: for example, when data was last accessed and how much data was accessed in a given month, which shows the time distribution of the cold data; and which tenants hold the files under 10 MB, under 50 MB, and under 100 MB. Beyond these indicators, we can also pinpoint exactly which database, which table, and which partition the small files are in.


The figure above shows the results of cluster governance. You can see that the time during which resource load stays at 100% has shortened significantly, the number of files has dropped by more than 60%, and the RPC load has also fallen sharply. This saves tens of millions of yuan every year, has resolved the long-standing resource shortage, and has reduced the number of machines needed for expansion.

4. Future Plans


At present we do not yet have a complete real-time stream management platform, and monitoring is relatively scattered, so developing a unified management and monitoring platform is imperative.

Facing ever-growing demand, deep customization saves resources and increases the scale we can support, but its development efficiency is not ideal. For scenarios with smaller data volumes, we are also considering building a general-purpose platform with Flink SQL to improve R&D efficiency.

Finally, we will continue to explore Flink's applications in the data lake.


FFA 2021 live replay & talk PDF download

For more Flink technical discussion, scan the QR code to join the community DingTalk group, and follow the official account to get the latest technical articles and community news first.

