Background
As a national-level travel and local-services platform with more than 100 million daily active users, Gaode relies on very large backend clusters to serve its massive user base. From the user's perspective, any failure has a large impact. Services are deployed across three data centers in different regions, which makes both the online environment and the call links complex. Under these conditions, it is essential to keep failures from hurting users, to plan capacity and disaster recovery across complex links, to detect problems as early as possible, and to respond to emergencies through flow control and rehearsed contingency plans. None of this work can wait until something goes wrong. We need a way to verify performance in advance, and that is full-link stress testing: letting real traffic arrive ahead of time.
Full-link stress testing is an important means of guaranteeing the stability of online services, and it matters a great deal to Gaode. Gaode's full-link stress testing platform, TestPG, was built from scratch. Now that stress testing has become routine, TestPG can basically support all of Gaode's full-link and daily stress tests, and it has met the platform's early goals of fast stress testing, precise stress testing, and full-link stress testing. Corpus production (traffic processing, that is, preparing the request data replayed during a stress test) is an important part of full-link stress testing, and it is the focus of this article.
A full-link stress test can be summarized in three steps: processing traffic before the test (that is, producing the corpus); choosing the load model and running the test; and analyzing results and locating problems after the test. In every full-link stress test, traffic processing is the most time-consuming step of the whole process. In the past, operations staff collected logs and handed them to test engineers, who wrote scripts to process them. This took a lot of time, cost a great deal, and suffered from problems such as expired requests. To address this, TestPG standardized the corpus format for Gaode stress tests early on and unified the traffic processing flow. As Gaode's full-link stress testing evolved, however, two main problems emerged:
- Corpus production lacks unified control. Although the platform standardized the data format early on, each business line simply processed its traffic to fit that format in its own way. The production process itself had no unified, standardized control, so producing a corpus was still very costly. For full-link stress tests in particular, corpus preparation remained the most time-consuming part.
- Precise load control at the interface level is not enough. As a national travel application, Gaode sees traffic that is strongly affected by weather, terrain, and holidays. Take driving navigation as an example: on ordinary days most requests are for short distances, while during National Day and the Spring Festival most are for long distances. The backend computing power required by long-distance navigation grows nonlinearly, even by multiples. Yet long-distance and short-distance navigation are the same interface as far as the stress testing platform is concerned, and the platform's precise load control only works at the interface level. It cannot simulate load at the level of interface features.
To address these two problems, Gaode's full-link stress testing team set up a corpus intelligence project.
How we solved the problems
Standardizing traffic diversion
At that time, Gaode's full-link stress testing already covered most services but was still evolving. Each business line processed its own corpus for stress testing, and the sources were not uniform: logs, ODPS, and recorded traffic were all common. To bring corpus production under unified control, the first step was to unify the input. The input had to be cheap and efficient to obtain, and recorded traffic fit well. Our investigation also found that other business scenarios at Gaode had a strong demand for traffic recording. However, there had been no uniform way to record traffic, and business lines copying traffic on their own often made online machines unstable. So the first task was to unify traffic recording across Gaode, that is, to standardize traffic diversion.
Moving corpus production onto the platform
With the input unified, the next step in controlling corpus production was to turn recorded traffic into a corpus that conforms to the platform's standard, and to move that whole transformation onto the platform. Each Gaode business line has its own characteristics, though. If the platform wrote custom processing logic for every business line, the cost would be huge, and because the platform team is not deeply familiar with each business, mistakes would be likely. At the same time, parts of corpus processing are common to all business lines. We therefore needed a mechanism that supports business-specific customization while still running the platform's general processing logic. We finally chose Flink to implement the traffic processing logic.
Because traffic diversion is standardized, a business team only needs to look at the format and content of the recorded traffic and write a Flink UDF (user-defined function) to cover its custom needs. The general logic that follows, such as storing the corpus, is handled by a Flink sink plug-in. This design provides the general processing logic, supports business-specific needs, and scales well.
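The split between a business-specific UDF and a shared sink can be sketched in plain Python. This is only a stand-in for the idea, not Flink code and not the platform's implementation: `navigation_udf`, `produce_corpus`, and the record fields are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, str]


def navigation_udf(record: Record) -> Optional[Record]:
    """Hypothetical business UDF: keep navigation requests and normalize the path."""
    if not record["path"].startswith("/navigation"):
        return None  # drop traffic this business line does not replay
    return {"interface": record["path"].rstrip("/"), "body": record["body"]}


def produce_corpus(traffic: Iterable[Record],
                   udf: Callable[[Record], Optional[Record]],
                   sink: List[Record]) -> int:
    """General logic: apply the business UDF, then hand each result to a shared sink."""
    written = 0
    for record in traffic:
        corpus_line = udf(record)
        if corpus_line is not None:
            sink.append(corpus_line)  # stand-in for the platform's storage plug-in
            written += 1
    return written


recorded = [
    {"path": "/navigation/route/", "body": "from=A&to=B"},
    {"path": "/search/poi", "body": "q=coffee"},
]
stored: List[Record] = []
count = produce_corpus(recorded, navigation_udf, stored)
```

Only the function passed as `udf` changes from one business line to another; the loop and the sink stay the same, which is the scalability argument made above.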
Making the corpus intelligent
As mentioned above, Gaode's traffic is strongly affected by environmental factors, and achieving precise load control at the level of interface features was the other big problem at the time. The platform already had precise load control at the interface level, so it only needed the requests of each interface to be classified by feature and the feature distribution of real traffic to be supplied. But the feature distribution of traffic changes in real time. Supplying a feature distribution that matches peak traffic is the ultimate goal of corpus intelligence.
Corpus intelligence has to go through three stages. The first stage is traffic feature analysis. We need to identify the factors that drive traffic changes. In the traffic itself these show up as the distribution of specific parameters, so the question is which parameters change with the external environment. Most Gaode business lines already have rough analysis results that can be used directly at first; finer-grained feature analysis is needed later.
The second stage is traffic feature extraction. Once the feature parameters are known, they have to be extracted and counted so that the statistics can later feed intelligent prediction. Where should the extraction happen? After weighing the options, we found corpus production to be the most appropriate place. Traffic diversion copies the traffic and corpus production processes it, so it is natural to extract feature parameters in that step. Corpus production is also easy to extend: special needs are covered by UDFs, and feature extraction as a whole can live in the general logic.
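As a minimal sketch of what extracting and counting a feature parameter could look like, the snippet below classifies navigation requests by route length and computes the share of each class. The feature, the 100 km threshold, and the function names are assumptions made for illustration; the article does not specify which parameters the platform extracts.

```python
from collections import Counter
from typing import Dict, Iterable


def distance_bucket(distance_km: float) -> str:
    """Hypothetical feature: classify a navigation request by route length."""
    return "long" if distance_km >= 100 else "short"  # threshold is an assumption


def feature_distribution(distances_km: Iterable[float]) -> Dict[str, float]:
    """Count requests per bucket and convert the counts to shares of the total."""
    counts = Counter(distance_bucket(d) for d in distances_km)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()}


distribution = feature_distribution([3.5, 12.0, 250.0, 8.1])
```

With three short routes and one long route, `distribution` gives the short bucket a share of 0.75 and the long bucket 0.25.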
The third stage is intelligent prediction and machine learning. With statistics on the feature parameters, the platform can take the traffic features of National Day or the Spring Festival in past years, combine them with this year's business traffic trend, and predict data that matches the traffic features of this year's holiday. This enables precise stress testing at the level of interface features, makes the full-link stress test realistic, and safeguards the stability of Gaode Map services. In the future, machine learning can also be used to discover the feature parameters that drive traffic changes and to collect and analyze them automatically, making the corpus intelligent in the true sense.
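One deliberately naive way to combine last year's holiday features with this year's trend is sketched below. It is not the platform's prediction method, which the article does not describe: the ratio-based adjustment, the function name, and all numbers are assumptions used only to make the idea concrete.

```python
from typing import Dict


def predict_peak_distribution(last_peak: Dict[str, float],
                              last_daily: Dict[str, float],
                              this_daily: Dict[str, float]) -> Dict[str, float]:
    """Scale last year's peak share of each feature by how much its everyday
    share changed this year, then renormalize so the shares sum to 1."""
    scaled = {k: last_peak[k] * this_daily[k] / last_daily[k] for k in last_peak}
    total = sum(scaled.values())
    return {k: v / total for k, v in scaled.items()}


last_peak = {"short": 0.4, "long": 0.6}    # hypothetical holiday shares, last year
last_daily = {"short": 0.8, "long": 0.2}   # hypothetical everyday shares, last year
this_daily = {"short": 0.7, "long": 0.3}   # hypothetical everyday shares, this year
predicted = predict_peak_distribution(last_peak, last_daily, this_daily)
```

Because long-distance requests grew in everyday traffic this year, the predicted holiday share of long-distance requests comes out higher than last year's 0.6. A real system would replace this heuristic with a model trained on the collected feature statistics.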
Overall design
All traffic diversion is handled by the in-house unified traffic diversion platform, which buffers traffic in Kafka and finally lands it in ODPS. The corpus production service connects directly to the diversion platform and only needs to process the data from ODPS.
The corpus production service is implemented entirely with Flink. Users only need to write a Flink UDF to cover the custom needs of their business line. The UDFs support multiple parameters, so users can write a UDF whose parameters are passed in dynamically at execution time, which solves problems such as expired requests.
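The request-expiration case can be illustrated with a small plain-Python function that overwrites parameters of a recorded request with values supplied at run time. It is a sketch of the idea rather than a Flink UDF; `refresh_request`, the `ts` parameter, and the example URL are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def refresh_request(url: str, params: dict) -> str:
    """Overwrite query parameters of a recorded request with run-time values.

    Repeated query keys collapse to their last value, which is acceptable
    for this illustration.
    """
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.update({key: str(value) for key, value in params.items()})
    return urlunsplit(parts._replace(query=urlencode(query)))


recorded = "https://example.invalid/route?from=A&to=B&ts=1500000000"
fresh = refresh_request(recorded, {"ts": 1700000000})
```

Here the stale `ts` value recorded with the request is replaced by the one passed in when the corpus is produced, so the replayed request is not rejected as expired.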
The Flink sink is a plug-in developed by the platform that parses the Flink source table. It analyzes and extracts traffic features and writes the produced corpus to OSS, organized by interface name, for the platform to use in stress tests. At present, traffic features are defined by the business lines themselves and added to the platform. During execution, the Flink sink obtains the feature definitions through the platform's open API, collects the corresponding data, and reports it back to the platform. The platform then applies machine learning to this data to predict traffic features that match peak traffic, for use in full-link stress tests.
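Organizing the corpus by interface name can be sketched as follows. A local temporary directory stands in for OSS, and the file naming scheme and the function name `write_corpus_by_interface` are assumptions; the real sink is a Flink plug-in whose storage layout the article does not describe.

```python
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, Tuple


def write_corpus_by_interface(lines: Iterable[Tuple[str, str]],
                              root: Path) -> Dict[str, Path]:
    """Group (interface, request) pairs and write one corpus file per interface."""
    grouped = defaultdict(list)
    for interface, request in lines:
        grouped[interface].append(request)
    written = {}
    for interface, requests in grouped.items():
        # Hypothetical naming scheme: "/navigation/route" -> "navigation_route.txt"
        target = root / (interface.strip("/").replace("/", "_") + ".txt")
        target.write_text("\n".join(requests) + "\n", encoding="utf-8")
        written[interface] = target
    return written


with tempfile.TemporaryDirectory() as tmp:
    paths = write_corpus_by_interface(
        [
            ("/navigation/route", "from=A&to=B"),
            ("/search/poi", "q=coffee"),
            ("/navigation/route", "from=C&to=D"),
        ],
        Path(tmp),
    )
    route_file = paths["/navigation/route"].read_text(encoding="utf-8")
```

Keeping one file per interface lets a stress test pick up exactly the interfaces it needs and weight them independently.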
Core features
Iflow traffic diversion platform
Based on the analysis above, Gaode's engineering efficiency team took on the challenge and built the Iflow traffic diversion platform in just a few months, bringing Gaode's traffic diversion under unified control, as shown in the following figure:
The Iflow platform manages Gaode's traffic diversion as tasks. At present it copies traffic with a diversion plug-in (more diversion methods will be supported in the future), buffers the traffic in Kafka, and finally writes it to ODPS, where users simply extract the data they need. Starting a diversion task requires approval from the responsible person, so the related business teams are always informed. This effectively reduces the cost of investigating any incident caused by traffic diversion.
TestPG corpus intelligence
Corpus intelligence on Gaode's full-link stress testing platform consists of three modules: business line management, stress test list management, and interface proportion management. Business line management handles the data for each Gaode link, including associating diversion tasks, starting diversion, diversion records, corpus paths, stress test header management, and triggering corpus production. A business line corresponds to one stress test link; traffic diversion, corpus production, and corpus feature analysis are all done per business line, as shown in the following figure:
Features:
- Associate diversion tasks: links the business line to a task on the diversion platform and configures the related parameters.
- Start a diversion task: starts the task on the diversion platform. When diversion finishes, corpus production is triggered automatically. The user-written Flink UDF and the platform's Flink plug-in then produce the corpus and extract the feature parameters.
- Corpus path: each time traffic diversion triggers corpus production, the platform automatically generates a corpus path, which users can select when creating a corpus.
- Stress test header management: each business line has its own characteristics, and its headers differ accordingly. This feature manages the header content sent to HTTP services during stress tests.
- Trigger corpus production: corpus production can be triggered in two ways. First, once a diversion task is associated, starting it automatically triggers corpus production, including feature parameter extraction and the related steps. Second, after a successful diversion, users may want to adjust the UDF or other settings, and can then trigger corpus production manually with this button.
Stress test list management handles the interfaces under stress test. When a company starts stress testing, its business lines have to follow, and the service changes this requires take a long time. To make this manageable, the full-link stress testing platform manages Gaode's interfaces in one place, as shown in the following figure:
The stress test list is populated automatically during traffic diversion. Whenever diversion encounters an interface that is not in the stress test list, it reports the interface to the stress testing platform. The platform finds the responsible person through the associated application and pushes a confirmation request. If the interface can be stress tested, it is confirmed into the stress test list, and from the next corpus production onward it is diverted normally as part of the whitelist. If it cannot be stress tested, it is classified either as exempt from stress testing or as an interface to be followed up. For interfaces to be followed up, the platform uses message notifications to push the business line to make the necessary changes, with the ultimate aim of full-link stress testing that truly covers all interfaces and all links.
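The reporting and classification flow described above can be sketched like this. The status names, the function names, and the in-memory dictionary are illustrative assumptions; the platform's actual data model is not given in the article.

```python
from enum import Enum
from typing import Dict, Iterable, List, Set


class Status(Enum):
    STRESS_TESTABLE = "stress-testable"  # confirmed into the stress test list
    EXEMPT = "exempt"                    # does not need stress testing
    FOLLOW_UP = "follow-up"              # business line must change the service first


def find_unlisted(seen: Iterable[str], stress_test_list: Dict[str, Status]) -> List[str]:
    """Interfaces seen in diverted traffic but missing from the list, in first-seen order."""
    unlisted: List[str] = []
    for name in seen:
        if name not in stress_test_list and name not in unlisted:
            unlisted.append(name)
    return unlisted


def whitelist(stress_test_list: Dict[str, Status]) -> Set[str]:
    """Interfaces whose traffic is used in the next corpus production."""
    return {n for n, s in stress_test_list.items() if s is Status.STRESS_TESTABLE}


stress_test_list = {
    "/navigation/route": Status.STRESS_TESTABLE,
    "/pay/order": Status.EXEMPT,
}
reported = find_unlisted(
    ["/navigation/route", "/search/poi", "/search/poi"], stress_test_list
)
for name in reported:
    stress_test_list[name] = Status.FOLLOW_UP  # pending the owner's confirmation
allowed = whitelist(stress_test_list)
```

In this run only `/search/poi` is reported, and the whitelist still contains only `/navigation/route` until the owner confirms the new interface as stress-testable.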
Interface proportion management handles the interface proportion data provided by BI, together with the proportions adjusted during each full-link stress test to approximate real conditions, and keeps them as a reference for later full-link stress tests. In the future, the traffic feature statistics extracted during corpus production will be analyzed to predict traffic proportions that match real conditions, which can then be used directly in full-link stress tests, as shown in the following figure:
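To show how interface proportions feed a stress test, the sketch below splits a total load target across interfaces by their relative weights. The interface names, the weights, and the 10,000 QPS target are made-up values, and `allocate_qps` is a hypothetical helper rather than a platform API.

```python
from typing import Dict


def allocate_qps(total_qps: float, proportions: Dict[str, float]) -> Dict[str, float]:
    """Split a total QPS target across interfaces according to their weights."""
    weight = sum(proportions.values())
    return {name: total_qps * share / weight for name, share in proportions.items()}


targets = allocate_qps(
    10000,
    {"/navigation/route": 5, "/search/poi": 3, "/traffic/status": 2},
)
```

With weights of 5, 3, and 2, the three interfaces receive 5000, 3000, and 2000 QPS respectively. Adjusting a weight and recomputing is the kind of per-test tuning the module is meant to record.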
Platform advantages
Platform-based corpus production
Corpus production is connected to the traffic diversion platform and implemented with Flink. It supports both the customization needs of business teams and the platform's general processing logic, and it scales well. The general logic is implemented in the Flink sink, which also adds functions such as traffic feature extraction and so advances corpus intelligence. Users only need to learn enough Flink to write a UDF and then complete the related configuration on the platform. This greatly improves the efficiency and quality of corpus production and marks a leap from a standardized corpus format to a standardized production process.
Corpus intelligence
During corpus production, the platform's Flink plug-in produces a statistical summary of the feature parameters. At present, users only need to configure the relevant features on the platform; the platform then analyzes and summarizes them while the corpus is produced. These statistics support the platform's later intelligent analysis and prediction, enable precise load control at the level of interface features, and ultimately make realistic full-link stress testing possible.
The platform has now automated corpus production and started on the work related to corpus intelligence. The stress test list is reported automatically through traffic diversion, and in the future message notifications will automatically push business lines to resolve outstanding interfaces. The interface proportion management module already supports displaying and adjusting interface proportions. Eventually the platform will predict corpus features intelligently and produce a corpus that matches the real features of peak traffic. All of this moves Gaode's full-link stress testing toward intelligence.
Future outlook
Corpus intelligence on Gaode's full-link stress testing platform has been under development for some time. So far it has automated corpus production and the collection and extraction of feature parameters, laying the foundation for the intelligent features to come. In the future, the platform will use machine learning to analyze the collected feature data and, from the features of the previous year's peak traffic and this year's traffic trend, predict the features of this year's peak. This will enable precise load control at the level of interface features, so that the stress test fully simulates real traffic and becomes a full-link stress test in the true sense.
In addition, the platform will use machine learning to automatically discover the parameters that drive traffic changes, and to extract and analyze them automatically, improving the accuracy of corpus production.
The platform will also include a confidence assessment system that compares real traffic features with predicted ones, analyzes the causes of any error, and further improves prediction accuracy, with the goal of producing traffic that is fully realistic. Combined with the platform's precise stress testing, load models, and monitoring, this can eventually deliver unattended, intelligent full-link stress testing.
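One simple way to compare predicted and real traffic features is the total variation distance between the two distributions, sketched below. The choice of metric, the function name, and the numbers are assumptions for illustration; the article does not say how the confidence assessment will be computed.

```python
from typing import Dict


def total_variation_distance(predicted: Dict[str, float],
                             actual: Dict[str, float]) -> float:
    """Half the sum of absolute differences between two feature distributions.

    0.0 means the distributions are identical; 1.0 means they do not overlap.
    """
    keys = set(predicted) | set(actual)
    return 0.5 * sum(abs(predicted.get(k, 0.0) - actual.get(k, 0.0)) for k in keys)


predicted = {"short": 0.4, "long": 0.6}  # hypothetical predicted holiday shares
actual = {"short": 0.5, "long": 0.5}     # hypothetical observed holiday shares
error = total_variation_distance(predicted, actual)
```

A small distance suggests the predicted corpus resembled the real peak, while a large one points to features whose prediction needs to be revisited.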