
Reading guide: Hello everyone. Today we will mainly share the evolution of our data analysis platform and how some of the data analysis methods we have accumulated on it are applied.
It is divided into four parts:
- Part 1: An introduction to my department: what the data platform department does, which businesses it is involved in, and where its responsibility lies in the overall data flow;
- Part 2: Since we are talking about a data analysis platform, what does data analysis look like, and what does the field of data analysis cover;
- Part 3: How Ant's current data analysis platform came about, how it evolved to the latest version, and some technical details of version 3.0;
- Part 4: Now that we have a data analysis platform, what can data analysis do for us? A concrete engineering application case.
--
01 Introduction to the Data Platform Department

First, an introduction to the data platform department, starting from the overall data flow. The flow begins with data collection and transmission. This involves online business databases such as RDS and OB; logs, such as the file logs written by online applications onto their machines; messages written by online applications; and external files. After collection comes data synchronization into our data warehouse system. There are many kinds of synchronization, such as DRC for DB log parsing and synchronization, log file parsing and collection with SRS, and common synchronization tools such as DataX.
Second, data storage and computation. Reading the figure above from bottom to top: first there is traditional batch computing, such as ODPS and Spark, plus some of the newest frameworks such as Ray (Ant's variant is Raya). Second is real-time stream computing; in the industry there are Storm and JStorm, and at Ant there are Kepler and Spark Streaming. Third, vertically on top of these are the machine learning scenarios, with PAI and TensorFlow inside. Fourth, what users touch most in this system are the one-stop data R&D platform and the one-stop AI R&D platform, built for the data warehouse and for AI respectively.
Finally, after storage and computation, the data faces the application scenarios and the final consumers. The applications in between include report display, data analysis (today's focus), mining and prediction (building algorithms and models), and data decisions (using data to drive online decisions). That is the entire data flow.
The data platform department focuses on the back half: data storage and computation, and data application and consumption. The following focuses on what businesses the data platform department covers.

This picture is the business architecture, covering the businesses the data platform department is involved in. We divide it into three layers. We call what our department builds a data operating system, and it has two parts: the kernel of the data operating system, and the software that users actually touch. Outside these sit the business scenarios.
1. The kernel of the data operating system
- Basic framework
What is in the basic framework, and why does it exist? Take multi-environment adaptation: our complete data platform solution is exported to public cloud and proprietary cloud environments, and the infrastructure differs across them, including the tenant and account systems, the permission system, and process systems such as approval flows. The basic framework is how we build on those underlying environments. Its main purpose is to provide the common capabilities needed by the upper data applications and to shield them from differences in the underlying data environment.
- Core competencies
① Data security: this involves the classification and grading of data assets. Assets of different categories have different security levels; in terms of permissions, their approval policies differ. It may also involve desensitization: how do we desensitize data before consumers are exposed to it;
② Privacy protection: this places more emphasis on privacy, also called data security and data compliance. What we want is to see each company's data flows transparently: what data exists, what the security levels of that data are, and which data involves users;
③ Data quality: mainly during data R&D, across the data lifecycle from publishing to online scheduling. How do we monitor data quality after scheduling and after testing? For offline scheduling, the most important thing is the timeliness of data output, so there is a baseline; this is about guaranteeing the baseline of our tasks;
④ Metadata center: metadata is well understood. Because we have all kinds of engines underneath, such as Spark, ODPS, and MySQL, the metadata center is where their data is unified;
⑤ Data governance: the logic of data governance is to sort out our existing data in line with data quality.
- Data engine
① Task execution and scheduling engine: most of what we do in ETL is task execution and scheduling;
② Data science engine: the data science engine mainly does analysis and business insight. Today's data business platforms may rely more and more on data science engines; more on that later;
③ Decision service engine: here is a scenario for the decision engine. Everyone knows Zhima Credit. Suppose I have a business online, and when making strategies, users at different Zhima Credit levels should see different pages or tiers. This kind of thing needs a data decision. To put it bluntly, the business needs a person's Zhima Credit, and the data service is configured with a decision rule. The decision engine here supports a DSL for configuring decisions; in short, if...else, if...else. Once such a set of rules is configured, it serves online business scenarios; that is the decision service engine (a minimal sketch of such rule evaluation follows this list). That is everything in the data kernel.
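As a hedged illustration of the if...else rule configuration described above, here is a minimal sketch of ordered rule evaluation; the rule format and the field name `zhima_score` are hypothetical, not Ant's actual decision DSL.

```python
# A minimal sketch of an if/else decision rule engine; the rule structure
# and field names (e.g. "zhima_score") are hypothetical illustrations.
RULES = [
    {"if": lambda u: u["zhima_score"] >= 700, "then": "premium_page"},
    {"if": lambda u: u["zhima_score"] >= 600, "then": "standard_page"},
]
DEFAULT = "basic_page"

def decide(user: dict) -> str:
    """Evaluate rules in order and return the first matching decision."""
    for rule in RULES:
        if rule["if"](user):
            return rule["then"]
    return DEFAULT

print(decide({"zhima_score": 650}))  # -> "standard_page"
```

A real decision service would load such rules from configuration and serve them online, but the shape of the evaluation loop is the same.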
2. Desktop of data operating system
On this basis we have built a user-facing data workbench, which mainly includes:
① External data acquisition platform: we need a lot of external data. Take Koubei: a key factor in the rise and fall of its transaction volume is the weather, so we need external weather data; hence an external data acquisition platform;
② Asset management platform: this corresponds to the metadata center. All the data in our system must be standardized and managed; during R&D, developers must go to the data asset management platform to standardize the tables they build;
③ Data R&D platform: the data R&D platform must support multiple engines and unify batch and stream. We write one unified SQL that can be switched to batch and run on ODPS, or switched to real-time and scheduled onto, in our system, Kepler or Spark Streaming. That is what the data R&D platform does; it relies on the task execution and scheduling engine;
④ Data analysis platform: it mainly does self-service multidimensional analysis, plus some intelligent business insight;
⑤ Data decision platform: it provides data capability for online business. Then there is the data experiment platform, built around the A/B experiment: if I ship a new algorithm today, I can route 1% of traffic to the new algorithm and another 1% to the old one, then compare their effects and significance and do some confidence interval analysis. The point is that with the same 1% slices, if one effect is 98% and the other is 95%, without a scientific test there is no way to tell whether those three points come from sample error or from my algorithm. The experiment platform solves this problem (a sketch of such a test follows this list).
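To make the significance check concrete, here is a minimal sketch of a two-proportion z-test under assumed numbers (980/1000 vs. 950/1000 successes on two 1% traffic slices); this is an illustration, not the experiment platform's actual implementation.

```python
# A minimal sketch of the significance test described above: comparing the
# success rates of two 1% traffic slices (all numbers are hypothetical).
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)   # pooled rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))             # two-sided test
    return z, p_value

# New algorithm: 980/1000 successes; old algorithm: 950/1000.
z, p = two_proportion_ztest(980, 1000, 950, 1000)
print(f"z={z:.2f}, p={p:.4f}")  # a small p means the 3-point gap is unlikely
                                # to be sample error alone
```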
On top of these there are vertical scenario services. For example, Ant's data products expose some on-device capabilities, so we can view our data on the mobile client.
The second block is vertical solutions, such as the crowd portrait platform and location services.
The third is the developer center, which mainly addresses a scenario called openness.
This runs from the kernel of the data operating system to the data system desktop, and then to the data business scenarios. That is the general business scope of the data platform department.
--
02 Introduction to the Data Analysis Field
The phrase "data analysis" gets used a lot. So what is data analysis? There are actually many examples of it around us. Let me give an example, and then we will look at the data analysis architecture.

The stages of data analysis include:
- ① Descriptive analysis;
- ② Diagnostic analysis;
- ③ Predictive analysis;
- ④ Guided analysis.
Guided analysis may take two paths. First, as a decision assistant: it tells you what to do and whether to make a decision, and finally generates an action. Second, online machine learning: the machine switches parameters automatically and improves effects by itself; the next step is full automation. Across these stages and levels of data analysis, fewer and fewer people are involved and more and more machines are, but the value keeps growing and the work gets more complex: from hindsight, to insight, to foresight. That is the process, and that is how we understand data analysis. The field looks like this, so data analysis is far more than the phrase itself.
--
03 The Data Analysis Platform
Having covered data analysis, let's introduce Ant's data analysis platform: its evolution history and what is in the latest 3.0 version.
(Ant Financial's Yang Jun: the evolution of the Ant data analysis platform and the application of data analysis methods)
Before the birth of the data platform, consider traditional data analysis. Its pain points were:
- ① Constantly changing reporting requirements;
- ② Long implementation cycles for requirements;
- ③ Development resource bottlenecks (long technical lead times).

Out of these contradictions, version 1.0 of the data analysis platform appeared in 2013. It can be considered a reporting tool: the presentation layer supports drag-and-drop and encapsulates the concepts of dimensions and measures. You drag fields into dimensions, drag fields into measures, and the data comes back: the presentation layer generates a query, which is finally translated into SQL and run against the underlying data source. But at that time most data sat in the relatively slow ODPS, whose performance was unacceptable to users; the other piece was the permission module. Version 1.0 can be understood as a simple reporting tool whose query capability was not very complete (a sketch of the dimension/measure-to-SQL idea follows).
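To make the 1.0 idea concrete, here is a minimal sketch of how a presentation layer might compile dragged dimensions and measures into SQL; the table and column names are hypothetical.

```python
# A minimal sketch of the 1.0 idea: dimensions and measures dragged in the
# presentation layer are compiled into SQL. Table and column names are
# hypothetical.
def build_sql(table: str, dimensions: list, measures: dict) -> str:
    select = dimensions + [f"{agg}({col}) AS {agg}_{col}"
                           for col, agg in measures.items()]
    sql = f"SELECT {', '.join(select)} FROM {table}"
    if dimensions:
        sql += f" GROUP BY {', '.join(dimensions)}"
    return sql

print(build_sql("dws_trade_daily",
                dimensions=["dt", "city"],
                measures={"amount": "SUM", "order_id": "COUNT"}))
# SELECT dt, city, SUM(amount) AS SUM_amount, COUNT(order_id) AS COUNT_order_id
# FROM dws_trade_daily GROUP BY dt, city
```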

After version 1.0, new contradictions appeared:
- ① Insufficient analysis functionality;
- ② Insufficient analysis performance;
- ③ Data capabilities split off from the business workbench.

Against this backdrop we built 2.0. The yellow parts in the 2.0 diagram are the new things:
- ① Datasets: to support more complex analytical models. You can build star models and snowflake models, and create associated datasets;
- ② Multidimensional analysis: this piece was built specifically on Mondrian, using the MDX language for multidimensional analysis;
- ③ Automatic acceleration: once data is included in a dataset, it is synchronized from the slow source (ODPS) to a faster store such as RDS, and re-synchronized whenever the dataset changes. This step is the acceleration; at query time, if the data has been accelerated, the query is routed to the faster data source;
- ④ Openness: the earliest openness was relatively simple: iframe embedding and a data query interface, just these two capabilities. With iframe embedding you can embed reports into your own business workbench without leaving your platform; and because iframe embedding can only embed a whole page, the open query interface makes it easier to assemble your own flow.
This was the data analysis platform's 2.0, from 2014 to 2016. During those years we iterated on charts and enriched many capabilities, including email subscriptions to handle weekly and monthly report scenarios.
After this, combined with our earlier understanding of data analysis, we wanted to redefine analytical insight.

In 2017 we set out to do this: from descriptive analysis to diagnostic analysis to prediction to guidance. In this picture we are still at the stage of using descriptive analysis, analyzing what our own users look like.

It is divided horizontally into three sections, stratifying customers by capability: by their role and by their ability. We divide the users of the data analysis platform into two categories: B-side people who do data analysis on the business side, and C-side people who look at the results of data analysis and make decisions.

1、 Scenario application layer
2、 General layer
① Visualization: users define their own visualization components;
② Analysis algorithms: custom analysis algorithm operators;
③ Analysis insight solutions: at a coarser granularity, packaging these raw algorithms into an analysis workflow.
3、 Core capabilities of the middle platform
① Collaboration;
② Query routing;
③ Scientific computing engine;
④ Connectors for different engines;
⑤ Intelligent precomputation;
⑥ Intelligent synchronization.

The following gets into the technical details of the data analysis platform's middle layer. Its core capabilities are mainly the following.
1、 The open service facade, whether SDK, API, or DSL. The most important thing about a data science platform is to have a primary data analysis language. This language covers data analysis capabilities and algorithm capabilities: it can invoke an algorithm operator, feed a SQL result into an algorithm operator, and then run multidimensional analysis on the operator's output. On top of the language, the data science platform provides capabilities such as light data processing, multidimensional analysis, scientific analysis, and composite analysis. After that comes execution: the analysis process expressed in the language is routed to the engines below, the execution process is optimized, and it is adapted to the multidimensional engines (a sketch of this composition idea follows).
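As a hedged illustration of that composition (a SQL step feeding an algorithm operator, whose output feeds multidimensional analysis), here is a minimal sketch; the `Pipeline` API is invented for illustration and is not the platform's actual DSL.

```python
# A minimal sketch of the composition the facade describes: a SQL step feeds
# an algorithm operator, whose output feeds a multidimensional analysis step.
# The Pipeline API here is invented for illustration.
class Pipeline:
    def __init__(self):
        self.steps = []

    def sql(self, query: str):
        self.steps.append(("sql", query)); return self

    def algo(self, operator: str, **params):
        self.steps.append(("algo", operator, params)); return self

    def cube(self, dims: list, measures: list):
        self.steps.append(("cube", dims, measures)); return self

    def explain(self):
        for step in self.steps:   # a real engine would route each step to a
            print(step)           # suitable execution backend

(Pipeline()
    .sql("SELECT uid, amount, dt FROM trades")
    .algo("kmeans", k=5)
    .cube(dims=["cluster", "dt"], measures=["SUM(amount)"])
    .explain())
```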
2、 Core capabilities
Underneath this there are three core capabilities:
① Intelligent synchronization center: its biggest purpose, the problem it solves, is to accelerate data to a fast data source before the user accesses it, as far as possible. If synchronization is slow, what the user sees is stale data: they visit our platform and see the data we accelerated yesterday. The intelligent synchronization center solves this problem;
② Intelligent precomputation: we found that many reports have fixed dragged-out fields; yesterday and today differ only in the date. So we precompute for them in advance, calculating and storing the results ahead of time;
③ Execution engine: the execution engine must adapt to the language above, perform some advanced analysis here, and adapt to multiple source data engines below. The core capabilities of the data analysis platform revolve around a few keywords. The first is intelligent: both the data analysis methodology we provide and some of our engineering capabilities here are intelligent. The second is self-service: we want users to serve themselves on the platform. The third is end-to-end: whatever a user does, whatever data capability they need, they should not have to jump elsewhere; they can solve the problem in one stop. The fourth is embedded: it can be enabled inside various business platforms. These are the four keywords of the platform's core capabilities; the rest is foundational detail, mainly about this layer.

The first topic is the query: how is a query executed in the data analysis platform? There are many query scenarios, such as visualization, intelligent augmented analysis, and others. These query models are uniformly translated into what the platform calls a Dataset-based Logical Plan. This Logical Plan depends on dataset metadata and row-level permissions (on the same dataset, different people can only see different rows; that is row-level permission).
Next, the dataset-based plan is translated into a table-based logical execution plan, the Table Logical Plan. From the Table Logical Plan we get the table's metadata and translate further. Because one piece of data may be visible in several places, the acceleration process may have accelerated it to different engines. The reason is that different engines handle different analysis scenarios: some engines can quickly support the visualization of multidimensional analysis, some can support intelligent augmented analysis, so one piece of data uses multiple engines. Here the Table Logical Plan is translated into a DataSource Logical Plan, meaning a concrete source is selected. There may be cache routing, acceleration routing, and precomputation routing, plus rules and functions.
After multiple candidate data sources are selected, they pass through a cost model, and the best data source is chosen and executed. The cost model considers many factors. First, query features: how many fields this GROUP BY has, what the cardinality of those fields is, how many COUNT DISTINCTs there are. Second, data features: what the data distribution looks like. Third, user features: for example, Ant executives have higher priority and will be given faster engines.
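A minimal sketch of such a cost model follows, with hypothetical weights and source statistics; it scores candidate sources on query, data, and user features and picks the cheapest.

```python
# A minimal sketch of the cost model: score each candidate data source on
# query features, data features, and user features, then pick the cheapest.
# The weighting scheme and numbers are hypothetical.
def cost(source: dict, query: dict, user: dict) -> float:
    c = source["base_latency_ms"]
    c += query["group_by_fields"] * source["per_field_ms"]
    c += query["count_distinct"] * source["distinct_penalty_ms"]
    c *= source["rows"] / 1e6                    # data feature: table size
    if user.get("vip"):                          # user feature: executives
        c *= source.get("vip_discount", 1.0)     # get faster engines
    return c

sources = [
    {"name": "ads",  "base_latency_ms": 50,  "per_field_ms": 5,
     "distinct_penalty_ms": 400, "rows": 2e6, "vip_discount": 1.0},
    {"name": "odps", "base_latency_ms": 800, "per_field_ms": 1,
     "distinct_penalty_ms": 20,  "rows": 2e6, "vip_discount": 1.0},
]
query = {"group_by_fields": 3, "count_distinct": 0}
best = min(sources, key=lambda s: cost(s, query, {"vip": False}))
print(best["name"])  # "ads"; a count_distinct-heavy query would flip the
                     # choice toward the engine with the low distinct penalty
```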
Once an optimal data source is selected, there is a layer of abstraction: we define an SPI abstraction over DataSource, with metadata, connection capability, execution capability, dialect conversion capability, and permission control capability. Dialect here means that for the same query, MySQL syntax, ODPS syntax, and Hive syntax are completely different, so dialect conversion adapts one language to the various engine languages.
With this SPI abstraction in place, we adapt many Plugins, and Plugins can be loaded dynamically: as soon as a Plugin is loaded, we support querying that kind of data source. Finally the query is executed. That is the whole query process of the data analysis platform (a sketch of the SPI idea follows).
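Here is a minimal sketch of what such a DataSource SPI plus dynamically registered Plugins could look like; the method names are assumptions, not the platform's actual interface.

```python
# A minimal sketch of the DataSource SPI described above; the method names
# are assumptions for illustration.
from abc import ABC, abstractmethod

class DataSourcePlugin(ABC):
    @abstractmethod
    def metadata(self, table: str) -> dict: ...   # schema, partitions, ...
    @abstractmethod
    def connect(self) -> object: ...              # connection capability
    @abstractmethod
    def execute(self, sql: str) -> list: ...      # execution capability
    @abstractmethod
    def to_dialect(self, sql: str) -> str: ...    # MySQL vs ODPS vs Hive syntax
    @abstractmethod
    def check_permission(self, user: str, table: str) -> bool: ...

PLUGINS = {}  # name -> plugin class

def register(name: str):
    """Plugins can be loaded dynamically; registering one enables queries
    against that kind of data source."""
    def deco(cls):
        PLUGINS[name] = cls
        return cls
    return deco

@register("mysql")
class MySQLPlugin(DataSourcePlugin):
    def metadata(self, table): return {"table": table}
    def connect(self): return None
    def execute(self, sql): return []
    def to_dialect(self, sql): return sql   # rewrite to MySQL dialect here
    def check_permission(self, user, table): return True
```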

Acceleration was just mentioned; it is synchronization, and in 3.0 we call it intelligent synchronization. We have already said what problem it solves: accelerate the data to the right engine before the user accesses it, as early as possible. Why "the right engine"? Because there are different analysis demands on a table: multidimensional analysis, advanced analysis, or building algorithm models, and different engines support different scenarios. When is synchronization triggered? It may be triggered by the user, by a scheduled task, or by data changes, whether the metadata or the data itself has changed.
After that comes synchronization validation: there may be quota controls and user permission controls. After validation an intelligent strategy is applied. The intelligent strategy does one thing: match scenarios to strategies. For example, the VIP scenario (the executives just mentioned); scenario-level query features, looking at the queries on a table, for example whether it is used for multidimensional analysis or for algorithms; and finer query features, for example a certain field is often used in WHERE conditions, or often appears in GROUP BY. The matching strategies: for VIP reports, to guarantee executive users, a table is accelerated to multiple destinations. It is possible to accelerate one table to multiple destination engines, or to build different storage formats for it within the same engine. Take the user table: first, it is often used for multidimensional analysis; second, it is often used for joins, very commonly joining the trade table by uid. So when we synchronize the user table, one table gets multiple destinations: one copy synchronized for basic multidimensional analysis, and one copy hashed by uid in advance, so that subsequent joins by uid are more efficient; likewise the trade table is also hashed by uid. That is one table, multiple destinations. There is also table structure optimization: for example, when syncing to MySQL, if we find a table is small, say 200,000 rows or under a million, we sync it into MySQL; if its query features often use a certain field in WHERE, we put an index on that field. That is table structure optimization. It is similar to query routing: with query features, data distribution, and the features each data source supports, synchronization priorities are then set.
Synchronization tasks are queued by priority in a distributed queue for execution. The last step is executing the sync task, which has just two layers: the synchronization source (where to sync from) and the synchronization target (where to sync to). After SPI abstraction, the idea is similar to the query side: implement many Plugins, and you can sync from any source to any target. That is the technical explanation of intelligent synchronization (a sketch of the strategy matching follows).
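A minimal sketch of the strategy matching described above (one table, multiple destinations; pre-hashing by the join key; indexing the hot filter field); the thresholds and engine names are hypothetical.

```python
# A minimal sketch of the "one table, multiple destinations" strategy:
# query features decide how many copies to sync and how to lay each one out.
# Thresholds and destination names are hypothetical.
def plan_sync(table: dict) -> list:
    plan = []
    feats = table["query_features"]
    if "multidim" in feats["scenarios"]:
        plan.append({"dest": "olap_engine", "layout": "columnar"})
    if "join" in feats["scenarios"]:
        # pre-hash by the join key (e.g. uid) so later joins are cheaper
        plan.append({"dest": "mpp_engine",
                     "layout": f"hash({feats['join_key']})"})
    if table["rows"] < 1_000_000:
        copy = {"dest": "mysql", "layout": "row"}
        if feats.get("frequent_where"):
            copy["index"] = feats["frequent_where"]  # index the hot filter field
        plan.append(copy)
    return plan

user_table = {"rows": 200_000,
              "query_features": {"scenarios": ["multidim", "join"],
                                 "join_key": "uid",
                                 "frequent_where": "city"}}
for step in plan_sync(user_table):
    print(step)
```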

The last piece is the intelligent precomputation mentioned earlier. Everyone has heard of Kylin, and we first borrowed Kylin's idea. First, the data analysis platform has many reports, and these reports are clearly solidifiable; second, many tables in the platform are shared, a business department has many people who all use the same tables, and much of their drag-and-drop analysis overlaps. Hence precomputation was introduced.
How does the whole precomputation process work? The first step is information collection, which comes from several parts: report structures; defined dataset structures, for example which tables are defined and which joins are analyzed; and historical queries, the drag-and-drop history. From these I extract features: dimensions, ordinary measures, distinct measures, and the table or subquery involved: which table it is, which subquery it is, what its filter conditions are, and what its time cost is. With these features I build the concept called a cube, namely Cube Design. The cube design logic is simple: take the dimensions and measures of the same table or subquery and build them into a tree. The finest grain is, say, GROUP BY over 4 fields; I can roll that up into GROUP BY over 3 fields, or 2 fields, or 1 field, and in this way build a Cube. Having built the tree does not mean every node of the tree is computed, because the dimension combinations are too many to compute exhaustively. So a Cube Planner does pruning: rule-based pruning, for example if a query takes less than three seconds or less than one second I won't build it for you, because the engine can already satisfy you; plus some greedy algorithms that optimize to maximize the benefit of the tree. Then comes the physical build. The physical build is the same story: at Ant the design involves multiple engines, and as in the other core technical details there is an SPI abstraction, but the inside of this SPI is different. The build-engine SPI has incremental builds, full builds, single-point builds, and fast builds, these different capabilities. With these capabilities, engines such as ODPS and Spark do the final build. After the build, query routing will route to metadata that has passed through the intelligent precomputation center, choosing an optimal, already-computed result; the optimal one is generally the one with the fewest GROUP BY fields. That is intelligent precomputation (a sketch of the lattice-and-pruning idea follows).
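In the spirit of Kylin's Cube Planner, here is a minimal sketch of enumerating the cuboid lattice from the finest grain and pruning it; the cost function and pruning thresholds are hypothetical.

```python
# A minimal sketch of the cube idea borrowed from Kylin: from the finest
# grain (e.g. GROUP BY over 4 fields) enumerate coarser cuboids, then prune
# the ones not worth precomputing. Costs and thresholds are hypothetical.
from itertools import combinations

def build_lattice(dims: list) -> list:
    """All non-empty dimension subsets, finest to coarsest."""
    return [set(c) for n in range(len(dims), 0, -1)
                   for c in combinations(dims, n)]

def plan(dims, estimated_ms, budget=5):
    lattice = build_lattice(dims)
    # Rule-based pruning: skip cuboids the engine already answers quickly.
    kept = [c for c in lattice if estimated_ms(c) > 3000]
    # Greedy cap: keep the most expensive cuboids within the build budget.
    return sorted(kept, key=estimated_ms, reverse=True)[:budget]

cost = lambda cuboid: 1500 * len(cuboid)   # stand-in cost model
for cuboid in plan(["dt", "city", "channel", "category"], cost):
    print(sorted(cuboid))
```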
The bottom row corresponds to what was covered above: the core capabilities of the data platform and the several technical details just described. With these, our data platform has produced some results.

--
04 Data Analysis Applications
Now that the data analysis platform has these technologies, what can it do for us? Or, if I use the data analysis platform to help myself, what is the routine? The example here is data analysis driving the performance optimization of the data analysis platform itself, an example of using data analysis to drive engineering optimization. The first step is to see what the problem is.

Different people expect queries to reach second-level latency, yet some individual report queries take 90 seconds, and queries that go through ODPS are very slow, reaching minute-level. So people complain about RT. Users expect second-level latency, but, as with stability, in reality 100% of queries cannot reach second-level; there are always exceptions and cases you cannot anticipate. This is the problem: we need to fix RT.
Next we have to make the problem measurable. Only when I can measure it can I optimize it and know to what extent it has been solved. So the second step is to define indicators.

As just said, we cannot reach 100%, so we defined two indicators: an experience indicator and a bottom-line indicator. The experience indicator is that 98% of queries have RT within one second; the bottom-line indicator is that 100% of queries have RT within 10 seconds, because we are confident about the 10-second bound. Why call it the experience indicator? Because it relates to most users, and they can feel it. Why have a bottom-line indicator? Because as the platform's user base grows, a small number of slow queries can drag the platform down, and those users come to you every day; the more users there are, the more people will find you. Hence the bottom-line indicator. This touches on defining a good indicator: a good indicator should be easy to understand, a good indicator should be a ratio, and a good indicator can guide behavior change (a small computation sketch follows).
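A minimal sketch of computing the two indicators from a list of query RTs; the sample values are made up.

```python
# A minimal sketch of the two indicators: share of query RTs within 1s
# (experience, target 98%) and within 10s (bottom line, target 100%).
def indicators(rt_seconds: list) -> dict:
    n = len(rt_seconds)
    return {
        "experience": sum(rt <= 1 for rt in rt_seconds) / n,
        "bottom_line": sum(rt <= 10 for rt in rt_seconds) / n,
    }

print(indicators([0.3, 0.8, 1.2, 0.5, 9.0, 0.2]))
# {'experience': 0.666..., 'bottom_line': 1.0}
```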

We decompose along the business process and the physical architecture. This is a simplification of the data analysis platform: from visualization to server to data query language, which is the request-link perspective; the horizontal axis is the logic-module perspective, for example what the possible data sources are and what the process of a single query is. With this knowledge we abstract the data.

After decomposition comes mathematical abstraction. As the picture above shows, there is cache, precomputation, RDS, and different engines. With these engines I break the within-one-second RT share down into a formula: the denominator is the total number of queries, the numerator is the number of queries answered within one second. I can split the within-one-second query volume by engine; after splitting, each term is that engine's within-one-second count, X1 through X8, the within-one-second query counts of the different sources. Once a particular source is fixed, I can decompose it further along the link: for example, which steps does a precomputed query pass through? First handle row-level permissions, then process the precomputation routing, then query the data source. That is the logic. With this abstraction, we can do data analysis (the decomposition is written out below).
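Written out, the decomposition reads roughly as follows, where N is the total query count and X_i is engine i's within-one-second query count (the split into eight terms follows the text):

```latex
% Share of queries answered within one second, decomposed by engine:
%   N   = total number of queries
%   X_i = number of sub-second queries answered by engine i
\[
  R_{1\mathrm{s}} \;=\; \frac{\#\{\text{queries with } \mathrm{RT} \le 1\,\mathrm{s}\}}{N}
  \;=\; \frac{X_1 + X_2 + \cdots + X_8}{N}
\]
```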

This is our abstraction; once abstracted, we pull out the data. For example, pick a certain source and look at its histogram: the horizontal axis is time cost (time cost is where problems are found), the vertical axis is the count: how many queries within one second, how many within two seconds. With this picture it is easy to see things. First circle the separated intervals in the picture: say I want to solve this segment (the peak), this group of queries. I circle it out first, do multidimensional analysis, and then find out why. After finding the reason, I can ask: if I optimize this spot, how much will my overall indicators improve, for example the one-second share and the ten-second share? I can predict how much this local optimization improves the overall indicators. Why look at overall-indicator improvement in this process? Because our manpower is always limited; I evaluate the ROI, and the indicator with a small investment and a large return must be done first (a sketch of this workflow follows).
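A minimal sketch of this workflow: isolate a slow interval, drill into it by one dimension, and predict the uplift on the experience indicator if that segment were fixed; the field names and sample data are hypothetical.

```python
# A minimal sketch of the workflow above: bucket query RTs, isolate a slow
# peak, drill into it by a dimension, and estimate the uplift on the 1-second
# indicator if that segment were fixed. Field names and data are hypothetical.
from collections import Counter

def drill_down(queries, lo, hi, dim):
    """Distribution of `dim` among queries whose RT falls in [lo, hi)."""
    return Counter(q[dim] for q in queries if lo <= q["rt"] < hi)

def predicted_uplift(queries, segment):
    """Experience indicator now vs. if every query in `segment` became fast."""
    n = len(queries)
    now = sum(q["rt"] <= 1 for q in queries) / n
    fixed = sum(q["rt"] <= 1 or segment(q) for q in queries) / n
    return now, fixed

queries = ([{"rt": 0.4, "source": "cache", "query_mode": "plain"}] * 700 +
           [{"rt": 5.0, "source": "ADS", "query_mode": "count_distinct"}] * 250 +
           [{"rt": 5.0, "source": "ADS", "query_mode": "plain"}] * 50)
print(drill_down(queries, 3, 10, "query_mode"))   # count_distinct dominates
print(predicted_uplift(queries,
      lambda q: q["source"] == "ADS" and q["query_mode"] == "count_distinct"))
# (0.7, 0.95): fixing that one segment lifts the indicator from 70% to 95%
```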

For example, in this process we circle a segment out first and find that one data source (our internal name for it is ADS) appears the most: within this range ADS appears 900-odd times. Circling that out and looking at its other dimensions, drilling down one dimension, query_mode (the query type), I find count_distinct accounts for 92%. So the reason for this segment is that this ADS source cannot handle count_distinct. Having actually found the reason, we then judge how much this point affects our overall slow queries: it accounts for roughly 20%-30% of all slow queries. In other words, just by optimizing this, my overall indicators can improve by 20%-30%. This is a concrete optimization found through the approach just described, and it is only one of them.

To sum up, the routine for driving things with data analysis goes like this. First, define the problem: what problem do you want to solve? Then measure the problem (indicator definition); as Peter Drucker said, without measurement there is no improvement, so define the indicator first. But with only an indicator you can do nothing except monitor, or use it to confirm a hunch. So you must do mathematical abstraction: abstract from the service links and from the system modules, then check whether the corresponding data exists (collect data). With the data, do the analysis, whether descriptive, diagnostic, or predictive; use analytical methods to find the cause, then make decisions and act. The hard parts of this process: making the decision is of course hard; first you need a strong understanding of the business domain, and second you must judge whether the results of the data analysis fit that business understanding. The second difficulty is the data abstraction, which requires a deep understanding of the business, whether from the link or from the module: to solve engineering problems you need to understand the system, and to solve business problems you need a strong understanding of the link. This is the application pattern of data analysis; summed up, this is the routine.
That is all for today's sharing. Thank you.
This article was first published on the WeChat public account "DataFunTalk".