Kangaroo Cloud DTStack: Exploring CBO-Based Spark SQL Optimization
2022-06-10 16:12:00 【51CTO】
Original text (Chinese): Kangaroo Cloud DTStack's exploration of CBO-based optimization in Spark SQL
I. Why Spark SQL CBO Was Chosen
Spark SQL supports two kinds of optimization: rule-based optimization (Rule-Based Optimizer, RBO for short) and cost-based optimization (Cost-Based Optimizer, CBO for short).
1. RBO is the traditional SQL optimization technique
RBO is a mature SQL optimization technique that appeared relatively early. It transforms the SQL syntax tree according to a series of optimization rules and finally produces an execution plan. RBO is an empirical optimization method: rules are matched strictly as defined, so the way a SQL statement is written directly determines its execution efficiency. RBO is also insensitive to the data itself: for a fixed table size, no matter how the intermediate result data changes, the same SQL always produces the same execution plan.
2. CBO is an optimization method that evolved from RBO
CBO is an improvement that evolved out of RBO. It applies optimization rules to transform relational expressions and generate multiple candidate execution plans, then uses statistics (Statistics) and a cost model (Cost Model) to compute and select the physical execution plan with the lowest cost.
3. Comparing the advantages of CBO and RBO
● RBO optimization example
Consider an example: count the number of rows produced by joining table t1 (size: 2 GB) with table t2 (size: 1.8 GB):
SELECT COUNT(t1.id) FROM t1 JOIN t2 ON t1.id = t2.id WHERE t1.age > 24
The figure above shows the physical execution plan generated for this SQL after RBO optimization. As the plan shows, SortMergeJoin is chosen to join the two tables.
Spark provides three join implementations:
1.Broadcast Join
2.ShuffleHash Join
3.SortMerge Join
ShuffleHash Join and SortMerge Join both require a shuffle, so they cost considerably more than Broadcast Join. For Broadcast Join to be chosen, one of the tables must be no larger than spark.sql.autoBroadcastJoinThreshold (10 MB by default).
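As a minimal sketch (the configuration key is standard Spark SQL; the threshold value and app name here are purely illustrative), the broadcast threshold can be raised when building the session:

```scala
import org.apache.spark.sql.SparkSession

// Raise the broadcast threshold to 20 MB (in bytes) so that a table filtered
// down to ~10 MB comfortably qualifies for Broadcast Join.
val spark = SparkSession.builder()
  .appName("broadcast-threshold-demo")
  .config("spark.sql.autoBroadcastJoinThreshold", (20 * 1024 * 1024).toString)
  .getOrCreate()
```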
Now look again at the execution plan above: t1 is 2 GB before filtering but only 10 MB after, while t2 shrinks from 1.8 GB to 1.5 GB. This shows that the RBO optimizer does not care about changes in intermediate data; it chooses SortMergeJoin as the final join based only on the original table sizes, which is clearly not the optimal plan.
● CBO optimization example
With the CBO optimizer, the resulting execution plan is as follows:
It is easy to see that CBO fully accounts for intermediate results. Because it perceives that the filtered intermediate result now satisfies the Broadcast Join condition, the final plan chooses Broadcast Join for the two tables.
● Other advantages
Besides the rigid rule application that can miss the optimal plan, RBO also has a high learning cost: developers must be familiar with most of the optimization rules, or the SQL they write may perform poorly.
● CBO is the better choice for DTStack's Spark SQL optimization
Compared with RBO, CBO is without doubt the better choice: it lifts Spark SQL performance to a new level. Spark is one of the most important components at the bottom of the DTStack platform and carries most of the workload of the offline development platform, so optimizing Spark also makes DTStack more efficient and easier to use. For this reason DTStack chose CBO for research and exploration, to further improve the performance of its products.
II. How Spark SQL CBO Is Implemented
Spark SQL implements CBO in two parts: the first is statistics collection, the second is cost estimation.
1. Statistics collection
Statistics collection splits into two parts: statistics on the original tables and statistics for intermediate operators.
1) Original table statistics
In Spark, original table statistics are collected through the dedicated SQL syntax ANALYZE TABLE. They come at two levels, table level and column level, implemented as follows:
● Table-level statistics
Collected by executing ANALYZE TABLE table_name COMPUTE STATISTICS. The metrics include estimatedSize (the size of the data after decompression) and rowCount (the total number of rows), among others.
● Column-level statistics
Collected by executing ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS column-name1, column-name2, ….
Column-level information consists of basic column statistics and histograms. Basic statistics include the column type, Max, Min, number of nulls, number of distinct values, max column length, average column length, and so on; histograms describe the distribution of the data. Spark does not collect histogram statistics by default; they require the extra setting spark.sql.statistics.histogram.enabled = true.
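As a small illustration (the table name t1 and its columns are hypothetical, and an active SparkSession named spark is assumed), both levels of statistics can be collected and then inspected:

```scala
// Collect table-level and column-level statistics for a hypothetical table t1.
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS")                     // table level
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, age") // column level
// The collected statistics appear in the table's extended description.
spark.sql("DESCRIBE EXTENDED t1").show(truncate = false)
```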
Statistics for the original tables are relatively simple to obtain; computing statistics for intermediate nodes is more complicated, and different operators follow different calculation rules. Spark has many operators, and interested readers can consult the Spark SQL CBO design document:
https://issues.apache.org/jira/secure/attachment/12823839/Spark_CBO_Design_Spec.pdf
2) Statistics for intermediate operators
Let us take the common filter operator as an example and look at how operator statistics are computed. Using the SQL from the previous section, SELECT COUNT(t1.id) FROM t1 JOIN t2 ON t1.id = t2.id WHERE t1.age > 24, consider the statistics of the Filter node (the greater-than predicate on t1) in the generated syntax tree.
Three cases need to be considered here (a sketch of the uniform-distribution case follows the list):
1) The filter constant is greater than max(t1.age): the estimated result is 0 rows.
2) The filter constant is less than min(t1.age): all rows are returned.
3) The filter constant lies between min(t1.age) and max(t1.age). With histograms disabled, the data is assumed to be uniformly distributed by default, and the post-filter row count is estimated as after_filter = (max(t1.age) - 24) / (max(t1.age) - min(t1.age)) * before_filter. With histograms enabled, the estimate becomes after_filter = height(>24) / height(All) * before_filter. In either case, min(t1.age) of the node's output is then set to the filter constant 24.
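The uniform-distribution estimate can be written out as a small sketch (the function and the example values are illustrative, not Spark's internal code):

```scala
// Row-count estimate for a predicate `col > constant` when histograms are
// disabled: values are assumed to be spread uniformly between min and max.
def estimateGreaterThan(constant: Double, min: Double, max: Double,
                        beforeFilter: Long): Long = {
  if (constant >= max) 0L                 // case 1: nothing passes the filter
  else if (constant < min) beforeFilter   // case 2: every row passes
  else {                                  // case 3: linear interpolation
    val selectivity = (max - constant) / (max - min)
    (selectivity * beforeFilter).toLong
  }
}

// Example: t1.age ranges over [18, 60] with 1,000,000 rows; for age > 24,
// (60 - 24) / (60 - 18) * 1000000 ≈ 857142 rows are estimated to survive.
val survivors = estimateGreaterThan(24, 18, 60, 1000000L)
```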
2. Cost estimation
Having covered how original-table statistics are collected and how intermediate-operator statistics are computed, we now have the information needed to calculate the cost of each node.
Before showing how node cost is calculated, we first introduce the cost parameters that appear below (following the design document linked above): Tr(R) is the number of tuples in relation R, Tsz(R) is the average tuple size of R, CPUc is the CPU cost of one comparison, Hr is the cost of reading one byte from HDFS, and NEt is the cost of transferring one byte over the network.
Node cost is evaluated along two dimensions, IO and CPU, and each operator has its own cost calculation rules. We use the join operator to illustrate how an operator's cost is computed.
Suppose the join is a Broadcast Join and the large table is distributed across n nodes. The CPU cost and IO cost are then calculated as follows:
CPU cost = cost of building a hash table from the small table + cost of probing it with the large table = Tr(Rsmall) * CPUc + (Tr(R1) + Tr(R2) + … + Tr(Rn)) * n * CPUc
IO cost = cost of reading the small table + cost of broadcasting the small table + cost of reading the large table = Tr(Rsmall) * Tsz(Rsmall) * Hr + n * Tr(Rsmall) * Tsz(Rsmall) * NEt + (Tr(R1) * Tsz(R1) + … + Tr(Rn) * Tsz(Rn)) * Hr
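To make the formulas concrete, here is a sketch that transcribes them directly into code (the Rel type, parameter names, and any values passed in are illustrative stand-ins, not Spark internals):

```scala
// Tr(R) maps to rows, Tsz(R) to avgRowSize; cpuC, hr, nEt are the unit costs.
case class Rel(rows: Long, avgRowSize: Double)

def broadcastJoinCost(small: Rel, largeParts: Seq[Rel],
                      cpuC: Double, hr: Double, nEt: Double): (Double, Double) = {
  val n = largeParts.size
  // CPU: build a hash table from the small table, then probe with the large table.
  val cpu = small.rows * cpuC + largeParts.map(_.rows).sum * n * cpuC
  // IO: read the small table, broadcast it to n nodes, read the large table.
  val io = small.rows * small.avgRowSize * hr +
    n * small.rows * small.avgRowSize * nEt +
    largeParts.map(r => r.rows * r.avgRowSize).sum * hr
  (cpu, io)
}
```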
Whatever the operator, its cost calculation is directly tied to factors such as the total number of rows and the average row size involved. This is exactly why we first had to introduce how original-table information and intermediate-operator statistics are computed.
Each operator calculates its cost according to its defined rules, and the sum of all operator costs is the total cost of the entire execution plan. A question worth considering here: is the optimal execution plan found by enumerating the total cost of every possible plan one by one? Obviously not. Computing the total cost of every plan would take far too long. Spark instead makes clever use of dynamic programming to arrive at the best execution plan quickly.
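The following sketch shows the dynamic-programming idea in miniature: first solve the cheapest plan for each individual table, then build the best plan for ever-larger table sets out of the best plans for their two halves. All names and both cost functions are hypothetical simplifications, not Spark's actual internals:

```scala
import scala.collection.mutable

// A plan covers a set of tables at some estimated cost.
case class Plan(tables: Set[String], cost: Double)

def bestPlan(tables: Set[String],
             scanCost: String => Double,
             joinCost: (Plan, Plan) => Double): Plan = {
  val best = mutable.Map[Set[String], Plan]()
  // Base case: the best plan for a single table is a scan of that table.
  for (t <- tables) best(Set(t)) = Plan(Set(t), scanCost(t))
  // Build up: the best plan for a set combines the best plans of two halves.
  for (size <- 2 to tables.size; subset <- tables.subsets(size)) {
    val candidates = subset.subsets().toList
      .filter(left => left.nonEmpty && left != subset)
      .flatMap { left =>
        for (lp <- best.get(left); rp <- best.get(subset -- left))
          yield Plan(subset, lp.cost + rp.cost + joinCost(lp, rp))
      }
    best(subset) = candidates.minBy(_.cost)
  }
  best(tables)
}
```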
III. DTStack's Exploration of Spark SQL CBO
Having understood how Spark SQL CBO is implemented, consider the first question: what does a big-data platform need to do in order to support Spark SQL CBO optimization?
As mentioned in the implementation section, Spark SQL CBO proceeds in two steps: collecting statistics, then estimating cost. Statistics collection itself has two steps: original-table statistics and intermediate-operator statistics. Since the intermediate-operator statistics are derived by Spark at optimization time from the original-table statistics, we have the answer to the first question: the platform needs to provide original-table statistics collection.
With the first question answered, the second follows: when is the right time to collect table statistics? For this question, we tentatively conceived three statistics-collection schemes:
● Collect original-table statistics before every SQL query
The statistics obtained this way are the most accurate and the CBO-optimized execution plan is optimal, but the cost of collecting statistics is the highest.
● Refresh table statistics on a schedule
No statistics need to be collected before each SQL query. But because business data is updated at unpredictable times, the table statistics available at query time may not be the latest, so the CBO-optimized plan may not be optimal.
● Have the business party that changes the data trigger statistics collection
This scheme has the lowest statistics-collection cost and still guarantees an optimal CBO-optimized plan, but it is the most intrusive to business code.
Each of the three schemes has evident pros and cons, so the concrete choice for table statistics depends on the architecture of the platform itself.
The architecture for building a data warehouse on the DTStack platform is shown in the figure below:
As the architecture diagram shows, DTStack uses three components, Hive, Spark, and ChunJun, all of which can read and write Hive; in addition, several DTStack sub-products (such as the offline platform and the real-time platform) may also read and write Hive. Implementing scheme 3 in this setting would therefore be very costly.
Scheme 1 is inherently expensive: statistics are collected before every query, and that collection time counts toward the query time. For large tables, the added time can reach ten minutes or more.
Weighing everything, we chose the more flexible and reasonable scheme 2 for table statistics. Although the statistics Spark SQL sees at runtime may not be the latest, there is still, in the general case, a large performance improvement over RBO.
Here is how DTStack collects original-table statistics:
We added a table-statistics feature to the project-management page of the offline platform, so that each project can configure a trigger policy suited to its own situation. The policy can be set to fire daily or hourly, and daily triggers can be configured to start at a chosen time of day to avoid business peaks. Once configured, the offline platform automatically submits one Spark task per project at trigger time to collect that project's table statistics.
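Conceptually, such a submitted task boils down to something like the following sketch (the table list and the loop are hypothetical illustrations; DTStack's actual task is internal, and an active SparkSession named spark is assumed):

```scala
// Refresh table-level statistics for every table registered to a project.
val projectTables = Seq("project_db.orders", "project_db.users") // hypothetical
for (table <- projectTables) {
  spark.sql(s"ANALYZE TABLE $table COMPUTE STATISTICS")
  // Column-level statistics could be refreshed here as well if needed.
}
```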
Before DTStack implemented CBO support, Spark SQL could only be tuned by adjusting Spark's own parameters. That kind of tuning has a high entry barrier: users need to be familiar with how Spark works. DTStack's introduction of CBO greatly lowers the learning threshold. Users only need to enable CBO in the Spark conf:
spark.sql.cbo.enabled=true
and then configure table statistics collection in the corresponding project, and their SQL is optimized.
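For instance (a sketch using standard Spark SQL configuration keys; the join-reorder and histogram switches are optional extras, not requirements stated above):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cbo-demo")
  .config("spark.sql.cbo.enabled", "true")                  // turn CBO on
  .config("spark.sql.cbo.joinReorder.enabled", "true")      // optional: cost-based join reordering
  .config("spark.sql.statistics.histogram.enabled", "true") // optional: histograms (see above)
  .getOrCreate()
```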
IV. Outlook
Through continuous research on CBO optimization, Spark SQL with CBO now performs far better overall than with RBO alone. But that does not mean the whole system has no room left for optimization; the progress made so far only encourages us to continue exploring and to try to take another step forward.
After completing this initial exploration of CBO support, DTStack set its sights on AQE (Adaptive Query Execution), a new feature introduced in Spark 3.0.
AQE is a dynamic form of CBO optimization: another performance leap in SQL optimization built on top of CBO. As mentioned before, CBO's calculations still depend on statistics collected from the original tables, and stale statistics can have no small impact on CBO.
If the SQL execution plan is instead optimized dynamically at runtime, there is no need to collect table statistics in advance as CBO requires. DTStack is working on this new feature, and we believe the introduction of AQE will raise the platform to a higher level of ease of use and performance. Stay tuned; DTStack looks forward to growing together with you.
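For reference, AQE is switched on with a single standard configuration key in Spark 3.x (a sketch, assuming an active SparkSession named spark):

```scala
// Enable Adaptive Query Execution so plans are re-optimized at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
```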
Source: WeChat official account "DTStack Study Club"
Kangaroo Cloud open-source framework DingTalk technology exchange group (
): students interested in big-data open-source projects are welcome to join us to exchange the latest technical information. Open-source project repository: https://github.com/DTStack