This is my participation 「 The Fourth Youth Training Camp 」 The third part of note creation activities 1 God

One 、 Big data system and SQL

1、SQL Process flow

1.1 Parser

●String -> AST (Abstruct Syntax Tree)：

Lexical analysis : Split string , Get the key words 、 Numerical constants 、 String constant 、 Operation symbols, etc token
Syntax analysis : take token form ASTnode, And finally get one AST

● Realization ： Recursive descent (ClickHouse),Flex and Bison (PostgreSQL),JavaCC (Flink),Antlr (Presto,Spark)

1.2 Analyzer and Logical Plan

● Analyzer：

Check and bind Database, Table, Column Meta information
SQL Check the legitimacy of , such as min/max/avg The input of is a value
AST -> Logical Plan

● Logical Plan：

Describe logically SQL Corresponding step-by-step calculation operation
Calculation operation : operator ( operator )

1.3 Physical Plan and Executor

● Physical Plan: Execution plan subtree

The goal is : Minimize network data transmission
Use the physical distribution of the data on ( Data affinity )
increase Shuffle operator

● Executor

Stand alone in parallel : cache,pipeline, SIMD
Multiprocessor parallel : One fragment Corresponding to multiple instances

1.4 summary

One SQL rules big data all
SQL You need to go through Parser,Analyzer,Optimizer and Executor To deal with
Query optimizer is the brain of database , In the big data scenario, it is very important for query performance
The query optimizer needs to be aware of data distribution , Make full use of data affinity
The query optimizer splits the logical plan into multiple physical plan segments according to the goal of minimizing network data transmission

Two 、 Common query optimizers

2.1 Query optimizer classification

2.2 RBO（Rule-based optimizer）

2.2.1 Relational algebra

Operator ：Select Project Join Rename Union
Equivalent transformation ： Associative law 、 Commutative law 、 Transitivity

2.2.2 Optimization principle

2.2.3 RBO- Column cut

Scan the required columns in the table , Not all of them

2.2.4 RBO- Predicate push-down

where The expression of is a predicate . Predicates filter data as soon as possible , To reduce overhead 2（ Conditions ：join yes inter）

2.2.5 RBO- Pass closures

According to the expression equivalence , Filter conditions , A new filtering condition is derived

2.2.6 RBO-Runtime Filter

To a join If you can filter unnecessary data in advance at the query end , Can reduce overhead

min-max The shortcomings of ： The scope must be close
in-list： Just scan in-list The data in . shortcoming ： When there are many sets ,in-list Also great
bloom filter： characteristic ： The size does not change with the size of the set , Fixed size , Give a number to judge whether it is

2.2.7 summary

Main stream RBO Generally, there are hundreds of optimization rules based on experience induction
advantage : Implement a simple , Fast optimization speed
shortcoming : There is no guarantee of an optimal execution plan

2.3 CBO（Cost-based optimizer）

2.3.1 CBO- Concept

△ Use a model to estimate the cost of implementing the plan , Choose the least expensive execution plan

The cost of executing the plan is equal to the sum of the execution costs of all operators
adopt RBO obtain ( all ) Possible equivalent execution plans

△ Operator cost :CPU, Memory , disk IO, The Internet I/O Equal price

Statistics + Derivation rules → Calculate the operator cost → Calculate the cost of executing the plan → Execution plan enumeration

2.3.2 CBO- Statistics

Raw table statistics

Table or partition level : Row number 、 Average row size 、 How many bytes does the table occupy in the disk
Column level : min、max、num nulls、num not nulls、num distinct value(NDV)、histogram etc.

Derive statistics

Selection rate ( selecthwty)： For a certain filter condition, how much data will be returned from the table
base ( careinality )： In query plan, it often refers to the number of rows that the operator needs to process

2.3.2.1 CBO- How to collect statistical information

stay DDL Specify the statistical information to be collected , The database collects or updates statistics when data is written
CREATE TABLE REGION( R_ REGIONKEY INT NOT NULL, R NAME CHAR(25) NOT NULL, R_ COMMENT VARCHAR(152) ) DUPLICATE KEY(R_ REGIONKEY) DISTRIBUTED BY HASH(R_ REGIONKEY) BUCKETS 1 PROPERTIES (" sotumnselelR w");
Do it manually explain analyze statement, Start the database to collect or update statistical information

ANALYZE TABLE table_name COMPUTE STATISICS FOR COLUMNS column-name1,column-name2....

Dynamic sampling

SELECT count(*) FROM table_name

2.3.2.2 CBO- Statistical information derivation rules

Filter Selectivity
- AND Conditions :fs(a AND b)=fs(a)* fs(b)
- OR Conditions : fs(a OR b) = fs(a) + fs(b) - (fs(a) * fs(b))
- NOT Conditions : fs(NOT a)= 1.0 - fs(a)
- It's equal to the condition (x = literal )
  - literal < min && literal > max : 0
  - 1/NDV
Less than condition (x < literal )
- literal<min:0
- literal>max:1
- (literal-min)/(max-min)

2.3.3 CBO- Execution plan enumeration

Single table scan : An index scan ( Random I/O) vs Full table scan ( The order IO)
- If the data queried is very unevenly distributed , Index scanning may not be as good as full table scanning
Join The implementation of the : Hash Join Vs. SortMerge Join
The two tables Hash Join : How to identify small tables by building hash tables with small tables ?
Multiple tables Join :
- Which connection sequence is the best ?
- Whether to explore each combination ?
N A table joins , just left-deep tree It's almost N! A connection sequence
- e.g. N= 10-> in total 3, 628, 800 Connection order

2.3.4 CBO Summary

CBO Use cost models and statistics to estimate the cost of implementing the plan
CBO Use greedy or dynamic programming algorithm to find the optimal execution plan
In the big data scenario CBO It is very important for query performance

2.4 Summary

Main stream RBO Realization - Generally, there are hundreds of optimization rules based on experience induction
RBO Implement a simple , Fast optimization speed
RBO There is no guarantee of an optimal execution plan
CBO Use cost models and statistics to estimate the cost of implementing the plan
CBO Use greedy or dynamic programming algorithm to find the optimal execution plan
In the big data scenario CBO It is very important for query performance

3、 ... and 、 Community open source practice

3.1 Apache Calcite overview

One size fitsall： A unified SQL Query engine
modularization , pluggable , Stable and reliable
Support heterogeneous data models
- Relational type
- Semi structured
- streaming
- Geospatial data
built-in RBO and CBO

3.1.1 Calcite RBO

HepPlanner

Optimize the rules (Rule)
- Pattern : Match expression subtree
- Equivalent transformation : Get a new expression
Built in 100+ Optimize the rules
Four matching rules
- ARBITRARY/DEPTH FIRST : Depth first
- TOP DOWN : Topological order
- BOTTOM_ UP : And TOP_ DOWN contrary
Traverse all rule , Until there is no rule Can be triggered
Fast optimization speed , Implement a simple , But there is no guarantee of optimality

3.1.2 Calcite CBO

VolcanoPlanner

be based on Wolcano/Cascade frame
The cost optimal Hypothesis
Memo : Store candidate execution plans
- Group : Equivalent plan set
Top-down Dynamic planning search
application Rule Search for candidate plans
Memo
- The essence : AND/OR graph
- Sharing subtrees reduces memory overhead
Group winner: The best plan at present
prune ： Reduce search space
Top-down Traverse ： choice winner Build the best execution plan

3.1.3 Summary

Mainstream query optimizers include RBO and CBO
Apache Calcite It is a very popular query optimizer in the field of big data
Apache Calcite RBO Many optimization rules are defined , Use pattern Matching subtree , Perform equivalent transformation
Apache Calcite CBO be based on Volcano/Cascade frame
Volcano/Cascade The essence of : Memo、 Dynamic programming 、 prune

Four 、 Cutting edge trends

4.1 AI4DB

Self configuration
- Intelligent tuning ( OtterTune , QTune )
- Load forecasting / Dispatch
Self diagnosis and self healing : Error recovery and migration
Self optimization :
- Statistical information estimation ( Learned cardinalities )
- Cost estimate
- Learning optimizer ( IBM DB2 LEQ )
- Indexes / View recommendation

4.2 DB4AI

Embedded artificial intelligence algorithm ( MLSQL,SOLFlow )
Embedded machine learning framework ( SparkML,Alink,dl-on-fink )

4.3 Summary

Big data entrepreneurship is in full swing , SQL Query optimizer is still an essential and important component
Evolution of engine architecture 、 Cloud native 、 Lake warehouse integration, etc SQL Query optimizer has new requirements and challenges
AI The blessing , Learning query optimizers are evolving