当前位置:网站首页>SQL optimizer parsing | youth training camp notes
SQL optimizer parsing | youth training camp notes
2022-07-25 18:04:00 【Red bean milk tea with less ice and half sugar】
This is my participation 「 The Fourth Youth Training Camp 」 The third part of note creation activities 1 God
One 、 Big data system and SQL
1、SQL Process flow
1.1 Parser
●String -> AST (Abstruct Syntax Tree):
- Lexical analysis : Split string , Get the key words 、 Numerical constants 、 String constant 、 Operation symbols, etc token
- Syntax analysis : take token form ASTnode, And finally get one AST
● Realization : Recursive descent (ClickHouse),Flex and Bison (PostgreSQL),JavaCC (Flink),Antlr (Presto,Spark)
1.2 Analyzer and Logical Plan
● Analyzer:
- Check and bind Database, Table, Column Meta information
- SQL Check the legitimacy of , such as min/max/avg The input of is a value
- AST -> Logical Plan
● Logical Plan:
- Describe logically SQL Corresponding step-by-step calculation operation
- Calculation operation : operator ( operator )
1.3 Physical Plan and Executor
● Physical Plan: Execution plan subtree
- The goal is : Minimize network data transmission
- Use the physical distribution of the data on ( Data affinity )
- increase Shuffle operator
● Executor
- Stand alone in parallel : cache,pipeline, SIMD
- Multiprocessor parallel : One fragment Corresponding to multiple instances
1.4 summary
One SQL rules big data all
SQL You need to go through Parser,Analyzer,Optimizer and Executor To deal with
Query optimizer is the brain of database , In the big data scenario, it is very important for query performance
The query optimizer needs to be aware of data distribution , Make full use of data affinity
The query optimizer splits the logical plan into multiple physical plan segments according to the goal of minimizing network data transmission
Two 、 Common query optimizers
2.1 Query optimizer classification
2.2 RBO(Rule-based optimizer)
2.2.1 Relational algebra
- Operator :Select Project Join Rename Union
- Equivalent transformation : Associative law 、 Commutative law 、 Transitivity
2.2.2 Optimization principle
2.2.3 RBO- Column cut
- Scan the required columns in the table , Not all of them
2.2.4 RBO- Predicate push-down
- where The expression of is a predicate . Predicates filter data as soon as possible , To reduce overhead 2( Conditions :join yes inter)
2.2.5 RBO- Pass closures
- According to the expression equivalence , Filter conditions , A new filtering condition is derived
2.2.6 RBO-Runtime Filter
To a join If you can filter unnecessary data in advance at the query end , Can reduce overhead
- min-max The shortcomings of : The scope must be close
- in-list: Just scan in-list The data in . shortcoming : When there are many sets ,in-list Also great
- bloom filter: characteristic : The size does not change with the size of the set , Fixed size , Give a number to judge whether it is
2.2.7 summary
- Main stream RBO Generally, there are hundreds of optimization rules based on experience induction
- advantage : Implement a simple , Fast optimization speed
- shortcoming : There is no guarantee of an optimal execution plan
2.3 CBO(Cost-based optimizer)
2.3.1 CBO- Concept
△ Use a model to estimate the cost of implementing the plan , Choose the least expensive execution plan
- The cost of executing the plan is equal to the sum of the execution costs of all operators
- adopt RBO obtain ( all ) Possible equivalent execution plans
△ Operator cost :CPU, Memory , disk IO, The Internet I/O Equal price
Statistics + Derivation rules → Calculate the operator cost → Calculate the cost of executing the plan → Execution plan enumeration
2.3.2 CBO- Statistics
Raw table statistics
- Table or partition level : Row number 、 Average row size 、 How many bytes does the table occupy in the disk
- Column level : min、max、num nulls、num not nulls、num distinct value(NDV)、histogram etc.
Derive statistics
- Selection rate ( selecthwty): For a certain filter condition, how much data will be returned from the table
- base ( careinality ): In query plan, it often refers to the number of rows that the operator needs to process
2.3.2.1 CBO- How to collect statistical information
stay DDL Specify the statistical information to be collected , The database collects or updates statistics when data is written
CREATE TABLE REGION( R_ REGIONKEY INT NOT NULL, R NAME CHAR(25) NOT NULL, R_ COMMENT VARCHAR(152) ) DUPLICATE KEY(R_ REGIONKEY) DISTRIBUTED BY HASH(R_ REGIONKEY) BUCKETS 1 PROPERTIES (" sotumnselelR w");
Do it manually explain analyze statement, Start the database to collect or update statistical information
ANALYZE TABLE table_name COMPUTE STATISICS FOR COLUMNS column-name1,column-name2....
- Dynamic sampling
SELECT count(*) FROM table_name
2.3.2.2 CBO- Statistical information derivation rules
- Filter Selectivity
- AND Conditions :fs(a AND b)=fs(a)* fs(b)
- OR Conditions : fs(a OR b) = fs(a) + fs(b) - (fs(a) * fs(b))
- NOT Conditions : fs(NOT a)= 1.0 - fs(a)
- It's equal to the condition (x = literal )
- literal < min && literal > max : 0
- 1/NDV
- Less than condition (x < literal )
- literal<min:0
- literal>max:1
- (literal-min)/(max-min)
2.3.3 CBO- Execution plan enumeration
- Single table scan : An index scan ( Random I/O) vs Full table scan ( The order IO)
- If the data queried is very unevenly distributed , Index scanning may not be as good as full table scanning
- Join The implementation of the : Hash Join Vs. SortMerge Join
- The two tables Hash Join : How to identify small tables by building hash tables with small tables ?
- Multiple tables Join :
- Which connection sequence is the best ?
- Whether to explore each combination ?
- N A table joins , just left-deep tree It's almost N! A connection sequence
- e.g. N= 10-> in total 3, 628, 800 Connection order
2.3.4 CBO Summary
- CBO Use cost models and statistics to estimate the cost of implementing the plan
- CBO Use greedy or dynamic programming algorithm to find the optimal execution plan
- In the big data scenario CBO It is very important for query performance
2.4 Summary
- Main stream RBO Realization - Generally, there are hundreds of optimization rules based on experience induction
- RBO Implement a simple , Fast optimization speed
- RBO There is no guarantee of an optimal execution plan
- CBO Use cost models and statistics to estimate the cost of implementing the plan
- CBO Use greedy or dynamic programming algorithm to find the optimal execution plan
- In the big data scenario CBO It is very important for query performance
3、 ... and 、 Community open source practice
3.1 Apache Calcite overview
- One size fitsall: A unified SQL Query engine
- modularization , pluggable , Stable and reliable
- Support heterogeneous data models
- Relational type
- Semi structured
- streaming
- Geospatial data
- built-in RBO and CBO
3.1.1 Calcite RBO
HepPlanner
- Optimize the rules (Rule)
- Pattern : Match expression subtree
- Equivalent transformation : Get a new expression
- Built in 100+ Optimize the rules
- Four matching rules
- ARBITRARY/DEPTH FIRST : Depth first
- TOP DOWN : Topological order
- BOTTOM_ UP : And TOP_ DOWN contrary
- Traverse all rule , Until there is no rule Can be triggered
- Fast optimization speed , Implement a simple , But there is no guarantee of optimality
3.1.2 Calcite CBO
VolcanoPlanner
- be based on Wolcano/Cascade frame
- The cost optimal Hypothesis
- Memo : Store candidate execution plans
- Group : Equivalent plan set
- Top-down Dynamic planning search
- application Rule Search for candidate plans
- Memo
- The essence : AND/OR graph
- Sharing subtrees reduces memory overhead
- Group winner: The best plan at present
- prune : Reduce search space
- Top-down Traverse : choice winner Build the best execution plan
3.1.3 Summary
- Mainstream query optimizers include RBO and CBO
- Apache Calcite It is a very popular query optimizer in the field of big data
- Apache Calcite RBO Many optimization rules are defined , Use pattern Matching subtree , Perform equivalent transformation
- Apache Calcite CBO be based on Volcano/Cascade frame
- Volcano/Cascade The essence of : Memo、 Dynamic programming 、 prune
Four 、 Cutting edge trends
4.1 AI4DB
- Self configuration
- Intelligent tuning ( OtterTune , QTune )
- Load forecasting / Dispatch
- Self diagnosis and self healing : Error recovery and migration
- Self optimization :
- Statistical information estimation ( Learned cardinalities )
- Cost estimate
- Learning optimizer ( IBM DB2 LEQ )
- Indexes / View recommendation
4.2 DB4AI
- Embedded artificial intelligence algorithm ( MLSQL,SOLFlow )
- Embedded machine learning framework ( SparkML,Alink,dl-on-fink )
4.3 Summary
- Big data entrepreneurship is in full swing , SQL Query optimizer is still an essential and important component
- Evolution of engine architecture 、 Cloud native 、 Lake warehouse integration, etc SQL Query optimizer has new requirements and challenges
- AI The blessing , Learning query optimizers are evolving
5、 ... and 、 summary
边栏推荐
- Unity 贝塞尔曲线的创建
- WPF implements user avatar selector
- Drawing PDF tables (I) drawing excel table styles in PDF and downloading them through iText (supporting Chinese fonts)
- C语言 cJSON库的使用
- Nineteen year old summary
- List转换问题
- Introduction to cloud XR and development opportunities of cloud XR in 5g Era
- imx6 RTL8189FTV移植
- Auditing相关注解
- Auditing related notes
猜你喜欢

Landmark buildings around the world

Update 3dcat real time cloud rendering V2.1.2 release

云流化和云桌面有什么关系

"Digital security" alert NFT's seven Scams

Installation steps and usage of NVM under windows10 system

TME2022校园招聘后台开发/运营开发/业务运维/应用开发笔试(I)编程题的一点自我分析

What are the advantages of real-time cloud rendering
![[MySQL]数据库中的索引为什么是用B+树来实现? 哈希表/红黑树/B树是否可行呢?](/img/1f/a2d50ec6bc97d52c1e7566a42e564b.png)
[MySQL]数据库中的索引为什么是用B+树来实现? 哈希表/红黑树/B树是否可行呢?

Cloud XR面临的问题以及Cloud XR主要应用场景

STM8S003F3 内部flash调试
随机推荐
Landmark buildings around the world
有没有什么不起眼却挣钱的副业?
How many points did NPDP pass? How to pass with high scores?
CH582 BLE 5.0 使用 LE Coded 广播和连接
专访即构科技李凯:音视频的有趣、行业前沿一直吸引着我
"Digital security" alert NFT's seven Scams
MySQL optimistic lock
国际权威认可!OceanBase入选Forrester Translytical数据平台报告
排序还需要了解的信息以及链表
Redis source code and design analysis -- 16. AOF persistence mechanism
大话DevOps监控,团队如何选择监控工具?
OV7725 yuv 640*[email protected] 配置文件
Sequential storage structure, chain storage structure and implementation of stack
Wu Enda's machine learning programming operation cannot be suspended pause problem solved
P2P 之 UDP穿透NAT的原理与实现
十九岁的总结
Tme2022 campus recruitment background development / operation development / business operation and maintenance / application development written examination (I) a little self analysis of programming q
testng执行顺序的3中控制方法
[MySQL]数据库中的索引为什么是用B+树来实现? 哈希表/红黑树/B树是否可行呢?
SDLC software development life cycle and model