当前位置:网站首页>Cloud native database query optimization - statistics and row count estimation
Cloud native database query optimization - statistics and row count estimation
2022-06-29 21:41:00 【Gauss squirrel Club】
Catalog
SQL The engine performs query mainly through lexical and grammatical parsing 、 Query rewriting 、 Query Planning and plan execution . among , In the process of Query Planning , To generate an executable optimal plan , First, generate the path , And because of the diversity of paths , Therefore, the path needs to be eliminated . At present, the path selection of the optimizer is mainly based on the estimated cost , So this kind of optimizer is also called cost based optimizer (Cost Based Optimization, CBO). Relative to logical optimization , This optimization method is physical optimization : According to the distribution of data ( Statistics ) To evaluate the query execution path , Select a path with the least execution cost from the optional paths to execute , For example, whether to select an index SeqScan vs. IndexScan, Choose which index , What kind of connection order is selected for the association of two tables , What specific algorithm to choose .
When estimating the cost , The number of rows that need to use the base table or the join table , And a lot of the time , The optimizer cannot get the exact row value , Therefore, the number of rows needs to be estimated (Cardinality Estimation), Then calculate the cost .
Statistics
Statistics are the basis for physical optimization , Statistics from table information . The characteristics describing the base table data include unique values 、MCV(Most Common Value) It's worth waiting for , For line count estimation .
Table-Level Table level statistics , Stored in the system table pg_class.
relptuples Total tuples : Describes the number of tuples corresponding to the table .
relpages Total pages : The number of disk pages corresponding to the description table .
Column-Level Column level statistics , Stored in the system table pg_statistics, You can also use views pg_stats View the data .
Starelid: Tabular oid.
Staattnum: Table attribute number .
stadistinct: It is used to describe the only non - in the field NULL Number of data values , It is generally used to estimate the size of a set after grouping ,Join Result set size .
stanullfrac: Used to describe... In the current column NULL The percentage of value in the total .
Attribute group {stakind1, stanumbers1, stavalues1} constitute PG_STATISTIC A card slot in the watch , stay PG_STATISTIC Table has 5 Slots . In general , The first card slot stores MCV(Most Common Value) Information : Describes a set of values that occur more frequently than a certain percentage , Sort according to the frequency of occurrence , Usually used to indicate which values are skewed . The second card slot stores Histogram Histogram information , Describe except NULL value 、MCV The distribution of values other than , It is generally used to estimate the selection rate .
With MCV Card slot as an example attribute “stakind1” The type of identification card slot is MCV, among “1” by “STATISTIC_KIND_MCV” The enumerated values ; attribute stanumbers1 And properties stavalues1 Record MCV Specific content of , among stavalues1 Record key value ,stanumbers1 Record key Corresponding frequency .
The system tables pg_statistics The definition of is in the file pg_statistic.h in .
#define STATISTIC_KIND_MCV 1
#define STATISTIC_KIND_HISTOGRAM 2
#define STATISTIC_KIND_CORRELATION 3
#define STATISTIC_KIND_MCELEM 4
#define STATISTIC_KIND_DECHIST 5Statistics are provided through analyze Command to get .



surface tt Of oid by 40960, Yes 10000 Row data occupied 345 individual pages page . The first 1 Column unique1 The distribution of can be obtained from histogram information , Histogram has 100 Intervals , And there are no null values and MCV. The first 16 Column string4 The distribution of can be determined by MCV information acquisition , This column has 4 individual distinct value ”AAAAxx” ,”HHHHxx” , “OOOOxx” , “VVVVxx” ,4 The distribution frequency of each value has 0.25.
Row count estimation
Line count estimation is the basis of cost estimation , Extrapolation from base table statistics , Estimate base table baserel、Join Intermediate result set joinrel、Aggregation Result set size in , Prepare for cost estimation .
SQL Queries often have where constraint ( Filter conditions ), such as SELECT * FROM tt WHERE string4 = 'AAAAxx'. Knowing the selection rate of constraints , That is, we know the proportion of the results to be scanned through the scanning path or the proportion of tuples obtained through the connection operation , From this ratio, we can calculate the number of intermediate results and final results , These quantities are then used to calculate the cost .
Here we focus on the simple query of the base table —— be based on OpExpr Type selection rate calculation , The processing function is in the clause_selectivity. If it is a filter condition, call restriction_selectivity Function to get OpExpr Selection rate of expression , If it is a connection condition, call join_selectivity Function to get the selection rate .
SELECT * FROM tt WHERE string4 = 'AAAAxx' For filter conditions , call restriction_selectivity Estimate the selection rate .



restriction_selectivity The function recognizes string4 = 'AAAAxx' Is shaped like Var = Const Equivalence constraint , The constraint selective evaluation function of the operator is stored in the system table PG_OPERATOR,opno = 93 The corresponding selection rate calculation function is eqsel, adopt eqsel Function call var_eq_const Function to estimate the selection rate . In the process ,var_eq_const The function reads PG_STATISTIC In the table string4 Column distribution information , And make use of MCV The selection rate for direct return of information is 0.25.

function set_baserel_size_estimates Calculate the estimated number of rows .

Function call relationship :standard_planner-> subquery_planner-> grouping_planner-> query_planner-> make_one_rel-> set_base_rel_sizes-> set_rel_size-> set_plain_rel_size-> set_baserel_size_estimates-> clauselist_selectivity-> clause_selectivity-> restriction_selectivity-> OidFunctionCall4Coll-> eqsel->var_eq_const
边栏推荐
- Reinforcement learning weekly (issue 51): integration of PAC, ilql, RRL & model free reinforcement learning into micro grid control: overview and Enlightenment
- 唯品会关键词搜索API接口(item_search-按关键字搜索唯品会商品API接口),唯品会API接口
- Water polo chart - using dynamic ripples to show percentages
- Verilog realizes serial communication and sends it to the nixie tube
- The foundation and application of quantum machine learning: a concise literature review
- 什么是 SYN 洪水攻击?如何防护?
- MES系统与ERP如何集成?本文告诉你答案
- STL tutorial 6-deque, stack, queue, list container
- Realization of graduation project topic selection system based on JSP
- 铝板AS/NZS 1530.1 不燃性材料的阻燃测试
猜你喜欢

Verilog realizes serial communication and sends it to the nixie tube

什么是 SYN 洪水攻击?如何防护?

Implementation and Simulation of ads131a04 ADC Verilog

Threejs basic introduction

Recruit | DBA Data Engineer every week with an annual salary of 35+. Dream of Kyushu and bright stars!
【云原生实战】KubeSphere实战——多租户系统实战

How can colleges and universities build future oriented smart campus based on cloud native? Full stack cloud native vs traditional technology architecture

leetcode:724. Find the central subscript of the array
![Navigation experiment [microcomputer principle] [experiment]](/img/79/8311a409113331e72f650a83351b46.png)
Navigation experiment [microcomputer principle] [experiment]

铝板AS/NZS 1530.1 不燃性材料的阻燃测试
随机推荐
How do new shareholders open accounts online? Is it safe to open an account online?
String字符串的存储原理
A. Print a Pedestal (Codeforces logo?)
How to judge the quality of conductive slip ring from its appearance
Getting started with completabilefuture
Viewing technological changes through Huawei Corps (V): smart Park
PostgreSQL每周新闻—6月22日
Basic qualities of management personnel
BUAA OO unit 4 HW16 unit 4 Summary and course review
透过华为军团看科技之变(五):智慧园区
铝板AS/NZS 1530.1 不燃性材料的阻燃测试
小型圖書館項目總結
[cloud native practice] kubesphere practice - Multi tenant system practice
Report delivery engineer
LeetCode 1. Sum of two numbers
STL tutorial 6-deque, stack, queue, list container
Golang operation etcd
Layer 3 loop brought by route Summary - solution experiment
Water polo chart - using dynamic ripples to show percentages
Realize inotify and Rsync real-time backup