openGauss kernel analysis - statistics and row count estimation
2022-07-27 11:12:00 【Gauss squirrel Club】
The SQL engine executes a query in four main stages: lexical and syntactic parsing, query rewriting, query planning, and plan execution. During query planning, the optimizer first generates candidate paths in order to produce an optimal executable plan; because many different paths are possible, inferior ones have to be pruned. The optimizer currently chooses among paths mainly by estimated cost, so it is also called a cost-based optimizer (Cost Based Optimization, CBO). In contrast to logical optimization, this is physical optimization: the optimizer uses the data distribution (statistics) to evaluate the candidate execution paths and picks the one with the lowest execution cost. Typical decisions include whether to scan sequentially or through an index (SeqScan vs. IndexScan), which index to use, in what order to join two tables, and which concrete algorithm to use.
Cost estimation needs the number of rows in each base table or joined relation. The optimizer often cannot obtain the exact row count, so it first estimates the number of rows (cardinality estimation) and then calculates the cost.
Statistics
Statistics are the basis of physical optimization and are derived from the table data. The characteristics that describe base-table data include the number of distinct values, the MCVs (most common values), and so on; they are used for row count estimation.
Table-level statistics are stored in the system table pg_class.
reltuples: the total number of tuples in the table.
relpages: the total number of disk pages the table occupies.
Column-level statistics are stored in the system table pg_statistic and can also be viewed through the pg_stats view.
starelid: the OID of the table.
staattnum: the attribute (column) number within the table.
stadistinct: the number of distinct non-NULL values in the column. It is generally used to estimate the size of the result set after grouping and the size of a join result.
stanullfrac: the fraction of values in the column that are NULL.
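In PostgreSQL, whose catalog layout openGauss inherits, stadistinct is stored with a sign convention: a positive value is the distinct count itself, a negative value is minus the ratio of distinct values to rows, and 0 means the count is unknown. The sketch below converts a stored value into an absolute count, assuming openGauss keeps that convention. The function name distinct_count and the sample numbers are illustrative and do not come from the kernel.

```python
def distinct_count(stadistinct: float, reltuples: float) -> float:
    """Turn a stored stadistinct value into an absolute number of distinct values."""
    if stadistinct > 0:
        # Positive: the value is the distinct count itself.
        return stadistinct
    if stadistinct < 0:
        # Negative: the magnitude is the ratio of distinct values to rows.
        return -stadistinct * reltuples
    # Zero: the number of distinct values is unknown.
    raise ValueError("number of distinct values is unknown")


# Illustrative inputs for a table of 10000 rows.
print(distinct_count(4, 10000))    # a column with 4 distinct values
print(distinct_count(-1, 10000))   # a column in which every value is different
```

Storing a ratio for high-cardinality columns lets the estimate scale with the table as it grows between two statistics collections.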
The attribute group {stakind1, stanumbers1, stavalues1} forms one slot in the PG_STATISTIC table, which has five slots in total. In general, the first slot stores MCV (most common value) information: the set of values whose frequency exceeds a certain threshold, sorted by frequency, typically used to show which values the data is skewed toward. The second slot stores histogram information, which describes the distribution of the values other than NULLs and MCVs and is generally used to estimate selectivity.
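As a rough illustration of how a histogram yields a selectivity, the sketch below estimates the fraction of rows satisfying column < c from the bucket boundaries. It assumes an equi-depth histogram (every bucket holds the same share of rows) and a uniform spread inside each bucket. The function name lt_selectivity and the boundary values are hypothetical, and the kernel's own routine handles more cases than this.

```python
from bisect import bisect_right


def lt_selectivity(bounds, c):
    """Estimate the fraction of histogram-covered rows whose value is < c.

    bounds holds the sorted bucket boundaries of an equi-depth histogram,
    so the len(bounds) - 1 buckets each cover the same share of rows.
    """
    nbuckets = len(bounds) - 1
    if c <= bounds[0]:
        return 0.0
    if c >= bounds[-1]:
        return 1.0
    # Index of the bucket that contains c.
    i = bisect_right(bounds, c) - 1
    # Assume values are spread evenly inside that bucket.
    within = (c - bounds[i]) / (bounds[i + 1] - bounds[i])
    return (i + within) / nbuckets


# Hypothetical boundaries: 100 buckets covering the range 0..10000.
bounds = list(range(0, 10001, 100))
print(lt_selectivity(bounds, 250))
```

The result covers only the rows the histogram describes; because NULLs and MCVs are excluded from the histogram, a complete estimate would still scale this fraction by the share of rows that are neither.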
Taking the MCV slot as an example, the attribute stakind1 identifies the slot type as MCV, where the value 1 is the enumeration value STATISTIC_KIND_MCV. The attributes stanumbers1 and stavalues1 record the MCV contents: stavalues1 holds the key values and stanumbers1 holds the frequency of each key.
The system table pg_statistic is defined in the file pg_statistic.h.
#define STATISTIC_KIND_MCV 1
#define STATISTIC_KIND_HISTOGRAM 2
#define STATISTIC_KIND_CORRELATION 3
#define STATISTIC_KIND_MCELEM 4
#define STATISTIC_KIND_DECHIST 5
Statistics are collected with the ANALYZE command.



In this example, table tt has OID 40960 and holds 10000 rows of data that occupy 345 pages. The distribution of column 1, unique1, can be obtained from the histogram information: the histogram has 100 buckets, and the column has no NULL values and no MCVs. The distribution of column 16, string4, can be obtained from the MCV information: the column has 4 distinct values, "AAAAxx", "HHHHxx", "OOOOxx", and "VVVVxx", each with a frequency of 0.25.
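To make the slot layout concrete, the following sketch models one column's statistics row and fills in the string4 figures given above. The class names, the NUM_SLOTS constant, and the use of stakind 0 for an unused slot are assumptions made for illustration; the real row is a catalog tuple, not a Python object.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Slot kinds, as listed in pg_statistic.h.
STATISTIC_KIND_MCV = 1
STATISTIC_KIND_HISTOGRAM = 2
NUM_SLOTS = 5  # hypothetical constant name; the table has 5 slots


@dataclass
class Slot:
    stakind: int = 0  # 0 is assumed here to mark an unused slot
    stanumbers: List[float] = field(default_factory=list)
    stavalues: List[object] = field(default_factory=list)


@dataclass
class ColumnStats:
    starelid: int
    staattnum: int
    slots: List[Slot] = field(
        default_factory=lambda: [Slot() for _ in range(NUM_SLOTS)]
    )

    def find_slot(self, kind: int) -> Optional[Slot]:
        """Return the first slot of the requested kind, or None."""
        for slot in self.slots:
            if slot.stakind == kind:
                return slot
        return None


# Column 16 (string4) of table tt, OID 40960.
string4 = ColumnStats(starelid=40960, staattnum=16)
string4.slots[0] = Slot(
    stakind=STATISTIC_KIND_MCV,
    stavalues=["AAAAxx", "HHHHxx", "OOOOxx", "VVVVxx"],
    stanumbers=[0.25, 0.25, 0.25, 0.25],
)

mcv = string4.find_slot(STATISTIC_KIND_MCV)
print(dict(zip(mcv.stavalues, mcv.stanumbers)))
# This sketch stores no histogram slot for string4, so the lookup finds nothing.
print(string4.find_slot(STATISTIC_KIND_HISTOGRAM))
```

Looking a slot up by its kind rather than by its position mirrors the idea that stakind, not the slot number, says what a slot contains.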
Row count estimation
Row count estimation is the basis of cost estimation. Starting from base-table statistics, it estimates the result-set sizes of base tables (baserel), join intermediate results (joinrel), and aggregations, in preparation for cost estimation.
SQL queries often carry WHERE constraints (filter conditions), for example SELECT * FROM tt WHERE string4 = 'AAAAxx'. Once the selectivity of a constraint is known, we know what proportion of tuples a scan path returns or a join operation produces. From this proportion the sizes of the intermediate and final results can be calculated, and these sizes are then used to calculate the cost.
Here we focus on a simple query on a base table, that is, on selectivity calculation for OpExpr-type expressions. The processing function is clause_selectivity. For a filter condition it calls restriction_selectivity to obtain the selectivity of the OpExpr expression; for a join condition it calls join_selectivity.
SELECT * FROM tt WHERE string4 = 'AAAAxx' contains a filter condition, so restriction_selectivity is called to estimate the selectivity.



restriction_selectivity recognizes string4 = 'AAAAxx' as an equality constraint of the form Var = Const. The selectivity estimation function for each operator is recorded in the system table PG_OPERATOR; for opno = 93 it is eqsel, which in turn calls var_eq_const to estimate the selectivity. In the process, var_eq_const reads the distribution information of column string4 from the PG_STATISTIC table and uses the MCV information to return a selectivity of 0.25 directly.
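The MCV lookup described here can be sketched as follows. The fallback branch for a constant that is not in the MCV list is modeled on the PostgreSQL logic that openGauss derives from, and is an assumption rather than something stated in this article. The function name and parameter list are simplified for illustration and do not match the kernel signature.

```python
def var_eq_const_selectivity(const, mcv_values, mcv_freqs, nullfrac, ndistinct):
    """Estimate the selectivity of `column = const` from column statistics."""
    # Case 1: the constant is a most common value, so its frequency is known.
    for value, freq in zip(mcv_values, mcv_freqs):
        if value == const:
            return freq

    # Case 2 (assumed fallback): take the share of rows that are neither
    # NULL nor an MCV and spread it evenly over the remaining distinct values.
    sel = max(1.0 - sum(mcv_freqs) - nullfrac, 0.0)
    other_distinct = ndistinct - len(mcv_values)
    if other_distinct > 1:
        sel /= other_distinct
    # A value outside the MCV list should not look more common than
    # the least common value inside it.
    if mcv_freqs:
        sel = min(sel, min(mcv_freqs))
    return sel


# Statistics of column string4 in table tt.
values = ["AAAAxx", "HHHHxx", "OOOOxx", "VVVVxx"]
freqs = [0.25, 0.25, 0.25, 0.25]

print(var_eq_const_selectivity("AAAAxx", values, freqs, 0.0, 4))
print(var_eq_const_selectivity("ZZZZxx", values, freqs, 0.0, 4))
```

For 'AAAAxx' the first branch returns the stored frequency 0.25. For a value that is absent from the list, the MCV frequencies already add up to 1, so nothing is left to distribute and the estimate is 0.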

The function set_baserel_size_estimates then calculates the estimated number of rows.
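In essence, the estimate is the table's tuple count multiplied by the combined selectivity of its filter conditions. The sketch below shows that arithmetic for the example query. Multiplying the clause selectivities assumes the conditions are independent, and the rounding with a floor of one row follows PostgreSQL's clamp_row_est; both are simplifications of what the openGauss function does, and the helper names are illustrative.

```python
def clamp_row_est(nrows: float) -> float:
    """Round a row estimate to a whole number and keep it at least 1."""
    return max(1.0, float(round(nrows)))


def estimate_base_rows(reltuples: float, selectivities) -> float:
    """Estimate the rows left after applying all filter conditions."""
    combined = 1.0
    for sel in selectivities:
        # Independence assumption: selectivities simply multiply.
        combined *= sel
    return clamp_row_est(reltuples * combined)


# Table tt has 10000 rows; string4 = 'AAAAxx' has selectivity 0.25.
print(estimate_base_rows(10000, [0.25]))
```

For the example this gives 10000 x 0.25 = 2500 rows, the figure the cost model then works with.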


Function call chain: standard_planner -> subquery_planner -> grouping_planner -> query_planner -> make_one_rel -> set_base_rel_sizes -> set_rel_size -> set_plain_rel_size -> set_baserel_size_estimates -> clauselist_selectivity -> clause_selectivity -> restriction_selectivity -> OidFunctionCall4Coll -> eqsel -> var_eq_const