当前位置:网站首页>Simple query cost estimation
Simple query cost estimation
2022-07-05 18:39:00 【InfoQ】
Query cost estimate —— How to choose an optimal execution path
- Lexical analysis: Describe the lexical analyzer *.l Document management Lex Tool build lex.yy.c, Again by C The compiler generates an executable lexical analyzer . The basic function is to set a bunch of strings according to the reserved keywords and non reserved keywords , Convert to the corresponding identifier (Tokens).
- Syntax analysis: Syntax rule description file *.y the YACC Tool build gram.c, Again by C The compiler generates an executable parser . The basic function is to put a bunch of identifiers (Tokens) According to the set grammar rules , Into the original syntax tree .
- Analysis rewrite: Query analysis transforms the original syntax tree into a query syntax tree ( Various transform); Query rewriting is based on pg_rewrite Rules in rewrite tables and views , Finally, we get the query syntax tree .
- Query optimization: After logical optimization and physical optimization ( Generate the optimal path ).
- Query plan generation: Transform the optimal query path into a query plan .
- Query execution: Execute the query plan through the executor to generate the result set of the query .
Why do you need these statistics , Is this enough ?
- Is it an empty array
- Is it a constant array
- Is it the only one without repetition
- Is it orderly
- Is it monotonous , Isochromatic , Equivalency
- What are the most frequent numbers
- Distribution law of data ( The data growth trend is depicted in the dotted way )
How to estimate query cost based on statistical information
- The statistics are only from part of the sampled data , Can not accurately describe the global characteristics .
- Statistics are historical data , The actual data may change at any time .
The simplest table row count estimate ( Use relpages and reltuples)
SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1'; --- Sampling estimation has 358 A page ,10000 Bar tuple
relpages | reltuples
----------+-----------
358 | 10000
EXPLAIN SELECT * FROM tenk1; --- Query and estimate that the table has 10000 Bar tuple
QUERY PLAN
-------------------------------------------------------------
Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244)
- First, calculate the actual number of pages by the size of the physical file of the table actual_pages = file_size / BLOCKSIZE
- Calculate the tuple density of the page
- a. If relpages Greater than 0, density = reltuples / (double) relpages
- b. If relpages It's empty ,density = (BLCKSZ - SizeOfPageHeaderData) / tuple_width, Page size / Tuple width .
- Estimate the number of table tuples = Page tuple density * Actual pages = density * actual_pages
The simplest range comparison ( Use histogram )
SELECT histogram_bounds FROM pg_stats WHERE tablename='tenk1' AND attname='unique1';
histogram_bounds
------------------------------------------------------
{0,993,1997,3050,4040,5036,5957,7057,8029,9016,9995}
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000; --- Estimate less than 1000 There are 1007 strip
QUERY PLAN
--------------------------------------------------------------------------------
Bitmap Heap Scan on tenk1 (cost=24.06..394.64 rows=1007 width=244)
Recheck Cond: (unique1 < 1000)
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..23.80 rows=1007 width=0)
Index Cond: (unique1 < 1000)
Selection rate = ( Total number of barrels in front + The range of the target value in the current bucket / The range of the current bucket ) / Total number of barrels
selectivity = (1 + (1000 - bucket[2].min) / (bucket[2].max - bucket[2].min)) / num_buckets
= (1 + (1000 - 993) / (1997 - 993)) / 10
= 0.100697
Estimate the number of tuples = Number of base table tuples * Conditional selection rate
rows = rel_cardinality * selectivity
= 10000 * 0.100697
= 1007 (rounding off)
The simplest equivalent comparison ( Use MCV)
SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats WHERE tablename='tenk1' AND attname='stringu1';
null_frac | 0
n_distinct | 676
most_common_vals | {EJAAAA,BBAAAA,CRAAAA,FCAAAA,FEAAAA,GSAAAA,JOAAAA,MCAAAA,NAAAAA,WGAAAA}
most_common_freqs | {0.00333333,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003}
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'CRAAAA'; ---
QUERY PLAN
----------------------------------------------------------
Seq Scan on tenk1 (cost=0.00..483.00 rows=30 width=244)
Filter: (stringu1 = 'CRAAAA'::name)
CRAAAA be located MCV No 3 term , The proportion is 0.003
selectivity = mcf[3]
= 0.003
rows = 10000 * 0.003
= 30
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx'; --- Search does not exist in MCV The value in
QUERY PLAN
----------------------------------------------------------
Seq Scan on tenk1 (cost=0.00..483.00 rows=15 width=244)
Filter: (stringu1 = 'xxx'::name)
selectivity = (1 - sum(mvf)) / (num_distinct - num_mcv)
= (1 - (0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003 +
0.003 + 0.003 + 0.003 + 0.003)) / (676 - 10)
= 0.0014559
rows = 10000 * 0.0014559
= 15 (rounding off)
The more complicated range is ( Use at the same time MCV And histogram )
SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats WHERE tablename='tenk1' AND attname='stringu1';
null_frac | 0
n_distinct | 676
most_common_vals | {EJAAAA,BBAAAA,CRAAAA,FCAAAA,FEAAAA,GSAAAA,JOAAAA,MCAAAA,NAAAAA,WGAAAA}
most_common_freqs | {0.00333333,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003,0.003}
SELECT histogram_bounds FROM pg_stats WHERE tablename='tenk1' AND attname='stringu1';
histogram_bounds
--------------------------------------------------------------------------------
{AAAAAA,CQAAAA,FRAAAA,IBAAAA,KRAAAA,NFAAAA,PSAAAA,SGAAAA,VAAAAA,XLAAAA,ZZAAAA}
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 < 'IAAAAA'; --- Find one that does not exist in MCV The value in
QUERY PLAN
------------------------------------------------------------
Seq Scan on tenk1 (cost=0.00..483.00 rows=3077 width=244)
Filter: (stringu1 < 'IAAAAA'::name)
selectivity = sum(relevant mvfs)
= 0.00333333 + 0.003 + 0.003 + 0.003 + 0.003 + 0.003
= 0.01833333
selectivity = mcv_selectivity + histogram_selectivity * histogram_fraction
= 0.01833333 + ((2 + ('IAAAAA'-'FRAAAA')/('IBAAAA'-'FRAAAA')) / 10) * (1 - sum(mvfs))
= 0.01833333 + 0.298387 * 0.96966667
= 0.307669
rows = 10000 * 0.307669
= 3077 (rounding off) Copy
Example of multi condition joint query
EXPLAIN SELECT * FROM tenk1 WHERE unique1 < 1000 AND stringu1 = 'xxx';
QUERY PLAN
--------------------------------------------------------------------------------
Bitmap Heap Scan on tenk1 (cost=23.80..396.91 rows=1 width=244)
Recheck Cond: (unique1 < 1000)
Filter: (stringu1 = 'xxx'::name)
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..23.80 rows=1007 width=0)
Index Cond: (unique1 < 1000)
selectivity = selectivity(unique1 < 1000) * selectivity(stringu1 = 'xxx')
= 0.100697 * 0.0014559
= 0.0001466
rows = 10000 * 0.0001466
= 1 (rounding off)
A use JOIN Example
EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2
WHERE t1.unique1 < 50 AND t1.unique2 = t2.unique2;
QUERY PLAN
--------------------------------------------------------------------------------------
Nested Loop (cost=4.64..456.23 rows=50 width=488)
-> Bitmap Heap Scan on tenk1 t1 (cost=4.64..142.17 rows=50 width=244)
Recheck Cond: (unique1 < 50)
-> Bitmap Index Scan on tenk1_unique1 (cost=0.00..4.63 rows=50 width=0)
Index Cond: (unique1 < 50)
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.00..6.27 rows=1 width=244)
Index Cond: (unique2 = t1.unique2)
selectivity = (0 + (50 - bucket[1].min)/(bucket[1].max - bucket[1].min))/num_buckets
= (0 + (50 - 0)/(993 - 0))/10
= 0.005035
rows = 10000 * 0.005035
= 50 (rounding off)
SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats
WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2';
tablename | null_frac | n_distinct | most_common_vals
-----------+-----------+------------+------------------
tenk1 | 0 | -1 |
tenk2 | 0 | -1 |
【*】 Selection rate = Multiply the non empty probabilities on both sides , Divide by the maximum JOIN Number of pieces .
selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2)
= (1 - 0) * (1 - 0) / max(10000, 10000)
= 0.0001
【*】JOIN The function estimation of is the Cartesian product of the input quantities on both sides , Multiply by the selection rate
rows = (outer_cardinality * inner_cardinality) * selectivity
= (50 * 10000) * 0.0001
= 50
More details
边栏推荐
- How to automatically install pythn third-party libraries
- 进程间通信(IPC):共享内存
- Find in MySQL_ in_ Detailed explanation of set() function usage
- Solutions contents have differences only in line separators
- 如何写出好代码 - 防御式编程
- Case sharing | integrated construction of data operation and maintenance in the financial industry
- sample_rate(采樣率),sample(采樣),duration(時長)是什麼關系
- Exemple Quelle est la relation entre le taux d'échantillonnage, l'échantillon et la durée?
- websocket 工具的使用
- lombok @Builder注解
猜你喜欢
SAP feature description
How to write good code defensive programming
buuctf-pwn write-ups (9)
进程间通信(IPC):共享内存
Tupu software digital twin | visual management system based on BIM Technology
ConvMAE(2022-05)
buuctf-pwn write-ups (9)
Nacos distributed transactions Seata * * install JDK on Linux, mysql5.7 start Nacos configure ideal call interface coordination (nanny level detail tutorial)
Solutions contents have differences only in line separators
About statistical power
随机推荐
【在优麒麟上使用Electron开发桌面应】
LeetCode 6111. 螺旋矩阵 IV
Cronab log: how to record the output of my cron script
Let more young people from Hong Kong and Macao know about Nansha's characteristic cultural and creative products! "Nansha kylin" officially appeared
音视频包的pts,dts,duration的由来.
金太阳开户安全吗?万一免5开户能办理吗?
LeetCode 6109. Number of people who know the secret
技术分享 | 接口测试价值与体系
【Autosar 十四 启动流程详解】
Einstein sum einsum
2022年阿里Android高级面试题分享,2022阿里手淘Android面试题目
Linear table - abstract data type
蚂蚁集团开源可信隐私计算框架「隐语」:开放、通用
LeetCode 6111. Spiral matrix IV
关于服装ERP,你想知道的都在这里了
The easycvr platform reports an error "ID cannot be empty" through the interface editing channel. What is the reason?
Clickhouse (03) how to install and deploy Clickhouse
@Extension、@SPI注解原理
U-Net: Convolutional Networks for Biomedical Images Segmentation
案例分享|金融业数据运营运维一体化建设