当前位置:网站首页>Master the use of auto analyze in data warehouse
Master the use of auto analyze in data warehouse
2022-07-04 21:38:00 【Huawei cloud developer Alliance】
Abstract :analyze Whether the implementation is timely , To some extent, it directly determines SQL Speed of execution .
This article is shared from Huawei cloud community 《 Article to read autoanalyze Use 【 Gauss is not a mathematician this time 】》, author : leapdb.
analyze Whether the implementation is timely , To some extent, it directly determines SQL Speed of execution . therefore ,GaussDB(DWS) Automatic statistical information collection is introduced , It can make users no longer worry about whether the statistical information is expired .
1. Automatically collect scenes
There are usually five scenarios where automatic statistical information collection is required : Batch DML At the end , The incremental DML At the end ,DDL At the end , Query start and background scheduled tasks .
therefore , In order to avoid being against DML,DDL Unnecessary performance overhead and deadlock risk , We chose to trigger before the query started analzye.
2. Automatic collection principle
GaussDB(DWS) stay SQL In the process of execution , It will record the addition, deletion, modification and query of relevant runtime Statistics , And record the shared memory after the transaction is committed or rolled back .
This information can be obtained through “pg_stat_all_tables View ” Inquire about , You can also query through the following functions .
pg_stat_get_tuples_inserted -- Table accumulation insert Number of pieces pg_stat_get_tuples_updated -- Table accumulation update Number of pieces pg_stat_get_tuples_deleted -- Table accumulation delete Number of pieces pg_stat_get_tuples_changed -- Table since last analyze since , Number of changes pg_stat_get_last_analyze_time -- Query the last analyze Time
therefore , Based on shared memory " Table since last analyze Number of entries modified since " Whether a certain threshold is exceeded , You can decide whether you need to do analyze 了 .
3. Automatically collect thresholds
3.1 Global threshold
autovacuum_analyze_threshold # The table triggers analyze Minimum modification of autovacuum_analyze_scale_factor # The table triggers analyze Percentage of changes when
When " Table since last analyze Number of entries modified since " >= autovacuum_analyze_threshold + Table estimated size * autovacuum_analyze_scale_factor when , It needs to be triggered automatically analyze.
3.2 Table level threshold
-- Set table level threshold ALTER TABLE item SET (autovacuum_analyze_threshold=50);ALTER TABLE item SET (autovacuum_analyze_scale_factor=0.1);-- Query threshold postgres=# select pg_options_to_table(reloptions) from pg_class where relname='item'; pg_options_to_table --------------------------------------- (autovacuum_analyze_threshold,50) (autovacuum_analyze_scale_factor,0.1)(2 rows)-- Reset threshold ALTER TABLE item RESET (autovacuum_analyze_threshold);ALTER TABLE item RESET (autovacuum_analyze_scale_factor);
The data characteristics of different tables are different , Need to trigger analyze The threshold may have different requirements . The table level threshold priority is higher than the global threshold .
3.3 Check whether the modification amount of the table exceeds the threshold ( Only the current CN)
postgres=# select pg_stat_get_local_analyze_status('t_analyze'::regclass); pg_stat_get_local_analyze_status ---------------------------------- Analyze not needed(1 row)
4. Automatic collection method
GaussDB(DWS) Automatic analysis of the following table in three scenarios is provided .
- When there is “ Statistics are completely missing ” or “ The modification amount reaches analyze threshold ” Table of , And the implementation plan does not take FQS (Fast Query Shipping) Execution time , Through autoanalyze Control the automatic collection of statistical information in the following table in this scenario . here , The query statement will wait for the statistics to be collected successfully , Generate a better execution plan , Then execute the original query statement .
- When autovacuum Set to on when , The system will start regularly autovacuum Threads , Yes “ The modification amount reaches analyze threshold ” The table automatically collects statistical information in the background .
5. Freeze Statistics
5.1 Freeze table distinct value
When a watch distinct It's always inaccurate , for example : Data pile up and repeat the scene . If the watch distinct Fixed value , You can freeze the table in the following ways distinct value .
postgres=# alter table lineitem alter l_orderkey set (n_distinct=0.9);ALTER TABLEpostgres=# select relname,attname,attoptions from pg_attribute a,pg_class c where c.oid=a.attrelid and attname='l_orderkey'; relname | attname | attoptions ----------+------------+------------------ lineitem | l_orderkey | {n_distinct=0.9}(1 row)postgres=# alter table lineitem alter l_orderkey reset (n_distinct);ALTER TABLEpostgres=# select relname,attname,attoptions from pg_attribute a,pg_class c where c.oid=a.attrelid and attname='l_orderkey'; relname | attname | attoptions ----------+------------+------------ lineitem | l_orderkey | (1 row)
5.2. Freeze all statistics of the table
If the data characteristics of the table are basically unchanged , You can also freeze the statistics of the table , To avoid repeating analyze.
alter table table_name set frozen_stats=true;
6. Manually check whether the table needs to be done analyze
a. I don't want to trigger the database background task during the business peak , So I don't want to open autovacuum To trigger analyze, What do I do ?
b. The business has modified a number of tables , I want to do these watches right away analyze, I don't know what watches are there , What do I do ?
c. Before the business peak comes, I want to do a test on the tables near the threshold analyze, What do I do ?
We will autovacuum Check the threshold to determine whether analyze Logic , Extraction becomes a function , Help users flexibly and proactively check which tables need to be done analyze.
6.1 Determine whether the table needs analyze( Serial version , Applicable to all historical versions )
-- the function for get all pg_stat_activity information in all CN of current cluster.CREATE OR REPLACE FUNCTION pg_catalog.pgxc_stat_table_need_analyze(in table_name text)RETURNS BOOlAS $$DECLARE row_data record; coor_name record; fet_active text; fetch_coor text; relTuples int4; changedTuples int4:= 0; rel_anl_threshold int4; rel_anl_scale_factor float4; sys_anl_threshold int4; sys_anl_scale_factor float4; anl_threshold int4; anl_scale_factor float4; need_analyze bool := false; BEGIN --Get all the node names fetch_coor := 'SELECT node_name FROM pgxc_node WHERE node_type=''C'''; FOR coor_name IN EXECUTE(fetch_coor) LOOP fet_active := 'EXECUTE DIRECT ON (' || coor_name.node_name || ') ''SELECT pg_stat_get_tuples_changed(oid) from pg_class where relname = ''''|| table_name ||'''';'''; FOR row_data IN EXECUTE(fet_active) LOOP changedTuples = changedTuples + row_data.pg_stat_get_tuples_changed; END LOOP; END LOOP; EXECUTE 'select pg_stat_get_live_tuples(oid) from pg_class c where c.oid = '''|| table_name ||'''::REGCLASS;' into relTuples; EXECUTE 'show autovacuum_analyze_threshold;' into sys_anl_threshold; EXECUTE 'show autovacuum_analyze_scale_factor;' into sys_anl_scale_factor; EXECUTE 'select (select option_value from pg_options_to_table(c.reloptions) where option_name = ''autovacuum_analyze_threshold'') as value from pg_class c where c.oid = '''|| table_name ||'''::REGCLASS;' into rel_anl_threshold; EXECUTE 'select (select option_value from pg_options_to_table(c.reloptions) where option_name = ''autovacuum_analyze_scale_factor'') as value from pg_class c where c.oid = '''|| table_name ||'''::REGCLASS;' into rel_anl_scale_factor; --dbms_output.put_line('relTuples='||relTuples||'; sys_anl_threshold='||sys_anl_threshold||'; sys_anl_scale_factor='||sys_anl_scale_factor||'; rel_anl_threshold='||rel_anl_threshold||'; rel_anl_scale_factor='||rel_anl_scale_factor||';'); if rel_anl_threshold IS NOT NULL then anl_threshold = rel_anl_threshold; else anl_threshold = sys_anl_threshold; end if; if rel_anl_scale_factor IS NOT NULL then anl_scale_factor = rel_anl_scale_factor; else anl_scale_factor = sys_anl_scale_factor; end if; if changedTuples > anl_threshold + anl_scale_factor * relTuples then need_analyze := true; end if; return need_analyze; END; $$LANGUAGE 'plpgsql';
6.2 Determine whether the table needs analyze( Parallel Edition , For versions that support parallel execution frameworks )
-- the function for get all pg_stat_activity information in all CN of current cluster.--SELECT sum(a) FROM pg_catalog.pgxc_parallel_query('cn', 'SELECT 1::int FROM pg_class LIMIT 10') AS (a int); Using concurrent execution framework CREATE OR REPLACE FUNCTION pg_catalog.pgxc_stat_table_need_analyze(in table_name text)RETURNS BOOlAS $$DECLARE relTuples int4; changedTuples int4:= 0; rel_anl_threshold int4; rel_anl_scale_factor float4; sys_anl_threshold int4; sys_anl_scale_factor float4; anl_threshold int4; anl_scale_factor float4; need_analyze bool := false; BEGIN --Get all the node names EXECUTE 'SELECT sum(a) FROM pg_catalog.pgxc_parallel_query(''cn'', ''SELECT pg_stat_get_tuples_changed(oid)::int4 from pg_class where relname = ''''|| table_name ||'''';'') AS (a int4);' into changedTuples; EXECUTE 'select pg_stat_get_live_tuples(oid) from pg_class c where c.oid = '''|| table_name ||'''::REGCLASS;' into relTuples; EXECUTE 'show autovacuum_analyze_threshold;' into sys_anl_threshold; EXECUTE 'show autovacuum_analyze_scale_factor;' into sys_anl_scale_factor; EXECUTE 'select (select option_value from pg_options_to_table(c.reloptions) where option_name = ''autovacuum_analyze_threshold'') as value from pg_class c where c.oid = '''|| table_name ||'''::REGCLASS;' into rel_anl_threshold; EXECUTE 'select (select option_value from pg_options_to_table(c.reloptions) where option_name = ''autovacuum_analyze_scale_factor'') as value from pg_class c where c.oid = '''|| table_name ||'''::REGCLASS;' into rel_anl_scale_factor; dbms_output.put_line('relTuples='||relTuples||'; sys_anl_threshold='||sys_anl_threshold||'; sys_anl_scale_factor='||sys_anl_scale_factor||'; rel_anl_threshold='||rel_anl_threshold||'; rel_anl_scale_factor='||rel_anl_scale_factor||';'); if rel_anl_threshold IS NOT NULL then anl_threshold = rel_anl_threshold; else anl_threshold = sys_anl_threshold; end if; if rel_anl_scale_factor IS NOT NULL then anl_scale_factor = rel_anl_scale_factor; else anl_scale_factor = sys_anl_scale_factor; end if; if changedTuples > anl_threshold + anl_scale_factor * relTuples then need_analyze := true; end if; return need_analyze; END; $$LANGUAGE 'plpgsql';
6.3 Determine whether the table needs analyze( Custom threshold )
-- the function for get all pg_stat_activity information in all CN of current cluster.CREATE OR REPLACE FUNCTION pg_catalog.pgxc_stat_table_need_analyze(in table_name text, int anl_threshold, float anl_scale_factor)RETURNS BOOlAS $$DECLARE relTuples int4; changedTuples int4:= 0; need_analyze bool := false; BEGIN --Get all the node names EXECUTE 'SELECT sum(a) FROM pg_catalog.pgxc_parallel_query(''cn'', ''SELECT pg_stat_get_tuples_changed(oid)::int4 from pg_class where relname = ''''|| table_name ||'''';'') AS (a int4);' into changedTuples; EXECUTE 'select pg_stat_get_live_tuples(oid) from pg_class c where c.oid = '''|| table_name ||'''::REGCLASS;' into relTuples; if changedTuples > anl_threshold + anl_scale_factor * relTuples then need_analyze := true; end if; return need_analyze; END; $$LANGUAGE 'plpgsql';
through “ Optimizer triggered real-time analyze” and “ backstage autovacuum Triggered polling analyze”,GaussDB(DWS) It has been possible to make users no longer care about whether the table needs analyze. It is recommended to try in the latest version .
Click to follow , The first time to learn about Huawei's new cloud technology ~
边栏推荐
- 2021 CCPC Harbin B. magical subsequence (thinking question)
- Kubeadm初始化报错:[ERROR CRI]: container runtime is not running
- 巅峰不止,继续奋斗!城链科技数字峰会于重庆隆重举行
- A quick start to fastdfs takes you three minutes to upload and download files to the ECS
- Difference between ApplicationContext and beanfactory (MS)
- Jerry's ad series MIDI function description [chapter]
- Redis cache
- Use of redis publish subscription
- 华为ensp模拟器 三层交换机
- 超详细教程,一文入门Istio架构原理及实战应用
猜你喜欢
Compréhension approfondie du symbole [langue C]
开源之夏专访|Apache IoTDB社区 新晋Committer谢其骏
redis03——Redis的网络配置与心跳机制
杰理之AD 系列 MIDI 功能说明【篇】
At the right time, the Guangzhou station of the city chain science and Technology Strategy Summit was successfully held
解析steam教育中蕴含的众创空间
Methods of improving machine vision system
【活动早知道】LiveVideoStack近期活动一览
Introduction to pressure measurement of JMeter
torch.tensor和torch.Tensor的区别
随机推荐
【公开课预告】:视频质量评价基础与实践
IIC (STM32)
Jerry's ad series MIDI function description [chapter]
面试官:说说XSS攻击是什么?
杰理之AD 系列 MIDI 功能说明【篇】
Compréhension approfondie du symbole [langue C]
Daily question-leetcode556-next larger element iii-string-double pointer-next_ permutation
redis事务
华为模拟器ensp的路由配置以及连通测试
华为ensp模拟器 配置ACL访问控制列表
Huawei ENSP simulator layer 3 switch
How was MP3 born?
Day24:文件系统
Jerry's ad series MIDI function description [chapter]
ArcGIS 10.2.2 | solution to the failure of ArcGIS license server to start
华为ensp模拟器 三层交换机
旋变串判断
Redis cache
Huawei ENSP simulator enables devices of multiple routers to access each other
Maidong Internet won the bid of Beijing life insurance