当前位置:网站首页>Master the use of auto analyze in data warehouse
Master the use of auto analyze in data warehouse
2022-07-04 17:29:00 【Huawei cloud developer Alliance】
Abstract :analyze Whether the implementation is timely , To some extent, it directly determines SQL Speed of execution .
This article is shared from Huawei cloud community 《 Article to read autoanalyze Use 【 Gauss is not a mathematician this time 】》, author : leapdb.
analyze Whether the implementation is timely , To some extent, it directly determines SQL Speed of execution . therefore ,GaussDB(DWS) Automatic statistical information collection is introduced , It can make users no longer worry about whether the statistical information is expired .
1. Automatically collect scenes
There are usually five scenarios where automatic statistical information collection is required : Batch DML At the end , The incremental DML At the end ,DDL At the end , Query start and background scheduled tasks .
therefore , In order to avoid being against DML,DDL Unnecessary performance overhead and deadlock risk , We chose to trigger before the query started analzye.
2. Automatic collection principle
GaussDB(DWS) stay SQL In the process of execution , It will record the addition, deletion, modification and query of relevant runtime Statistics , And record the shared memory after the transaction is committed or rolled back .
This information can be obtained through “pg_stat_all_tables View ” Inquire about , You can also query through the following functions .
pg_stat_get_tuples_inserted -- Table accumulation insert Number of pieces
pg_stat_get_tuples_updated -- Table accumulation update Number of pieces
pg_stat_get_tuples_deleted -- Table accumulation delete Number of pieces
pg_stat_get_tuples_changed -- Table since last analyze since , Number of changes
pg_stat_get_last_analyze_time -- Query the last analyze Time
therefore , Based on shared memory " Table since last analyze Number of entries modified since " Whether a certain threshold is exceeded , You can decide whether you need to do analyze 了 .
3. Automatically collect thresholds
3.1 Global threshold
autovacuum_analyze_threshold # The table triggers analyze Minimum modification of
autovacuum_analyze_scale_factor # The table triggers analyze Percentage of changes when
When " Table since last analyze Number of entries modified since " >= autovacuum_analyze_threshold + Table estimated size * autovacuum_analyze_scale_factor when , It needs to be triggered automatically analyze.
3.2 Table level threshold
-- Set table level threshold
ALTER TABLE item SET (autovacuum_analyze_threshold=50);
ALTER TABLE item SET (autovacuum_analyze_scale_factor=0.1);
-- Query threshold
postgres=# select pg_options_to_table(reloptions) from pg_class where relname='item';
pg_options_to_table
---------------------------------------
(autovacuum_analyze_threshold,50)
(autovacuum_analyze_scale_factor,0.1)
(2 rows)
-- Reset threshold
ALTER TABLE item RESET (autovacuum_analyze_threshold);
ALTER TABLE item RESET (autovacuum_analyze_scale_factor);
The data characteristics of different tables are different , Need to trigger analyze The threshold may have different requirements . The table level threshold priority is higher than the global threshold .
3.3 Check whether the modification amount of the table exceeds the threshold ( Only the current CN)
postgres=# select pg_stat_get_local_analyze_status('t_analyze'::regclass);
pg_stat_get_local_analyze_status
----------------------------------
Analyze not needed
(1 row)
4. Automatic collection method
GaussDB(DWS) Automatic analysis of the following table in three scenarios is provided .
- When there is “ Statistics are completely missing ” or “ The modification amount reaches analyze threshold ” Table of , And the implementation plan does not take FQS (Fast Query Shipping) Execution time , Through autoanalyze Control the automatic collection of statistical information in the following table in this scenario . here , The query statement will wait for the statistics to be collected successfully , Generate a better execution plan , Then execute the original query statement .
- When autovacuum Set to on when , The system will start regularly autovacuum Threads , Yes “ The modification amount reaches analyze threshold ” The table automatically collects statistical information in the background .
5. Freeze Statistics
5.1 Freeze table distinct value
When a watch distinct It's always inaccurate , for example : Data pile up and repeat the scene . If the watch distinct Fixed value , You can freeze the table in the following ways distinct value .
postgres=# alter table lineitem alter l_orderkey set (n_distinct=0.9);
ALTER TABLE
postgres=# select relname,attname,attoptions from pg_attribute a,pg_class c where c.oid=a.attrelid and attname='l_orderkey';
relname | attname | attoptions
----------+------------+------------------
lineitem | l_orderkey | {n_distinct=0.9}
(1 row)
postgres=# alter table lineitem alter l_orderkey reset (n_distinct);
ALTER TABLE
postgres=# select relname,attname,attoptions from pg_attribute a,pg_class c where c.oid=a.attrelid and attname='l_orderkey';
relname | attname | attoptions
----------+------------+------------
lineitem | l_orderkey |
(1 row)
5.2. Freeze all statistics of the table
If the data characteristics of the table are basically unchanged , You can also freeze the statistics of the table , To avoid repeating analyze.
alter table table_name set frozen_stats=true;
6. Manually check whether the table needs to be done analyze
a. I don't want to trigger the database background task during the business peak , So I don't want to open autovacuum To trigger analyze, What do I do ?
b. The business has modified a number of tables , I want to do these watches right away analyze, I don't know what watches are there , What do I do ?
c. Before the business peak comes, I want to do a test on the tables near the threshold analyze, What do I do ?
We will autovacuum Check the threshold to determine whether analyze Logic , Extraction becomes a function , Help users flexibly and proactively check which tables need to be done analyze.
6.1 Determine whether the table needs analyze( Serial version , Applicable to all historical versions )
-- the function for get all pg_stat_activity information in all CN of current cluster.
CREATE OR REPLACE FUNCTION pg_catalog.pgxc_stat_table_need_analyze(in table_name text)
RETURNS BOOl
AS $$
DECLARE
row_data record;
coor_name record;
fet_active text;
fetch_coor text;
relTuples int4;
changedTuples int4:= 0;
rel_anl_threshold int4;
rel_anl_scale_factor float4;
sys_anl_threshold int4;
sys_anl_scale_factor float4;
anl_threshold int4;
anl_scale_factor float4;
need_analyze bool := false;
BEGIN
--Get all the node names
fetch_coor := 'SELECT node_name FROM pgxc_node WHERE node_type=''C''';
FOR coor_name IN EXECUTE(fetch_coor) LOOP
fet_active := 'EXECUTE DIRECT ON (' || coor_name.node_name || ') ''SELECT pg_stat_get_tuples_changed(oid) from pg_class where relname = ''''|| table_name ||'''';''';
FOR row_data IN EXECUTE(fet_active) LOOP
changedTuples = changedTuples + row_data.pg_stat_get_tuples_changed;
END LOOP;
END LOOP;
EXECUTE 'select pg_stat_get_live_tuples(oid) from pg_class c where c.oid = '''|| table_name ||'''::REGCLASS;' into relTuples;
EXECUTE 'show autovacuum_analyze_threshold;' into sys_anl_threshold;
EXECUTE 'show autovacuum_analyze_scale_factor;' into sys_anl_scale_factor;
EXECUTE 'select (select option_value from pg_options_to_table(c.reloptions) where option_name = ''autovacuum_analyze_threshold'') as value
from pg_class c where c.oid = '''|| table_name ||'''::REGCLASS;' into rel_anl_threshold;
EXECUTE 'select (select option_value from pg_options_to_table(c.reloptions) where option_name = ''autovacuum_analyze_scale_factor'') as value
from pg_class c where c.oid = '''|| table_name ||'''::REGCLASS;' into rel_anl_scale_factor;
--dbms_output.put_line('relTuples='||relTuples||'; sys_anl_threshold='||sys_anl_threshold||'; sys_anl_scale_factor='||sys_anl_scale_factor||'; rel_anl_threshold='||rel_anl_threshold||'; rel_anl_scale_factor='||rel_anl_scale_factor||';');
if rel_anl_threshold IS NOT NULL then
anl_threshold = rel_anl_threshold;
else
anl_threshold = sys_anl_threshold;
end if;
if rel_anl_scale_factor IS NOT NULL then
anl_scale_factor = rel_anl_scale_factor;
else
anl_scale_factor = sys_anl_scale_factor;
end if;
if changedTuples > anl_threshold + anl_scale_factor * relTuples then
need_analyze := true;
end if;
return need_analyze;
END; $$
LANGUAGE 'plpgsql';
6.2 Determine whether the table needs analyze( Parallel Edition , For versions that support parallel execution frameworks )
-- the function for get all pg_stat_activity information in all CN of current cluster.
--SELECT sum(a) FROM pg_catalog.pgxc_parallel_query('cn', 'SELECT 1::int FROM pg_class LIMIT 10') AS (a int); Using concurrent execution framework
CREATE OR REPLACE FUNCTION pg_catalog.pgxc_stat_table_need_analyze(in table_name text)
RETURNS BOOl
AS $$
DECLARE
relTuples int4;
changedTuples int4:= 0;
rel_anl_threshold int4;
rel_anl_scale_factor float4;
sys_anl_threshold int4;
sys_anl_scale_factor float4;
anl_threshold int4;
anl_scale_factor float4;
need_analyze bool := false;
BEGIN
--Get all the node names
EXECUTE 'SELECT sum(a) FROM pg_catalog.pgxc_parallel_query(''cn'', ''SELECT pg_stat_get_tuples_changed(oid)::int4 from pg_class where relname = ''''|| table_name ||'''';'') AS (a int4);' into changedTuples;
EXECUTE 'select pg_stat_get_live_tuples(oid) from pg_class c where c.oid = '''|| table_name ||'''::REGCLASS;' into relTuples;
EXECUTE 'show autovacuum_analyze_threshold;' into sys_anl_threshold;
EXECUTE 'show autovacuum_analyze_scale_factor;' into sys_anl_scale_factor;
EXECUTE 'select (select option_value from pg_options_to_table(c.reloptions) where option_name = ''autovacuum_analyze_threshold'') as value
from pg_class c where c.oid = '''|| table_name ||'''::REGCLASS;' into rel_anl_threshold;
EXECUTE 'select (select option_value from pg_options_to_table(c.reloptions) where option_name = ''autovacuum_analyze_scale_factor'') as value
from pg_class c where c.oid = '''|| table_name ||'''::REGCLASS;' into rel_anl_scale_factor;
dbms_output.put_line('relTuples='||relTuples||'; sys_anl_threshold='||sys_anl_threshold||'; sys_anl_scale_factor='||sys_anl_scale_factor||'; rel_anl_threshold='||rel_anl_threshold||'; rel_anl_scale_factor='||rel_anl_scale_factor||';');
if rel_anl_threshold IS NOT NULL then
anl_threshold = rel_anl_threshold;
else
anl_threshold = sys_anl_threshold;
end if;
if rel_anl_scale_factor IS NOT NULL then
anl_scale_factor = rel_anl_scale_factor;
else
anl_scale_factor = sys_anl_scale_factor;
end if;
if changedTuples > anl_threshold + anl_scale_factor * relTuples then
need_analyze := true;
end if;
return need_analyze;
END; $$
LANGUAGE 'plpgsql';
6.3 Determine whether the table needs analyze( Custom threshold )
-- the function for get all pg_stat_activity information in all CN of current cluster.
CREATE OR REPLACE FUNCTION pg_catalog.pgxc_stat_table_need_analyze(in table_name text, int anl_threshold, float anl_scale_factor)
RETURNS BOOl
AS $$
DECLARE
relTuples int4;
changedTuples int4:= 0;
need_analyze bool := false;
BEGIN
--Get all the node names
EXECUTE 'SELECT sum(a) FROM pg_catalog.pgxc_parallel_query(''cn'', ''SELECT pg_stat_get_tuples_changed(oid)::int4 from pg_class where relname = ''''|| table_name ||'''';'') AS (a int4);' into changedTuples;
EXECUTE 'select pg_stat_get_live_tuples(oid) from pg_class c where c.oid = '''|| table_name ||'''::REGCLASS;' into relTuples;
if changedTuples > anl_threshold + anl_scale_factor * relTuples then
need_analyze := true;
end if;
return need_analyze;
END; $$
LANGUAGE 'plpgsql';
through “ Optimizer triggered real-time analyze” and “ backstage autovacuum Triggered polling analyze”,GaussDB(DWS) It has been possible to make users no longer care about whether the table needs analyze. It is recommended to try in the latest version .
Click to follow , The first time to learn about Huawei's new cloud technology ~
边栏推荐
- 【云原生】服务网格是什么“格”?
- To sort out messy header files, I use include what you use
- Implementation of super large-scale warehouse clusters in large commercial banks
- Solution du système de gestion de la chaîne d'approvisionnement du parc logistique intelligent
- PingCode 性能测试之负载测试实践
- 动态规划股票问题对比
- What grade does Anxin securities belong to? Is it safe to open an account
- 长城证券安全不 证券开户
- 【测试开发】软件测试——基础篇
- 开发者,MySQL专栏完更,助你轻松从安装到入门进阶
猜你喜欢
容器环境minor gc异常频繁分析
Blood spitting finishing nanny level series tutorial - play Fiddler bag grabbing tutorial (2) - first meet fiddler, let you have a rational understanding
Years of training, towards Kata 3.0! Enter the safe container experience out of the box | dragon lizard Technology
Hidden corners of coder Edition: five things that developers hate most
PingCode 性能测试之负载测试实践
S2b2b solution for lighting industry: efficiently enable the industrial supply chain and improve the economic benefits of enterprises
Redis 的内存淘汰策略和过期删除策略的区别
2022年国内云管平台厂商哪家好?为什么?
It's too convenient. You can complete the code release and approval by nailing it!
La 18e Conférence internationale de l'IET sur le transport d'électricité en courant alternatif et en courant continu (acdc2022) s'est tenue avec succès en ligne.
随机推荐
新的职业已经出现,怎么能够停滞不前 ,人社部公布建筑新职业
世界环境日 | 周大福用心服务推动减碳环保
Go language loop statement (under Lesson 10)
智捷云——元宇宙综合解决方案服务商
kaili不能输入中文怎么办???
电子元器件B2B商城系统开发:赋能企业构建进销存标准化流程实例
Is it safe for Anxin securities to open an account online? Is the account opening fee charged
太方便了,钉钉上就可完成代码发布审批啦!
GO开发:如何利用Go单例模式保障流媒体高并发的安全性?
Is it safe for Great Wall Securities to open an account? How to open a securities account
PyTorch深度学习快速入门教程
【Go ~ 0到1 】 第六天 文件的读写与创建
Capvision Rongying's prospectus in Hong Kong was "invalid": it was strictly questioned by the CSRC and required supplementary disclosure
ble HCI 流控机制
动态规划股票问题对比
The Ministry of human resources and Social Security announced the new construction occupation
Offline and open source version of notation -- comprehensive evaluation of note taking software anytype
NFT流动性市场安全问题频发—NFT交易平台Quixotic被黑事件分析
超大规模数仓集群在大型商业银行的落地实践
Redis 的内存淘汰策略和过期删除策略的区别