Mastering the use of auto-analyze in a data warehouse
2022-07-04 17:29:00, Huawei Cloud Developer Alliance
Abstract: Whether ANALYZE runs in a timely manner largely determines how fast SQL executes.
This article is shared from the Huawei Cloud community post "One article to understand how to use autoanalyze [This time, Gauss is not a mathematician]", by leapdb.
Whether ANALYZE runs in a timely manner largely determines how fast SQL executes. GaussDB(DWS) therefore introduced automatic statistics collection, so that users no longer need to worry about whether statistics are stale.
1. Scenarios for automatic collection
There are usually five scenarios in which statistics need to be collected automatically: after batch DML, after incremental DML, after DDL, at query start, and in background scheduled tasks.
Therefore, to avoid unnecessary performance overhead and deadlock risk for DML and DDL, GaussDB(DWS) chooses to trigger analyze before the query starts.
2. How automatic collection works
During SQL execution, GaussDB(DWS) records runtime statistics about inserts, updates, deletes, and queries, and writes them to shared memory after the transaction commits or rolls back.
This information can be queried through the pg_stat_all_tables view, or through the following functions:
pg_stat_get_tuples_inserted   -- cumulative number of rows inserted into the table
pg_stat_get_tuples_updated    -- cumulative number of rows updated in the table
pg_stat_get_tuples_deleted    -- cumulative number of rows deleted from the table
pg_stat_get_tuples_changed    -- number of rows modified since the last analyze
pg_stat_get_last_analyze_time -- time of the last analyze
Therefore, by checking whether "the number of rows modified since the last analyze" recorded in shared memory exceeds a certain threshold, the system can decide whether an analyze is needed.
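As a quick illustration, the counters above can be read for a single table directly from pg_class. This is a sketch; the table name t1 is a placeholder, and the counters reflect activity recorded on the current CN:

```sql
-- Read the runtime counters for a hypothetical table t1 on the current CN.
SELECT pg_stat_get_tuples_inserted(oid)   AS inserted,
       pg_stat_get_tuples_updated(oid)    AS updated,
       pg_stat_get_tuples_deleted(oid)    AS deleted,
       pg_stat_get_tuples_changed(oid)    AS changed_since_analyze,
       pg_stat_get_last_analyze_time(oid) AS last_analyze
FROM pg_class
WHERE relname = 't1';
```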
3. Thresholds for automatic collection
3.1 Global thresholds
autovacuum_analyze_threshold    # minimum number of modified rows for a table to trigger analyze
autovacuum_analyze_scale_factor # fraction of the table that must be modified to trigger analyze
When "the number of rows modified since the last analyze" >= autovacuum_analyze_threshold + estimated table size * autovacuum_analyze_scale_factor, an automatic analyze needs to be triggered.
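The decision rule can be sketched as a single query. This is a simplified sketch, not the engine's internal check: it uses the live-tuple count as the table-size estimate (as the functions in section 6 below also do), reads the global thresholds via current_setting, and uses a placeholder table name t1:

```sql
-- Approximate the automatic trigger condition for a hypothetical table t1.
SELECT pg_stat_get_tuples_changed(oid) >=
       current_setting('autovacuum_analyze_threshold')::int
       + current_setting('autovacuum_analyze_scale_factor')::float8
         * pg_stat_get_live_tuples(oid) AS needs_analyze
FROM pg_class
WHERE relname = 't1';
```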
3.2 Table-level thresholds
-- Set table-level thresholds
ALTER TABLE item SET (autovacuum_analyze_threshold=50);
ALTER TABLE item SET (autovacuum_analyze_scale_factor=0.1);
-- Query the thresholds
postgres=# select pg_options_to_table(reloptions) from pg_class where relname='item';
pg_options_to_table
---------------------------------------
(autovacuum_analyze_threshold,50)
(autovacuum_analyze_scale_factor,0.1)
(2 rows)
-- Reset the thresholds
ALTER TABLE item RESET (autovacuum_analyze_threshold);
ALTER TABLE item RESET (autovacuum_analyze_scale_factor);
Tables differ in their data characteristics, so the threshold for triggering analyze may need to differ as well. Table-level thresholds take precedence over the global ones.
3.3 Check whether a table's modification count exceeds the threshold (current CN only)
postgres=# select pg_stat_get_local_analyze_status('t_analyze'::regclass);
pg_stat_get_local_analyze_status
----------------------------------
Analyze not needed
(1 row)
4. How automatic collection is performed
GaussDB(DWS) automatically analyzes tables in the following scenarios.
- For tables whose statistics are entirely missing or whose modification count has reached the analyze threshold, when the plan is not executed via FQS (Fast Query Shipping), the GUC autoanalyze controls whether statistics are collected automatically in this scenario. The query waits for statistics collection to finish so that a better execution plan can be generated, and then executes the original statement.
- When autovacuum is set to on, the system periodically starts autovacuum threads, which collect statistics in the background for tables whose modification count has reached the analyze threshold.
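To see which of the two mechanisms is enabled on the current CN, you can inspect the corresponding GUCs (the parameter names are the ones used in this article):

```sql
SHOW autoanalyze;  -- optimizer-triggered real-time analyze
SHOW autovacuum;   -- background polling analyze
```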
5. Freezing statistics
5.1 Freezing a column's distinct value
When a column's distinct estimate is persistently inaccurate (for example, when duplicate data keeps accumulating) and the true distinct value is essentially fixed, you can freeze it as follows.
postgres=# alter table lineitem alter l_orderkey set (n_distinct=0.9);
ALTER TABLE
postgres=# select relname,attname,attoptions from pg_attribute a,pg_class c where c.oid=a.attrelid and attname='l_orderkey';
relname | attname | attoptions
----------+------------+------------------
lineitem | l_orderkey | {n_distinct=0.9}
(1 row)
postgres=# alter table lineitem alter l_orderkey reset (n_distinct);
ALTER TABLE
postgres=# select relname,attname,attoptions from pg_attribute a,pg_class c where c.oid=a.attrelid and attname='l_orderkey';
relname | attname | attoptions
----------+------------+------------
lineitem | l_orderkey |
(1 row)
5.2 Freezing all statistics of a table
If a table's data characteristics are largely stable, you can also freeze its statistics to avoid repeated analyzes.
alter table table_name set frozen_stats=true;
6. Manually checking whether a table needs analyze
a. I don't want background database tasks running during peak business hours, so I don't want to rely on autovacuum to trigger analyze. What can I do?
b. A job has modified a number of tables and I want to analyze them immediately, but I don't know which tables they are. What can I do?
c. Before the business peak arrives, I want to analyze the tables that are close to the threshold. What can I do?
To support these cases, we extracted autovacuum's threshold check, which decides whether a table needs analyze, into a function, so users can flexibly and proactively find the tables that need it.
6.1 Determine whether a table needs analyze (serial version, works on all historical versions)
-- Check on every CN of the current cluster whether the table's modification count exceeds its analyze threshold.
CREATE OR REPLACE FUNCTION pg_catalog.pgxc_stat_table_need_analyze(in table_name text)
RETURNS BOOl
AS $$
DECLARE
row_data record;
coor_name record;
fet_active text;
fetch_coor text;
relTuples int4;
changedTuples int4:= 0;
rel_anl_threshold int4;
rel_anl_scale_factor float4;
sys_anl_threshold int4;
sys_anl_scale_factor float4;
anl_threshold int4;
anl_scale_factor float4;
need_analyze bool := false;
BEGIN
--Get all the node names
fetch_coor := 'SELECT node_name FROM pgxc_node WHERE node_type=''C''';
FOR coor_name IN EXECUTE(fetch_coor) LOOP
fet_active := 'EXECUTE DIRECT ON (' || coor_name.node_name || ') ''SELECT pg_stat_get_tuples_changed(oid) from pg_class where relname = ''''' || table_name || ''''';''';
FOR row_data IN EXECUTE(fet_active) LOOP
changedTuples = changedTuples + row_data.pg_stat_get_tuples_changed;
END LOOP;
END LOOP;
EXECUTE 'select pg_stat_get_live_tuples(oid) from pg_class c where c.oid = '''|| table_name ||'''::REGCLASS;' into relTuples;
EXECUTE 'show autovacuum_analyze_threshold;' into sys_anl_threshold;
EXECUTE 'show autovacuum_analyze_scale_factor;' into sys_anl_scale_factor;
EXECUTE 'select (select option_value from pg_options_to_table(c.reloptions) where option_name = ''autovacuum_analyze_threshold'') as value
from pg_class c where c.oid = '''|| table_name ||'''::REGCLASS;' into rel_anl_threshold;
EXECUTE 'select (select option_value from pg_options_to_table(c.reloptions) where option_name = ''autovacuum_analyze_scale_factor'') as value
from pg_class c where c.oid = '''|| table_name ||'''::REGCLASS;' into rel_anl_scale_factor;
--dbms_output.put_line('relTuples='||relTuples||'; sys_anl_threshold='||sys_anl_threshold||'; sys_anl_scale_factor='||sys_anl_scale_factor||'; rel_anl_threshold='||rel_anl_threshold||'; rel_anl_scale_factor='||rel_anl_scale_factor||';');
if rel_anl_threshold IS NOT NULL then
anl_threshold = rel_anl_threshold;
else
anl_threshold = sys_anl_threshold;
end if;
if rel_anl_scale_factor IS NOT NULL then
anl_scale_factor = rel_anl_scale_factor;
else
anl_scale_factor = sys_anl_scale_factor;
end if;
if changedTuples > anl_threshold + anl_scale_factor * relTuples then
need_analyze := true;
end if;
return need_analyze;
END; $$
LANGUAGE 'plpgsql';
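Once created, the function can be called per table, or combined with pg_class to scan for candidates (which addresses question b above). This is a sketch: the OID cutoff of 16384 is the usual boundary that skips system catalogs and is an assumption here.

```sql
-- Check a single table.
SELECT pg_catalog.pgxc_stat_table_need_analyze('t_analyze');

-- List user tables whose modification count exceeds the threshold
-- (may be slow if the catalog is large).
SELECT relname
FROM pg_class c
WHERE c.relkind = 'r'
  AND c.oid >= 16384
  AND pg_catalog.pgxc_stat_table_need_analyze(c.relname);
```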
6.2 Determine whether a table needs analyze (parallel version, for versions that support the parallel execution framework)
-- Check on every CN of the current cluster whether the table's modification count exceeds its analyze threshold.
-- Uses the concurrent execution framework, e.g.: SELECT sum(a) FROM pg_catalog.pgxc_parallel_query('cn', 'SELECT 1::int FROM pg_class LIMIT 10') AS (a int);
CREATE OR REPLACE FUNCTION pg_catalog.pgxc_stat_table_need_analyze(in table_name text)
RETURNS BOOl
AS $$
DECLARE
relTuples int4;
changedTuples int4:= 0;
rel_anl_threshold int4;
rel_anl_scale_factor float4;
sys_anl_threshold int4;
sys_anl_scale_factor float4;
anl_threshold int4;
anl_scale_factor float4;
need_analyze bool := false;
BEGIN
--Sum the modification counts across all CNs
EXECUTE 'SELECT sum(a) FROM pg_catalog.pgxc_parallel_query(''cn'', ''SELECT pg_stat_get_tuples_changed(oid)::int4 from pg_class where relname = ''''' || table_name || ''''';'') AS (a int4);' into changedTuples;
EXECUTE 'select pg_stat_get_live_tuples(oid) from pg_class c where c.oid = '''|| table_name ||'''::REGCLASS;' into relTuples;
EXECUTE 'show autovacuum_analyze_threshold;' into sys_anl_threshold;
EXECUTE 'show autovacuum_analyze_scale_factor;' into sys_anl_scale_factor;
EXECUTE 'select (select option_value from pg_options_to_table(c.reloptions) where option_name = ''autovacuum_analyze_threshold'') as value
from pg_class c where c.oid = '''|| table_name ||'''::REGCLASS;' into rel_anl_threshold;
EXECUTE 'select (select option_value from pg_options_to_table(c.reloptions) where option_name = ''autovacuum_analyze_scale_factor'') as value
from pg_class c where c.oid = '''|| table_name ||'''::REGCLASS;' into rel_anl_scale_factor;
--dbms_output.put_line('relTuples='||relTuples||'; sys_anl_threshold='||sys_anl_threshold||'; sys_anl_scale_factor='||sys_anl_scale_factor||'; rel_anl_threshold='||rel_anl_threshold||'; rel_anl_scale_factor='||rel_anl_scale_factor||';');
if rel_anl_threshold IS NOT NULL then
anl_threshold = rel_anl_threshold;
else
anl_threshold = sys_anl_threshold;
end if;
if rel_anl_scale_factor IS NOT NULL then
anl_scale_factor = rel_anl_scale_factor;
else
anl_scale_factor = sys_anl_scale_factor;
end if;
if changedTuples > anl_threshold + anl_scale_factor * relTuples then
need_analyze := true;
end if;
return need_analyze;
END; $$
LANGUAGE 'plpgsql';
6.3 Determine whether a table needs analyze (custom thresholds)
-- Check whether the table's modification count exceeds user-supplied thresholds.
CREATE OR REPLACE FUNCTION pg_catalog.pgxc_stat_table_need_analyze(in table_name text, in anl_threshold int, in anl_scale_factor float)
RETURNS BOOl
AS $$
DECLARE
relTuples int4;
changedTuples int4:= 0;
need_analyze bool := false;
BEGIN
--Sum the modification counts across all CNs
EXECUTE 'SELECT sum(a) FROM pg_catalog.pgxc_parallel_query(''cn'', ''SELECT pg_stat_get_tuples_changed(oid)::int4 from pg_class where relname = ''''' || table_name || ''''';'') AS (a int4);' into changedTuples;
EXECUTE 'select pg_stat_get_live_tuples(oid) from pg_class c where c.oid = '''|| table_name ||'''::REGCLASS;' into relTuples;
if changedTuples > anl_threshold + anl_scale_factor * relTuples then
need_analyze := true;
end if;
return need_analyze;
END; $$
LANGUAGE 'plpgsql';
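A short usage sketch: pass the thresholds directly to check a table against stricter limits than the global settings (the table name and the numbers here are illustrative, not recommendations):

```sql
-- Flag t1 once more than 1000 rows plus 5% of the table have changed.
SELECT pg_catalog.pgxc_stat_table_need_analyze('t1', 1000, 0.05);
```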
Through "real-time analyze triggered by the optimizer" and "polling analyze triggered by background autovacuum", GaussDB(DWS) already frees users from worrying about whether a table needs analyze. You are encouraged to try it in the latest version.