
DB4AI: Database-Driven AI

2022-06-11 15:48:00  Gauss Squirrel Club

DB4AI refers to using the database to drive AI tasks, realizing unified data storage and a homogeneous technology stack. By integrating AI algorithms into the database, openGauss gains native capabilities for an AI compute engine, model management, AI operators, and AI execution plans, providing users with inclusive AI technology. Unlike the traditional AI modeling process, DB4AI's "one-stop" modeling eliminates the repeated movement of data across platforms, simplifies the development workflow, and lets the database plan the optimal execution path, so developers can focus on tuning their business logic and models. This gives it usability and performance advantages that comparable products lack.

I. Native DB4AI Engine

II. Full-Process AI

I. Native DB4AI Engine

The current version of openGauss supports native DB4AI capability. By introducing native AI operators, it simplifies the workflow and makes full use of the database optimizer and executor to deliver high-performance in-database model training. The streamlined training and prediction flow and the higher performance let developers focus on model tuning and data analysis in less time, avoiding fragmented technology stacks and redundant code.

Keyword parsing

Table 1  DB4AI syntax and keywords

  Syntax:
    CREATE MODEL   Creates and trains a model, then saves it.
    PREDICT BY     Performs inference with an existing model.
    DROP MODEL     Deletes a model.

  Keywords:
    TARGET         Target column name(s) of the training/inference task.
    FEATURES       Feature column name(s) of the training/inference task.
    MODEL          Model name of the training task.
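As a hedged sketch of how the three statements in Table 1 fit together (the table tb_iris_1 and its columns are carried over from the examples below; the model name iris_demo is an illustrative assumption, and DROP MODEL is shown here because no later example uses it):

```sql
-- Minimal end-to-end sketch (assumes table tb_iris_1 with the iris columns).
-- 1. Create, train, and save a model.
CREATE MODEL iris_demo USING logistic_regression
    FEATURES sepal_length, sepal_width TARGET target_type < 2
    FROM tb_iris_1;

-- 2. Run inference with the saved model.
SELECT id, PREDICT BY iris_demo (FEATURES sepal_length, sepal_width)
    FROM tb_iris_1;

-- 3. Delete the model when it is no longer needed.
DROP MODEL iris_demo;
```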

Instructions for use

  1. Overview of algorithms supported in this version.

    The algorithms newly supported by DB4AI in the current version are as follows:

    Table 2  Supported algorithms

      Optimizer   Algorithms
      GD          logistic_regression, linear_regression, svm_classification, pca, multiclass
      Kmeans      kmeans
      xgboost     xgboost_regression_logistic, xgboost_binary_logistic, xgboost_regression_squarederror, xgboost_regression_gamma

  2. Model training syntax description.

    • CREATE MODEL

      The "CREATE MODEL" statement creates, trains, and saves a model. The model-training SQL statements below use the public iris dataset.

    • Train a model: from the training set tb_iris_1, specify sepal_length, sepal_width, petal_length, and petal_width as the feature columns, use the xgboost_regression_logistic algorithm, and create and save the model iris_classification_model.

      openGauss=# CREATE MODEL iris_classification_model USING xgboost_regression_logistic FEATURES sepal_length, sepal_width,petal_length,petal_width TARGET target_type < 2 FROM tb_iris_1 WITH nthread=4, max_depth=8;
      MODEL CREATED. PROCESSED 1
      

      In the above command:

      • The "CREATE MODEL" statement trains and saves the model.
      • The USING keyword specifies the algorithm name.
      • FEATURES specifies the features used to train the model; add them according to the column names of the training table.
      • TARGET specifies the training target of the model. It can be the name of a column in the training table, or an expression, for example: price > 10000.
      • WITH specifies the hyperparameters used for training. When a hyperparameter is not set by the user, the framework uses its default value.

        For different operators, the framework supports different hyperparameter combinations:

        Table 3  Hyperparameters supported by each operator

          GD (logistic_regression, linear_regression, svm_classification):
            optimizer(char); verbose(bool); max_iterations(int); max_seconds(double); batch_size(int); learning_rate(double); decay(double); tolerance(double)
            SVM additionally supports the hyperparameter lambda(double).

          Kmeans:
            max_iterations(int); num_centroids(int); tolerance(double); batch_size(int); num_features(int); distance_function(char); seeding_function(char); verbose(int); seed(int)

          GD (pca):
            batch_size(int); max_iterations(int); max_seconds(int); tolerance(float8); verbose(bool); number_components(int); seed(int)

          GD (multiclass):
            classifier(char)
            Note: the other hyperparameters of multiclass depend on the classifier selected.

          xgboost_regression_logistic, xgboost_binary_logistic, xgboost_regression_squarederror, xgboost_regression_gamma:
            batch_size(int); booster(char); tree_method(char); eval_metric(char*); seed(int); nthread(int); max_depth(int); gamma(float8); eta(float8); min_child_weight(int); verbosity(int)

        The defaults and value ranges of the hyperparameters are as follows:

        Table 4  Hyperparameter defaults and value ranges
        (format: hyperparameter = default | value range | description)

          GD: logistic_regression, linear_regression, svm_classification, pca
            optimizer = gd (gradient descent) | gd / ngd (natural gradient descent) | Optimizer
            verbose = false | T/F | Log output
            max_iterations = 100 | (0, 10000] | Maximum number of iterations
            max_seconds = 0 (no limit on run time) | [0, INT_MAX_VALUE] | Run-time limit
            batch_size = 1000 | (0, 1048575] | Number of samples per training batch
            learning_rate = 0.8 | (0, DOUBLE_MAX_VALUE] | Learning rate
            decay = 0.95 | (0, DOUBLE_MAX_VALUE] | Weight decay rate
            tolerance = 0.0005 | (0, DOUBLE_MAX_VALUE] | Convergence tolerance
            seed = 0 (0 means a random seed) | [0, INT_MAX_VALUE] | Seed
            SVM only: kernel = "linear" | linear / gaussian / polynomial | Kernel function
            SVM only: components = MAX(2*features, 128) | [0, INT_MAX_VALUE] | Dimension of the high-dimensional space
            SVM only: gamma = 0.5 | (0, DOUBLE_MAX_VALUE] | Parameter of the gaussian kernel
            SVM only: degree = 2 | [2, 9] | Parameter of the polynomial kernel
            SVM only: coef0 = 1.0 | [0, DOUBLE_MAX_VALUE] | Parameter of the polynomial kernel
            SVM only: lambda = 0.01 | (0, DOUBLE_MAX_VALUE) | Regularization parameter
            pca only: number_components | (0, INT_MAX_VALUE] | Target dimensionality of the reduction

          GD: multiclass
            classifier = "svm_classification" | svm_classification / logistic_regression | Classifier used for the multiclass task

          Kmeans
            max_iterations = 10 | [1, 10000] | Maximum number of iterations
            num_centroids = 10 | [1, 1000000] | Number of clusters
            tolerance = 0.00001 | (0, 1] | Centroid error tolerance
            batch_size = 10 | [1, 1048575] | Number of samples per training batch
            num_features = 2 | [1, INT_MAX_VALUE] | Number of input sample features
            distance_function = "L2_Squared" | L1 / L2 / L2_Squared / Linf | Distance function
            seeding_function = "Random++" | "Random++" / "KMeans||" | Seed-point initialization method
            verbose = 0U | { 0, 1, 2 } | Verbose mode
            seed = 0U | [0, INT_MAX_VALUE] | Seed

          xgboost: xgboost_regression_logistic, xgboost_binary_logistic, xgboost_regression_gamma, xgboost_regression_squarederror
            n_iter = 10 | (0, 10000] | Number of iterations
            batch_size = 10000 | (0, 1048575] | Number of samples per training batch
            booster = "gbtree" | gbtree / gblinear / dart | Booster type
            tree_method = "auto" | auto / exact / approx / hist / gpu_hist | Tree construction algorithm (note: gpu_hist requires a GPU build of the corresponding library; otherwise DB4AI does not support this value)
            eval_metric = "rmse" | rmse / rmsle / map / mae / auc / aucpr | Evaluation metric for validation data
            seed = 0 | [0, 100] | Seed
            nthread = 1 | (0, MAX_MEMORY_LIMIT] | Concurrency
            max_depth = 5 | (0, MAX_MEMORY_LIMIT] | Maximum tree depth; only effective for tree boosters
            gamma = 0.0 | [0, 1] | Minimum loss reduction required to make a further partition on a leaf node
            eta = 0.3 | [0, 1] | Step-size shrinkage used in updates, to prevent overfitting
            min_child_weight = 1 | [0, INT_MAX_VALUE] | Minimum sum of instance weights required in a child node
            verbosity = 1 | 0 (silent) / 1 (warning) / 2 (info) / 3 (debug) | Verbosity of printed information

          MAX_MEMORY_LIMIT = maximum number of tuples that can be loaded in memory
          GS_MAX_COLS = maximum number of attributes in a single table

    • After the model is saved successfully, a creation-success message is returned:

      MODEL CREATED. PROCESSED x
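
    The TARGET expression and WITH hyperparameters described above can be combined freely. A hedged sketch (the table houses and its columns area, rooms, and price are hypothetical names used only for illustration):

    ```sql
    -- Hypothetical sketch: train a classifier on an expression target.
    -- Table houses and columns area/rooms/price are illustrative names.
    openGauss=# CREATE MODEL expensive_house_model USING logistic_regression
        FEATURES area, rooms TARGET price > 10000
        FROM houses WITH max_iterations=500, learning_rate=0.5;
    ```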
      
  3. View model information.

    When training completes, the model is stored in the system table gs_model_warehouse, where information about the model itself and the training process can be viewed.

    The detailed model description is stored in the system table in binary form; the user can view it with the function gs_explain_model, as follows:

    openGauss=# select * from gs_explain_model('iris_classification_model');
     DB4AI MODEL
    -------------------------------------------------------------
     Name: iris_classification_model
     Algorithm: xgboost_regression_logistic
     Query: CREATE MODEL iris_classification_model
     USING xgboost_regression_logistic
     FEATURES sepal_length, sepal_width,petal_length,petal_width
     TARGET target_type < 2
     FROM tb_iris_1
     WITH nthread=4, max_depth=8;
     Return type: Float64
     Pre-processing time: 0.000000
     Execution time: 0.001443
     Processed tuples: 78
     Discarded tuples: 0
     n_iter: 10
     batch_size: 10000
     max_depth: 8
     min_child_weight: 1
     gamma: 0.0000000000
     eta: 0.3000000000
     nthread: 4
     verbosity: 1
     seed: 0
     booster: gbtree
     tree_method: auto
     eval_metric: rmse
     rmse: 0.2648450136
     model size: 4613
    
  4. Use an existing model for inference.

    Use the "SELECT" and "PREDICT BY" keywords to complete an inference task with an existing model.

    Query syntax: SELECT … PREDICT BY … (FEATURES …) … FROM …;

    openGauss=# SELECT id, PREDICT BY iris_classification_model (FEATURES sepal_length,sepal_width,petal_length,petal_width) as "PREDICT" FROM tb_iris limit 3;
         
    id  | PREDICT
    -----+---------
      84 |       2
      85 |       0
      86 |       0
    (3 rows)
    

    For the same inference task, results from the same model are largely stable, and models trained with the same hyperparameters and training set are likewise stable. However, AI model training contains random components (the data distribution of each batch, stochastic gradient descent), so small differences in performance and results between different models are expected.

  5. View the execution plan.

    The explain statement analyzes the execution plan of model training or prediction for "CREATE MODEL" and "PREDICT BY". The EXPLAIN keyword can be followed directly by a CREATE MODEL / PREDICT BY statement (or clause), optionally with parameters. The supported parameters are as follows:

    Table 5  Parameters supported by EXPLAIN

      ANALYZE    Boolean; additionally prints run time, loop count, and other descriptive information
      VERBOSE    Boolean; controls whether training run-time information is output to the client
      COSTS      Boolean
      CPU        Boolean
      DETAIL     Boolean; unavailable
      NODES      Boolean; unavailable
      NUM_NODES  Boolean; unavailable
      BUFFERS    Boolean
      TIMING     Boolean
      PLAN       Boolean
      FORMAT     Optional output format: TEXT / XML / JSON / YAML

    Example :

    openGauss=# Explain CREATE MODEL patient_logisitic_regression USING logistic_regression FEATURES second_attack, treatment TARGET trait_anxiety > 50 FROM patients WITH batch_size=10, learning_rate = 0.05;
                                   QUERY PLAN
    -------------------------------------------------------------------------
     Train Model - logistic_regression  (cost=0.00..0.00 rows=0 width=0)
       ->  Materialize  (cost=0.00..41.08 rows=1776 width=12)
             ->  Seq Scan on patients  (cost=0.00..32.20 rows=1776 width=12)
    (3 rows)
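
    The parameters in Table 5 are passed in parentheses after EXPLAIN, as in ordinary openGauss EXPLAIN syntax. A hedged sketch follows (the model name patient_lr_tmp is hypothetical, and the exact option set accepted around CREATE MODEL may vary by version; note that with ANALYZE the statement is actually executed, so the model is trained and saved):

    ```sql
    -- Hedged sketch: combining EXPLAIN options with model training
    -- (same patients table as the example above; patient_lr_tmp is illustrative).
    openGauss=# EXPLAIN (ANALYZE on, VERBOSE on, FORMAT text)
        CREATE MODEL patient_lr_tmp USING logistic_regression
        FEATURES second_attack, treatment TARGET trait_anxiety > 50
        FROM patients WITH batch_size=10, learning_rate=0.05;
    ```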
    
  6. Abnormal scenarios.

    • Training phase.

      • Scenario 1: when a hyperparameter is set outside its value range, model training fails and ERROR is returned with the cause, for example:

        openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment TARGET trait_anxiety  FROM patients WITH optimizer='aa';
        ERROR:  Invalid hyperparameter value for optimizer. Valid values are: gd, ngd.
        
      • Scenario 2: when the model name already exists, saving the model fails and ERROR is returned with the cause, for example:

        openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment TARGET trait_anxiety  FROM patients;
        ERROR:  The model name "patient_linear_regression" already exists in gs_model_warehouse.
        
      • Scenario 3: when the FEATURES or TARGET clause is *, ERROR is returned with the cause, for example:

        openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES *  TARGET trait_anxiety  FROM patients;
        ERROR:  FEATURES clause cannot be *
        -----------------------------------------------------------------------------------------------------------------------
        openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment TARGET *  FROM patients;
        ERROR:  TARGET clause cannot be *
        
      • Scenario 4: using the TARGET keyword with an unsupervised learning method, or omitting a required clause with a supervised learning method, returns ERROR with the cause, for example:

        openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment FROM patients;
        ERROR:  Supervised ML algorithms require TARGET clause
        -----------------------------------------------------------------------------------------------------------------------------
        CREATE MODEL patient_linear_regression USING linear_regression TARGET trait_anxiety  FROM patients;   
        ERROR:  Supervised ML algorithms require FEATURES clause
        
      • Scenario 5: when a classification task is performed but the TARGET column contains only one category, ERROR is returned with the cause, for example:

        openGauss=# CREATE MODEL ecoli_svmc USING multiclass FEATURES f1, f2, f3, f4, f5, f6, f7 TARGET cat FROM (SELECT * FROM db4ai_ecoli WHERE cat='cp');
        ERROR:  At least two categories are needed
        
      • Scenario 6: DB4AI filters out rows with null values during training; when the data left for training is empty, ERROR is returned with the cause, for example:

        openGauss=# create model iris_classification_model using xgboost_regression_logistic features message_regular target error_level from error_code;
        ERROR:  Training data is empty, please check the input data.
        
      • Scenario 7: DB4AI restricts the supported data types. When a data type is not on the support whitelist, ERROR is returned reporting the illegal oid; the illegal data type can be identified by looking the OID up in pg_type, for example:

        openGauss=# CREATE MODEL ecoli_svmc USING multiclass FEATURES f1, f2, f3, f4, f5, f6, f7, cat TARGET cat FROM db4ai_ecoli ;
        ERROR:  Oid type 1043 not yet supported
        
      • Scenario 8: when the GUC parameter statement_timeout is set, a CREATE MODEL statement that runs past the timeout is terminated. The training-set size, the number of training iterations, the early-stop conditions (tolerance, max_seconds), the number of parallel threads (nthread), and other parameters all affect the training duration; when it exceeds the database limit, the statement is terminated and model training fails.
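
        As a hedged sketch of scenario 8 (the timeout value is illustrative and the exact cancellation message depends on the database version; the patients table is the one used above):

        ```sql
        -- Hypothetical sketch: force a short statement timeout, then train.
        openGauss=# SET statement_timeout = '3s';
        openGauss=# CREATE MODEL patient_lr_slow USING logistic_regression
            FEATURES second_attack, treatment TARGET trait_anxiety > 50
            FROM patients WITH max_iterations=10000;
        -- If training exceeds 3 seconds the statement is cancelled
        -- and no model is saved.
        ```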

    • Model parsing phase.

      • Scenario 9: when the model name cannot be found in the system table, the database reports ERROR, for example:

        openGauss=# select gs_explain_model("ecoli_svmc");
        ERROR:  column "ecoli_svmc" does not exist
        
    • Inference phase.

      • Scenario 10: when the model name cannot be found in the system table, the database reports ERROR, for example:

        openGauss=# select id, PREDICT BY patient_logistic_regression (FEATURES second_attack,treatment) FROM patients;
        ERROR:  There is no model called "patient_logistic_regression".
        
      • Scenario 11: when the data dimensions or data types of the FEATURES used for inference are inconsistent with the training set, ERROR is returned with the cause, for example:

        openGauss=# select id, PREDICT BY patient_linear_regression (FEATURES second_attack) FROM patients;
        ERROR:  Invalid number of features for prediction, provided 1, expected 2
        CONTEXT:  referenced column: patient_linear_regression_pred
        -------------------------------------------------------------------------------------------------------------------------------------
        openGauss=# select id, PREDICT BY patient_linear_regression (FEATURES 1,second_attack,treatment) FROM patients;
        ERROR:  Invalid number of features for prediction, provided 3, expected 2
        CONTEXT:  referenced column: patient_linear_regression_pre
        

Note: the DB4AI feature needs to read data to participate in computation, so it is not applicable to scenarios such as fully-encrypted databases.

II. Full-Process AI

Conventional AI tasks usually involve multiple stages: data collection includes gathering, cleaning, and storing data, while algorithm training includes data preprocessing, training, and model saving and management. The model-training stage itself also covers hyperparameter optimization. The whole lifecycle of such machine learning models can largely be integrated into the database: training, management, and optimization are performed at the place closest to where the data is stored, and the database side provides declarative, out-of-the-box SQL functions for full-lifecycle AI management. We call this full-process AI.

openGauss implements part of the full-process AI functionality, which this chapter covers in detail.

  • PLPython fenced mode

  • DB4AI-Snapshots data version management

①PLPython fenced mode

plpython is added as an untrusted language in fenced mode. When compiling the database, Python must be integrated into it by adding the --with-python option at the configure stage. The installation can also specify the Python path used by plpython by adding the option --with-includes='/python-dir=path'.

Before starting the database, configure the GUC parameter unix_socket_directory to specify the file address used for unix_socket inter-process communication. The user must create the folder user-set-dir-path in advance and change its permissions to readable, writable, and executable.

unix_socket_directory = '/user-set-dir-path'

After the configuration is complete, start the database.

After plpython has been compiled into the database and the GUC parameter unix_socket_directory has been set, the fenced-Master process is created automatically when the database starts. If the database was not compiled with Python, the master process of fenced mode must be pulled up manually: after setting the GUC parameter, enter the command that creates the master process.

The command to start the fenced-Master process is:

gaussdb --fenced -k /user-set-dir-path -D /user-set-dir-path &

Once fenced mode is configured, the database executes the computation of plpython-fenced UDFs in a fenced-worker process.

Instructions for use

  • Create the extension

    • When plpython is compiled with Python 2:

      openGauss=# create extension plpythonu;
      CREATE EXTENSION
      
    • When plpython is compiled with Python 3:

      openGauss=# create extension plpython3u;
      CREATE EXTENSION
      

    The following examples use Python 2.

  • Create a plpython-fenced UDF

    openGauss=# create or replace function pymax(a int, b int)
    openGauss-# returns INT
    openGauss-# language plpythonu fenced
    openGauss-# as $$
    openGauss$# import numpy
    openGauss$# if a > b:
    openGauss$#     return a;
    openGauss$# else:
    openGauss$#     return b;
    openGauss$# $$;
    CREATE FUNCTION
    
  • View the UDF information

    openGauss=# select * from pg_proc where proname='pymax';
    -[ RECORD 1 ]----+--------------
    proname          | pymax
    pronamespace     | 2200
    proowner         | 10
    prolang          | 16388
    procost          | 100
    prorows          | 0
    provariadic      | 0
    protransform     | -
    proisagg         | f
    proiswindow      | f
    prosecdef        | f
    proleakproof     | f
    proisstrict      | f
    proretset        | f
    provolatile      | v
    pronargs         | 2
    pronargdefaults  | 0
    prorettype       | 23
    proargtypes      | 23 23
    proallargtypes   |
    proargmodes      |
    proargnames      | {a,b}
    proargdefaults   |
    prosrc           |
                     | import numpy
                     | if a > b:
                     |     return a;
                     | else:
                     |     return b;
                     |
    probin           |
    proconfig        |
    proacl           |
    prodefaultargpos |
    fencedmode       | t
    proshippable     | f
    propackage       | f
    prokind          | f
    proargsrc        |
    
  • Run the UDF

    • Create a data table:

      openGauss=# create table temp (a int ,b int) ;
      CREATE TABLE
      openGauss=# insert into temp values (1,2),(2,3),(3,4),(4,5),(5,6);
      INSERT 0 5
      
    • Run the UDF:

      openGauss=# select pymax(a,b) from temp;
       pymax
      -------
           2
           3
           4
           5
           6
      (5 rows)

②DB4AI-Snapshots data version management

DB4AI-Snapshots is the DB4AI module's facility for managing dataset versions. With the DB4AI-Snapshots component, developers can perform data preprocessing operations such as feature filtering and type conversion simply and quickly, and can also version-control training datasets much as git does. After a table snapshot is created, it can be used like a view; once published, however, it is frozen into immutable static data. To modify the contents of a table snapshot, a new snapshot with a different version number must be created.

Lifecycle of DB4AI-Snapshots

The states of a DB4AI snapshot are published, archived, and purged. published marks a snapshot as released and ready for use. archived means the snapshot is in its "archival period": it is generally no longer used to train new models; instead, its older data is used to validate new models. purged means the snapshot has been deleted and can no longer be retrieved from the database system.

Note that the snapshot management function exists to provide users with uniform training data, so that different team members can retrain machine learning models on the same given data, which eases collaboration. Therefore, scenarios that do not allow user data to be shared, such as private users and separation-of-duty mode (enableSeparationOfDuty=ON), do not support the snapshot feature.

A user creates a table snapshot with the "CREATE SNAPSHOT" statement; a newly created snapshot is in the published state by default. There are two modes for creating table snapshots, MSS and CSS, configured through the GUC parameter db4ai_snapshot_mode. MSS mode is implemented with a materialization algorithm and stores the data entities of the original dataset; CSS is based on a relative (computed) algorithm and stores incremental data. The metadata of table snapshots is stored in the DB4AI system catalog and can be viewed through the db4ai.snapshot system table.
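
A hedged sketch of choosing the snapshot mode described above (the values MSS/CSS follow this paragraph; whether the parameter can be changed at session level may depend on the installation):

```sql
-- Assumed sketch: choose the materialized snapshot mode before creating snapshots.
openGauss=# SET db4ai_snapshot_mode = 'MSS';
-- Inspect the metadata of existing snapshots.
openGauss=# SELECT schema, name, published, archived FROM db4ai.snapshot;
```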

A table snapshot can be marked as archived with the "ARCHIVE SNAPSHOT" statement, and marked published again with the "PUBLISH SNAPSHOT" statement. Marking snapshot states is designed to help data scientists collaborate as a team.

When a table snapshot has outlived its value, it can be deleted with the "PURGE SNAPSHOT" statement to permanently remove its data and reclaim storage space.

DB4AI-Snapshots instructions for use

  1. Create a table and insert data.

    If a data table already exists in the database, a snapshot can be created from it. For the demonstrations below, create a new table named t1 and insert test data into it.

    create table t1 (id int, name varchar);
    insert into t1 values (1, 'zhangsan');
    insert into t1 values (2, 'lisi');
    insert into t1 values (3, 'wangwu');
    insert into t1 values (4, 'lisa');
    insert into t1 values (5, 'jack');
    

    Query the contents of the table with an SQL statement.

    SELECT * FROM t1;
    id |   name
    ----+----------
      1 | zhangsan
      2 | lisi
      3 | wangwu
      4 | lisa
      5 | jack
    (5 rows)
    
  2. Use DB4AI-Snapshots.

    • Create DB4AI-Snapshots

      • Example 1: CREATE SNAPSHOT…AS

        An example follows. The default version delimiter is "@" and the default subversion separator is ".", configurable through the GUC parameters db4ai_snapshot_version_delimiter and db4ai_snapshot_version_separator respectively.

        create snapshot s1@1.0 comment is 'first version' as select * from t1;
        schema |  name
        --------+--------
         public | s1@1.0
        (1 row)
        

        The result above indicates that snapshot s1 of the data table has been created, with version number 1.0. A created table snapshot can be queried like an ordinary view, but cannot be updated with "INSERT INTO". For example, the following statements all query the contents of version 1.0 of table snapshot s1:

        SELECT * FROM s1@1.0;
        SELECT * FROM public.s1@1.0;
        SELECT * FROM public . s1 @ 1.0;
        id |   name
        ----+----------
          1 | zhangsan
          2 | lisi
          3 | wangwu
          4 | lisa
          5 | jack
        (5 rows)
        

        You can modify the contents of table t1 with the following SQL statements:

        UPDATE t1 SET name = 'tom' where id = 4;
        insert into t1 values (6, 'john');
        insert into t1 values (7, 'tim');
        

        If table t1 is then queried again, it turns out that although the contents of t1 have changed, the query results of the snapshot version s1@1.0 have not. Since the data of t1 has changed, if the current contents of the table should become version 2.0, the snapshot s1@2.0 can be created with the following SQL statement:

        create snapshot s1@2.0 as select * from t1;
        

        The example above shows that a table snapshot freezes the contents of a table, avoiding instability in machine-learning model training caused by data changing midway, as well as lock conflicts caused by multiple users accessing and modifying the same table at the same time.

      • Example 2: CREATE SNAPSHOT…FROM

        An SQL statement can inherit from an already created table snapshot and generate a new snapshot from data modifications applied on top of it. For example:

        create snapshot s1@1.1 from @1.0 comment is 'inherits from @1.0' using (INSERT VALUES(6, 'john'), (7, 'tim'); DELETE WHERE id = 1);
        schema |  name
        --------+--------
         public | s1@1.1
        (1 row)
        

        Here, "@" is the version delimiter of the table snapshot; the from clause is followed by an existing table snapshot, written as "@" plus the version number; and the USING keyword is followed by optional operation keywords (INSERT …/UPDATE …/DELETE …/ALTER …), where the "INTO" and "FROM" clauses of "INSERT INTO" and "DELETE FROM" that refer to the snapshot name can be omitted. For details, see the AI feature functions.

        In this example, based on the foregoing snapshot s1@1.0, 2 rows are inserted and 1 row is deleted, generating the new snapshot s1@1.1. Retrieve s1@1.1:

        SELECT * FROM s1@1.1;
        id |   name
        ----+----------
          2 | lisi
          3 | wangwu
          4 | lisa
          5 | jack
          6 | john
          7 | tim
        (6 rows)
        
    • Delete a table snapshot

      purge snapshot s1@1.1;
      schema |  name
      --------+--------
       public | s1@1.1
      (1 row)
      

      At this point, data can no longer be retrieved from s1@1.1, and the snapshot's record in the db4ai.snapshot view is cleared as well. Deleting this version of the table snapshot does not affect snapshots of other versions.

    • Sample from a table snapshot

      Example: extract data from snapshot s1@2.0 with a sampling rate of 0.5.

      sample snapshot s1@2.0 stratify by name as nick at ratio .5;
      schema |    name
      --------+------------
       public | s1nick@2.0
      (1 row)
      

      This function can be used to create training and test sets, for example:

      SAMPLE SNAPSHOT s1@2.0 STRATIFY BY name AS _test AT RATIO .2, AS _train AT RATIO .8 COMMENT IS 'training';
      schema |      name
      --------+----------------
       public | s1_test@2.0
       public | s1_train@2.0
      (2 rows)
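A rough Python sketch of what stratified splitting at ratios .2/.8 does — each stratum (here, each distinct name value) contributes proportionally to every split. The function and data are illustrative; openGauss performs the sampling inside the database:

```python
# Illustrative stratified split: group rows by a key, then cut each group
# according to the requested ratios so every split preserves the stratum mix.
import random
from collections import defaultdict

def stratified_split(rows, key, ratios):
    """Split rows into len(ratios) groups, stratum by stratum."""
    strata = defaultdict(list)
    for r in rows:
        strata[key(r)].append(r)
    splits = [[] for _ in ratios]
    rng = random.Random(42)  # fixed seed so the split is reproducible
    for group in strata.values():
        rng.shuffle(group)
        start = 0.0
        for i, ratio in enumerate(ratios):
            end = start + ratio * len(group)
            splits[i].extend(group[int(round(start)):int(round(end))])
            start = end
    return splits

# 100 rows, two equally sized strata ("a" and "b")
rows = [{"id": i, "name": "a" if i % 2 else "b"} for i in range(100)]
test, train = stratified_split(rows, key=lambda r: r["name"], ratios=[0.2, 0.8])
print(len(test), len(train))  # 20 80
```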
      
    • Publish data table snapshot

      Use the following SQL statement to mark the data table snapshot s1@2.0 as published:

      publish snapshot s1@2.0;
      schema |  name
      --------+--------
       public | s1@2.0
      (1 row)
      
    • Archive data table snapshot

      The following statement marks a data table snapshot as archived:

      archive snapshot s1@2.0;
      schema |  name
      --------+--------
       public | s1@2.0
      (1 row)
      

      The db4ai.snapshot view shows the current state of data table snapshots along with other information:

      select * from db4ai.snapshot;
      id | parent_id | matrix_id | root_id | schema |    name    | owner |                 commands                 | comment | published | archived |          created           | row_count
      ----+-----------+-----------+---------+--------+------------+-------+------------------------------------------+---------+-----------+----------+----------------------------+-----------
        1 |           |           |       1 | public | t1@1.0     | omm   | {"select *","from t1 where id > 3",NULL} |         | t         | f        | 2021-04-17 09:24:11.139868 |         2
        2 |         1 |           |       1 | public | t1nick@1.0 | omm   | {"SAMPLE nick .5 {name}"}                |         | f         | f        | 2021-04-17 10:02:31.73923  |         0
      
  3. Exception scenarios

    • When the data table snapshot, or its record under db4ai.snapshot, does not exist:

      purge snapshot s1@2.0;
      publish snapshot s1@2.0;
      ---------
      ERROR:  snapshot public."s1@2.0" does not exist
      CONTEXT:  PL/pgSQL function db4ai.publish_snapshot(name,name) line 11 at assignment
               
      archive snapshot s1@2.0;
      ----------
      ERROR:  snapshot public."s1@2.0" does not exist
      CONTEXT:  PL/pgSQL function db4ai.archive_snapshot(name,name) line 11 at assignment
      
    • When purging a snapshot that other snapshots depend on, the dependent snapshots must be purged first.

      purge snapshot t1@1.0;
      ERROR:  cannot purge root snapshot 'public."t1@1.0"' having dependent snapshots
      HINT:  purge all dependent snapshots first
      CONTEXT:  referenced column: purge_snapshot_internal
      SQL statement "SELECT db4ai.purge_snapshot_internal(i_schema, i_name)"
      PL/pgSQL function db4ai.purge_snapshot(name,name) line 71 at PERFORM
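The dependency rule behind this error can be sketched as a simple parent-child check over the snapshot metadata. The structures below mirror the `id`/`parent_id` columns of the db4ai.snapshot view but are illustrative, not the internal implementation:

```python
# Illustrative purge check: a snapshot is purgeable only if no other
# snapshot lists it as its parent. Field names follow the db4ai.snapshot
# view; the check itself is a sketch, not openGauss code.
snapshots = [
    {"id": 1, "parent_id": None, "name": "t1@1.0"},
    {"id": 2, "parent_id": 1,    "name": "t1nick@1.0"},
]

def can_purge(snap_id, snapshots):
    """Return True when no snapshot depends on snap_id as its parent."""
    return not any(s["parent_id"] == snap_id for s in snapshots)

print(can_purge(1, snapshots))  # False: t1nick@1.0 depends on t1@1.0
print(can_purge(2, snapshots))  # True: the leaf snapshot can be purged first
```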
      
  4. Related GUC parameters

    • db4ai_snapshot_mode:

      Snapshots support two modes: MSS (materialized mode, which stores full data entities) and CSS (computed mode, which stores only incremental information). A snapshot can be switched between the MSS and CSS modes; the default is MSS.
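The difference between the two modes can be illustrated with a small sketch: MSS materializes the child snapshot's full row set, while CSS stores only the delta against the parent and computes the rows on read. This shows the idea only, not the actual openGauss storage format:

```python
# Illustrative contrast between MSS (materialized copy) and CSS (stored
# delta). The dictionaries and helper are hypothetical, chosen to make the
# two modes' equivalence easy to verify.
parent = {1: "zhangsan", 2: "lisi", 3: "wangwu"}  # made-up parent rows

# MSS: the child snapshot materializes the full row set after modification.
mss_child = dict(parent)
mss_child.pop(1)            # DELETE WHERE id = 1
mss_child[4] = "lisa"       # INSERT

# CSS: the child stores only incremental commands; rows are derived on read.
css_delta = {"deleted": {1}, "inserted": {4: "lisa"}}

def css_read(parent, delta):
    """Reconstruct the child's rows from the parent plus the stored delta."""
    rows = {k: v for k, v in parent.items() if k not in delta["deleted"]}
    rows.update(delta["inserted"])
    return rows

# Both modes expose the same logical snapshot; they differ only in storage.
print(sorted(css_read(parent, css_delta)))  # [2, 3, 4]
```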

    • db4ai_snapshot_version_delimiter:

      This parameter sets the version delimiter in data table snapshot names. The default is "@".

    • db4ai_snapshot_version_separator:

      This parameter sets the subversion separator in data table snapshot names. The default is ".".
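Taken together, the two delimiters determine how a snapshot name such as s1@2.0 decomposes into table name, version, and subversion. A sketch of that parsing (illustrative only; the function is not part of the openGauss API):

```python
# Hypothetical parser for snapshot names, using the default delimiters:
# '@' (db4ai_snapshot_version_delimiter) separates table from version,
# '.' (db4ai_snapshot_version_separator) separates version from subversion.

def parse_snapshot_name(name, version_delim="@", subversion_sep="."):
    table, _, version = name.partition(version_delim)
    major, _, minor = version.partition(subversion_sep)
    return table, major, minor

print(parse_snapshot_name("s1@2.0"))  # ('s1', '2', '0')
```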

  5. Details of data table snapshots under the DB4AI schema are recorded in db4ai.snapshot.

    openGauss=# \d db4ai.snapshot
                           Table "db4ai.snapshot"
      Column   |            Type             |         Modifiers
    -----------+-----------------------------+---------------------------
     id        | bigint                      |
     parent_id | bigint                      |
     matrix_id | bigint                      |
     root_id   | bigint                      |
     schema    | name                        | not null
     name      | name                        | not null
     owner     | name                        | not null
     commands  | text[]                      | not null
     comment   | text                        |
     published | boolean                     | not null default false
     archived  | boolean                     | not null default false
     created   | timestamp without time zone | default pg_systimestamp()
     row_count | bigint                      | not null
    Indexes:
        "snapshot_pkey" PRIMARY KEY, btree (schema, name) TABLESPACE pg_default
        "snapshot_id_key" UNIQUE CONSTRAINT, btree (id) TABLESPACE pg_default
    

  Note: The DB4AI namespace is the private domain of this feature. Creating functional indexes in the DB4AI namespace is not supported.



Copyright notice: this article was created by [Gauss squirrel Club]. Please include the original link when reposting:
https://yzsam.com/2022/162/202206111535103670.html