
DB4AI: Database-Driven AI

2022-06-11 15:48:00  Gauss Squirrel Club

DB4AI refers to using the database to drive AI tasks, realizing unified data storage and a homogeneous technology stack. By integrating AI algorithms into the database, openGauss gains native capabilities for an AI compute engine, model management, AI operators, and AI execution plans, providing users with inclusive AI technology. Unlike the traditional AI modeling process, DB4AI's "one-stop" modeling eliminates the repeated movement of data across platforms, simplifies the development workflow, and lets the database plan the optimal execution path, so developers can focus on tuning their business logic and models. This gives it usability and performance advantages that comparable products lack.

I. Native DB4AI Engine

II. Full-Process AI

I. Native DB4AI Engine

The current version of openGauss supports native DB4AI capability. By introducing native AI operators, it simplifies the workflow and makes full use of the database optimizer and executor to deliver high-performance in-database model training. The streamlined training and prediction flow and the higher performance let developers focus on model tuning and data analysis in less time, avoiding fragmented technology stacks and redundant code.

Keyword parsing

Table 1  DB4AI syntax and keywords

  Syntax:
    CREATE MODEL   Creates and trains a model, then saves it.
    PREDICT BY     Performs inference with an existing model.
    DROP MODEL     Deletes a model.

  Keywords:
    TARGET         Target column name(s) of the training/inference task.
    FEATURES       Feature column name(s) of the training/inference task.
    MODEL          Model name of the training task.
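As a hedged sketch of how the three statements in Table 1 fit together (the table tb_iris_1 and its columns are carried over from the examples below; the model name iris_demo is an illustrative assumption, and DROP MODEL is shown here because no later example uses it):

```sql
-- Minimal end-to-end sketch (assumes table tb_iris_1 with the iris columns).
-- 1. Create, train, and save a model.
CREATE MODEL iris_demo USING logistic_regression
    FEATURES sepal_length, sepal_width TARGET target_type < 2
    FROM tb_iris_1;

-- 2. Run inference with the saved model.
SELECT id, PREDICT BY iris_demo (FEATURES sepal_length, sepal_width)
    FROM tb_iris_1;

-- 3. Delete the model when it is no longer needed.
DROP MODEL iris_demo;
```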

Instructions for use

  1. Overview of algorithms supported in this version.

    The algorithms newly supported by DB4AI in the current version are as follows:

    Table 2  Supported algorithms

      Optimizer   Algorithms
      GD          logistic_regression, linear_regression, svm_classification, pca, multiclass
      Kmeans      kmeans
      xgboost     xgboost_regression_logistic, xgboost_binary_logistic, xgboost_regression_squarederror, xgboost_regression_gamma

  2. Model training syntax description.

    • CREATE MODEL

      The "CREATE MODEL" statement creates, trains, and saves a model. The model-training SQL statements below use the public iris dataset.

    • Train a model: from the training set tb_iris_1, specify sepal_length, sepal_width, petal_length, and petal_width as the feature columns, use the xgboost_regression_logistic algorithm, and create and save the model iris_classification_model.

      openGauss=# CREATE MODEL iris_classification_model USING xgboost_regression_logistic FEATURES sepal_length, sepal_width,petal_length,petal_width TARGET target_type < 2 FROM tb_iris_1 WITH nthread=4, max_depth=8;
      MODEL CREATED. PROCESSED 1
      

      In the above command:

      • The "CREATE MODEL" statement trains and saves the model.
      • The USING keyword specifies the algorithm name.
      • FEATURES specifies the features used to train the model; add them according to the column names of the training table.
      • TARGET specifies the training target of the model. It can be the name of a column in the training table, or an expression, for example: price > 10000.
      • WITH specifies the hyperparameters used for training. When a hyperparameter is not set by the user, the framework uses its default value.

        For different operators, the framework supports different hyperparameter combinations:

        Table 3  Hyperparameters supported by each operator

          GD (logistic_regression, linear_regression, svm_classification):
            optimizer(char); verbose(bool); max_iterations(int); max_seconds(double); batch_size(int); learning_rate(double); decay(double); tolerance(double)
            SVM additionally supports the hyperparameter lambda(double).

          Kmeans:
            max_iterations(int); num_centroids(int); tolerance(double); batch_size(int); num_features(int); distance_function(char); seeding_function(char); verbose(int); seed(int)

          GD (pca):
            batch_size(int); max_iterations(int); max_seconds(int); tolerance(float8); verbose(bool); number_components(int); seed(int)

          GD (multiclass):
            classifier(char)
            Note: the other hyperparameters of multiclass depend on the classifier selected.

          xgboost_regression_logistic, xgboost_binary_logistic, xgboost_regression_squarederror, xgboost_regression_gamma:
            batch_size(int); booster(char); tree_method(char); eval_metric(char*); seed(int); nthread(int); max_depth(int); gamma(float8); eta(float8); min_child_weight(int); verbosity(int)

        The defaults and value ranges of the hyperparameters are as follows:

        Table 4  Hyperparameter defaults and value ranges
        (format: hyperparameter = default | value range | description)

          GD: logistic_regression, linear_regression, svm_classification, pca
            optimizer = gd (gradient descent) | gd / ngd (natural gradient descent) | Optimizer
            verbose = false | T/F | Log output
            max_iterations = 100 | (0, 10000] | Maximum number of iterations
            max_seconds = 0 (no limit on run time) | [0, INT_MAX_VALUE] | Run-time limit
            batch_size = 1000 | (0, 1048575] | Number of samples per training batch
            learning_rate = 0.8 | (0, DOUBLE_MAX_VALUE] | Learning rate
            decay = 0.95 | (0, DOUBLE_MAX_VALUE] | Weight decay rate
            tolerance = 0.0005 | (0, DOUBLE_MAX_VALUE] | Convergence tolerance
            seed = 0 (0 means a random seed) | [0, INT_MAX_VALUE] | Seed
            SVM only: kernel = "linear" | linear / gaussian / polynomial | Kernel function
            SVM only: components = MAX(2*features, 128) | [0, INT_MAX_VALUE] | Dimension of the high-dimensional space
            SVM only: gamma = 0.5 | (0, DOUBLE_MAX_VALUE] | Parameter of the gaussian kernel
            SVM only: degree = 2 | [2, 9] | Parameter of the polynomial kernel
            SVM only: coef0 = 1.0 | [0, DOUBLE_MAX_VALUE] | Parameter of the polynomial kernel
            SVM only: lambda = 0.01 | (0, DOUBLE_MAX_VALUE) | Regularization parameter
            pca only: number_components | (0, INT_MAX_VALUE] | Target dimensionality of the reduction

          GD: multiclass
            classifier = "svm_classification" | svm_classification / logistic_regression | Classifier used for the multiclass task

          Kmeans
            max_iterations = 10 | [1, 10000] | Maximum number of iterations
            num_centroids = 10 | [1, 1000000] | Number of clusters
            tolerance = 0.00001 | (0, 1] | Centroid error tolerance
            batch_size = 10 | [1, 1048575] | Number of samples per training batch
            num_features = 2 | [1, INT_MAX_VALUE] | Number of input sample features
            distance_function = "L2_Squared" | L1 / L2 / L2_Squared / Linf | Distance function
            seeding_function = "Random++" | "Random++" / "KMeans||" | Seed-point initialization method
            verbose = 0U | { 0, 1, 2 } | Verbose mode
            seed = 0U | [0, INT_MAX_VALUE] | Seed

          xgboost: xgboost_regression_logistic, xgboost_binary_logistic, xgboost_regression_gamma, xgboost_regression_squarederror
            n_iter = 10 | (0, 10000] | Number of iterations
            batch_size = 10000 | (0, 1048575] | Number of samples per training batch
            booster = "gbtree" | gbtree / gblinear / dart | Booster type
            tree_method = "auto" | auto / exact / approx / hist / gpu_hist | Tree construction algorithm (note: gpu_hist requires a GPU build of the corresponding library; otherwise DB4AI does not support this value)
            eval_metric = "rmse" | rmse / rmsle / map / mae / auc / aucpr | Evaluation metric for validation data
            seed = 0 | [0, 100] | Seed
            nthread = 1 | (0, MAX_MEMORY_LIMIT] | Concurrency
            max_depth = 5 | (0, MAX_MEMORY_LIMIT] | Maximum tree depth; only effective for tree boosters
            gamma = 0.0 | [0, 1] | Minimum loss reduction required to make a further partition on a leaf node
            eta = 0.3 | [0, 1] | Step-size shrinkage used in updates, to prevent overfitting
            min_child_weight = 1 | [0, INT_MAX_VALUE] | Minimum sum of instance weights required in a child node
            verbosity = 1 | 0 (silent) / 1 (warning) / 2 (info) / 3 (debug) | Verbosity of printed information

          MAX_MEMORY_LIMIT = maximum number of tuples that can be loaded in memory
          GS_MAX_COLS = maximum number of attributes in a single table

    • After the model is saved successfully, a creation-success message is returned:

      MODEL CREATED. PROCESSED x
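
    The TARGET expression and WITH hyperparameters described above can be combined freely. A hedged sketch (the table houses and its columns area, rooms, and price are hypothetical names used only for illustration):

    ```sql
    -- Hypothetical sketch: train a classifier on an expression target.
    -- Table houses and columns area/rooms/price are illustrative names.
    openGauss=# CREATE MODEL expensive_house_model USING logistic_regression
        FEATURES area, rooms TARGET price > 10000
        FROM houses WITH max_iterations=500, learning_rate=0.5;
    ```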
      
  3. View model information.

    When training completes, the model is stored in the system table gs_model_warehouse, where information about the model itself and the training process can be viewed.

    The detailed model description is stored in the system table in binary form; the user can view it with the function gs_explain_model, as follows:

    openGauss=# select * from gs_explain_model('iris_classification_model');
     DB4AI MODEL
    -------------------------------------------------------------
     Name: iris_classification_model
     Algorithm: xgboost_regression_logistic
     Query: CREATE MODEL iris_classification_model
     USING xgboost_regression_logistic
     FEATURES sepal_length, sepal_width,petal_length,petal_width
     TARGET target_type < 2
     FROM tb_iris_1
     WITH nthread=4, max_depth=8;
     Return type: Float64
     Pre-processing time: 0.000000
     Execution time: 0.001443
     Processed tuples: 78
     Discarded tuples: 0
     n_iter: 10
     batch_size: 10000
     max_depth: 8
     min_child_weight: 1
     gamma: 0.0000000000
     eta: 0.3000000000
     nthread: 4
     verbosity: 1
     seed: 0
     booster: gbtree
     tree_method: auto
     eval_metric: rmse
     rmse: 0.2648450136
     model size: 4613
    
  4. Use an existing model for inference.

    Use the "SELECT" and "PREDICT BY" keywords to complete an inference task with an existing model.

    Query syntax: SELECT … PREDICT BY … (FEATURES …) … FROM …;

    openGauss=# SELECT id, PREDICT BY iris_classification_model (FEATURES sepal_length,sepal_width,petal_length,petal_width) as "PREDICT" FROM tb_iris limit 3;
         
    id  | PREDICT
    -----+---------
      84 |       2
      85 |       0
      86 |       0
    (3 rows)
    

    For the same inference task, results from the same model are largely stable, and models trained with the same hyperparameters and training set are likewise stable. However, AI model training contains random components (the data distribution of each batch, stochastic gradient descent), so small differences in performance and results between different models are expected.

  5. View the execution plan.

    The explain statement analyzes the execution plan of model training or prediction for "CREATE MODEL" and "PREDICT BY". The EXPLAIN keyword can be followed directly by a CREATE MODEL / PREDICT BY statement (or clause), optionally with parameters. The supported parameters are as follows:

    Table 5  Parameters supported by EXPLAIN

      ANALYZE    Boolean; additionally prints run time, loop count, and other descriptive information
      VERBOSE    Boolean; controls whether training run-time information is output to the client
      COSTS      Boolean
      CPU        Boolean
      DETAIL     Boolean; unavailable
      NODES      Boolean; unavailable
      NUM_NODES  Boolean; unavailable
      BUFFERS    Boolean
      TIMING     Boolean
      PLAN       Boolean
      FORMAT     Optional output format: TEXT / XML / JSON / YAML

    Example :

    openGauss=# Explain CREATE MODEL patient_logisitic_regression USING logistic_regression FEATURES second_attack, treatment TARGET trait_anxiety > 50 FROM patients WITH batch_size=10, learning_rate = 0.05;
                                   QUERY PLAN
    -------------------------------------------------------------------------
     Train Model - logistic_regression  (cost=0.00..0.00 rows=0 width=0)
       ->  Materialize  (cost=0.00..41.08 rows=1776 width=12)
             ->  Seq Scan on patients  (cost=0.00..32.20 rows=1776 width=12)
    (3 rows)
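
    The parameters in Table 5 are passed in parentheses after EXPLAIN, as in ordinary openGauss EXPLAIN syntax. A hedged sketch follows (the model name patient_lr_tmp is hypothetical, and the exact option set accepted around CREATE MODEL may vary by version; note that with ANALYZE the statement is actually executed, so the model is trained and saved):

    ```sql
    -- Hedged sketch: combining EXPLAIN options with model training
    -- (same patients table as the example above; patient_lr_tmp is illustrative).
    openGauss=# EXPLAIN (ANALYZE on, VERBOSE on, FORMAT text)
        CREATE MODEL patient_lr_tmp USING logistic_regression
        FEATURES second_attack, treatment TARGET trait_anxiety > 50
        FROM patients WITH batch_size=10, learning_rate=0.05;
    ```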
    
  6. Abnormal scenarios.

    • Training phase.

      • Scenario 1: when a hyperparameter is set outside its value range, model training fails and ERROR is returned with the cause, for example:

        openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment TARGET trait_anxiety  FROM patients WITH optimizer='aa';
        ERROR:  Invalid hyperparameter value for optimizer. Valid values are: gd, ngd.
        
      • Scenario 2: when the model name already exists, saving the model fails and ERROR is returned with the cause, for example:

        openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment TARGET trait_anxiety  FROM patients;
        ERROR:  The model name "patient_linear_regression" already exists in gs_model_warehouse.
        
      • Scenario 3: when the FEATURES or TARGET clause is *, ERROR is returned with the cause, for example:

        openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES *  TARGET trait_anxiety  FROM patients;
        ERROR:  FEATURES clause cannot be *
        -----------------------------------------------------------------------------------------------------------------------
        openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment TARGET *  FROM patients;
        ERROR:  TARGET clause cannot be *
        
      • Scenario 4: using the TARGET keyword with an unsupervised learning method, or omitting a required clause with a supervised learning method, returns ERROR with the cause, for example:

        openGauss=# CREATE MODEL patient_linear_regression USING linear_regression FEATURES second_attack,treatment FROM patients;
        ERROR:  Supervised ML algorithms require TARGET clause
        -----------------------------------------------------------------------------------------------------------------------------
        CREATE MODEL patient_linear_regression USING linear_regression TARGET trait_anxiety  FROM patients;   
        ERROR:  Supervised ML algorithms require FEATURES clause
        
      • Scenario 5: when a classification task is performed but the TARGET column contains only one category, ERROR is returned with the cause, for example:

        openGauss=# CREATE MODEL ecoli_svmc USING multiclass FEATURES f1, f2, f3, f4, f5, f6, f7 TARGET cat FROM (SELECT * FROM db4ai_ecoli WHERE cat='cp');
        ERROR:  At least two categories are needed
        
      • Scenario 6: DB4AI filters out rows with null values during training; when the data left for training is empty, ERROR is returned with the cause, for example:

        openGauss=# create model iris_classification_model using xgboost_regression_logistic features message_regular target error_level from error_code;
        ERROR:  Training data is empty, please check the input data.
        
      • Scenario 7: DB4AI restricts the supported data types. When a data type is not on the support whitelist, ERROR is returned reporting the illegal oid; the illegal data type can be identified by looking the OID up in pg_type, for example:

        openGauss=# CREATE MODEL ecoli_svmc USING multiclass FEATURES f1, f2, f3, f4, f5, f6, f7, cat TARGET cat FROM db4ai_ecoli ;
        ERROR:  Oid type 1043 not yet supported
        
      • Scenario 8: when the GUC parameter statement_timeout is set, a CREATE MODEL statement that runs past the timeout is terminated. The training-set size, the number of training iterations, the early-stop conditions (tolerance, max_seconds), the number of parallel threads (nthread), and other parameters all affect the training duration; when it exceeds the database limit, the statement is terminated and model training fails.
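
        As a hedged sketch of scenario 8 (the timeout value is illustrative and the exact cancellation message depends on the database version; the patients table is the one used above):

        ```sql
        -- Hypothetical sketch: force a short statement timeout, then train.
        openGauss=# SET statement_timeout = '3s';
        openGauss=# CREATE MODEL patient_lr_slow USING logistic_regression
            FEATURES second_attack, treatment TARGET trait_anxiety > 50
            FROM patients WITH max_iterations=10000;
        -- If training exceeds 3 seconds the statement is cancelled
        -- and no model is saved.
        ```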

    • Model parsing phase.

      • Scenario 9: when the model name cannot be found in the system table, the database reports ERROR, for example:

        openGauss=# select gs_explain_model("ecoli_svmc");
        ERROR:  column "ecoli_svmc" does not exist
        
    • Inference phase.

      • Scenario 10: when the model name cannot be found in the system table, the database reports ERROR, for example:

        openGauss=# select id, PREDICT BY patient_logistic_regression (FEATURES second_attack,treatment) FROM patients;
        ERROR:  There is no model called "patient_logistic_regression".
        
      • Scenario 11: when the data dimensions or data types of the FEATURES used for inference are inconsistent with the training set, ERROR is returned with the cause, for example:

        openGauss=# select id, PREDICT BY patient_linear_regression (FEATURES second_attack) FROM patients;
        ERROR:  Invalid number of features for prediction, provided 1, expected 2
        CONTEXT:  referenced column: patient_linear_regression_pred
        -------------------------------------------------------------------------------------------------------------------------------------
        openGauss=# select id, PREDICT BY patient_linear_regression (FEATURES 1,second_attack,treatment) FROM patients;
        ERROR:  Invalid number of features for prediction, provided 3, expected 2
        CONTEXT:  referenced column: patient_linear_regression_pre
        

Note: the DB4AI feature needs to read data to participate in computation, so it is not applicable to scenarios such as fully-encrypted databases.

II. Full-Process AI

Conventional AI tasks usually involve multiple stages: data collection includes gathering, cleaning, and storing data, while algorithm training includes data preprocessing, training, and model saving and management. The model-training stage itself also covers hyperparameter optimization. The whole lifecycle of such machine learning models can largely be integrated into the database: training, management, and optimization are performed at the place closest to where the data is stored, and the database side provides declarative, out-of-the-box SQL functions for full-lifecycle AI management. We call this full-process AI.

openGauss implements part of the full-process AI functionality, which this chapter covers in detail.

  • PLPython fenced mode

  • DB4AI-Snapshots data version management

①PLPython fenced mode

plpython is added as an untrusted language in fenced mode. When compiling the database, Python must be integrated into it by adding the --with-python option at the configure stage. The installation can also specify the Python path used by plpython by adding the option --with-includes='/python-dir=path'.

Before starting the database, configure the GUC parameter unix_socket_directory to specify the file address used for unix_socket inter-process communication. The user must create the folder user-set-dir-path in advance and change its permissions to readable, writable, and executable.

unix_socket_directory = '/user-set-dir-path'

After the configuration is complete, start the database.

After plpython has been compiled into the database and the GUC parameter unix_socket_directory has been set, the fenced-Master process is created automatically when the database starts. If the database was not compiled with Python, the master process of fenced mode must be pulled up manually: after setting the GUC parameter, enter the command that creates the master process.

The command to start the fenced-Master process is:

gaussdb --fenced -k /user-set-dir-path -D /user-set-dir-path &

Once fenced mode is configured, the database executes the computation of plpython-fenced UDFs in a fenced-worker process.

Instructions for use

  • Create the extension

    • When plpython is compiled with Python 2:

      openGauss=# create extension plpythonu;
      CREATE EXTENSION
      
    • When plpython is compiled with Python 3:

      openGauss=# create extension plpython3u;
      CREATE EXTENSION
      

    The following examples use Python 2.

  • Create a plpython-fenced UDF

    openGauss=# create or replace function pymax(a int, b int)
    openGauss-# returns INT
    openGauss-# language plpythonu fenced
    openGauss-# as $$
    openGauss$# import numpy
    openGauss$# if a > b:
    openGauss$#     return a;
    openGauss$# else:
    openGauss$#     return b;
    openGauss$# $$;
    CREATE FUNCTION
    
  • View the UDF information

    openGauss=# select * from pg_proc where proname='pymax';
    -[ RECORD 1 ]----+--------------
    proname          | pymax
    pronamespace     | 2200
    proowner         | 10
    prolang          | 16388
    procost          | 100
    prorows          | 0
    provariadic      | 0
    protransform     | -
    proisagg         | f
    proiswindow      | f
    prosecdef        | f
    proleakproof     | f
    proisstrict      | f
    proretset        | f
    provolatile      | v
    pronargs         | 2
    pronargdefaults  | 0
    prorettype       | 23
    proargtypes      | 23 23
    proallargtypes   |
    proargmodes      |
    proargnames      | {a,b}
    proargdefaults   |
    prosrc           |
                     | import numpy
                     | if a > b:
                     |     return a;
                     | else:
                     |     return b;
                     |
    probin           |
    proconfig        |
    proacl           |
    prodefaultargpos |
    fencedmode       | t
    proshippable     | f
    propackage       | f
    prokind          | f
    proargsrc        |
    
  • Run the UDF

    • Create a data table:

      openGauss=# create table temp (a int ,b int) ;
      CREATE TABLE
      openGauss=# insert into temp values (1,2),(2,3),(3,4),(4,5),(5,6);
      INSERT 0 5
      
    • Run the UDF:

      openGauss=# select pymax(a,b) from temp;
       pymax
      -------
           2
           3
           4
           5
           6
      (5 rows)

②DB4AI-Snapshots data version management

DB4AI-Snapshots is the DB4AI module's facility for managing dataset versions. With the DB4AI-Snapshots component, developers can perform data preprocessing operations such as feature filtering and type conversion simply and quickly, and can also version-control training datasets much as git does. After a table snapshot is created, it can be used like a view; once published, however, it is frozen into immutable static data. To modify the contents of a table snapshot, a new snapshot with a different version number must be created.

Lifecycle of DB4AI-Snapshots

The states of a DB4AI snapshot are published, archived, and purged. published marks a snapshot as released and ready for use. archived means the snapshot is in its "archival period": it is generally no longer used to train new models; instead, its older data is used to validate new models. purged means the snapshot has been deleted and can no longer be retrieved from the database system.

Note that the snapshot management function exists to provide users with uniform training data, so that different team members can retrain machine learning models on the same given data, which eases collaboration. Therefore, scenarios that do not allow user data to be shared, such as private users and separation-of-duty mode (enableSeparationOfDuty=ON), do not support the snapshot feature.

A user creates a table snapshot with the "CREATE SNAPSHOT" statement; a newly created snapshot is in the published state by default. There are two modes for creating table snapshots, MSS and CSS, configured through the GUC parameter db4ai_snapshot_mode. MSS mode is implemented with a materialization algorithm and stores the data entities of the original dataset; CSS is based on a relative (computed) algorithm and stores incremental data. The metadata of table snapshots is stored in the DB4AI system catalog and can be viewed through the db4ai.snapshot system table.
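
A hedged sketch of choosing the snapshot mode described above (the values MSS/CSS follow this paragraph; whether the parameter can be changed at session level may depend on the installation):

```sql
-- Assumed sketch: choose the materialized snapshot mode before creating snapshots.
openGauss=# SET db4ai_snapshot_mode = 'MSS';
-- Inspect the metadata of existing snapshots.
openGauss=# SELECT schema, name, published, archived FROM db4ai.snapshot;
```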

A table snapshot can be marked as archived with the "ARCHIVE SNAPSHOT" statement, and marked published again with the "PUBLISH SNAPSHOT" statement. Marking snapshot states is designed to help data scientists collaborate as a team.

When a table snapshot has outlived its value, it can be deleted with the "PURGE SNAPSHOT" statement to permanently remove its data and reclaim storage space.

DB4AI-Snapshots instructions for use

  1. Create a table and insert data.

    If a data table already exists in the database, a snapshot can be created from it. For the demonstrations below, create a new table named t1 and insert test data into it.

    create table t1 (id int, name varchar);
    insert into t1 values (1, 'zhangsan');
    insert into t1 values (2, 'lisi');
    insert into t1 values (3, 'wangwu');
    insert into t1 values (4, 'lisa');
    insert into t1 values (5, 'jack');
    

    Query the contents of the table with an SQL statement.

    SELECT * FROM t1;
    id |   name
    ----+----------
      1 | zhangsan
      2 | lisi
      3 | wangwu
      4 | lisa
      5 | jack
    (5 rows)
    
  2. Use DB4AI-Snapshots.

    • Create DB4AI-Snapshots

      • Example 1: CREATE SNAPSHOT…AS

        An example follows. The default version delimiter is "@" and the default subversion separator is ".", configurable through the GUC parameters db4ai_snapshot_version_delimiter and db4ai_snapshot_version_separator respectively.

        create snapshot s1@1.0 comment is 'first version' as select * from t1;
        schema |  name
        --------+--------
         public | s1@1.0
        (1 row)
        

        The result above indicates that snapshot s1 of the data table has been created, with version number 1.0. A created table snapshot can be queried like an ordinary view, but cannot be updated with "INSERT INTO". For example, the following statements all query the contents of version 1.0 of table snapshot s1:

        SELECT * FROM s1@1.0;
        SELECT * FROM public.s1@1.0;
        SELECT * FROM public . s1 @ 1.0;
        id |   name
        ----+----------
          1 | zhangsan
          2 | lisi
          3 | wangwu
          4 | lisa
          5 | jack
        (5 rows)
        

        You can modify the contents of table t1 with the following SQL statements:

        UPDATE t1 SET name = 'tom' where id = 4;
        insert into t1 values (6, 'john');
        insert into t1 values (7, 'tim');
        

        If table t1 is then queried again, it turns out that although the contents of t1 have changed, the query results of the snapshot version s1@1.0 have not. Since the data of t1 has changed, if the current contents of the table should become version 2.0, the snapshot s1@2.0 can be created with the following SQL statement:

        create snapshot s1@2.0 as select * from t1;
        

        The example above shows that a table snapshot freezes the contents of a table, avoiding instability in machine-learning model training caused by data changing midway, as well as lock conflicts caused by multiple users accessing and modifying the same table at the same time.

      • Example 2: CREATE SNAPSHOT…FROM

        An SQL statement can inherit from an already created table snapshot and generate a new snapshot from data modifications applied on top of it. For example:

        create snapshot s1@1.1 from @1.0 comment is 'inherits from @1.0' using (INSERT VALUES(6, 'john'), (7, 'tim'); DELETE WHERE id = 1);
        schema |  name
        --------+--------
         public | s1@1.1
        (1 row)
        

        Here, "@" is the version delimiter of the table snapshot; the from clause is followed by an existing table snapshot, written as "@" plus the version number; and the USING keyword is followed by optional operation keywords (INSERT …/UPDATE …/DELETE …/ALTER …), where the "INTO" and "FROM" clauses of "INSERT INTO" and "DELETE FROM" that refer to the snapshot name can be omitted. For details, see the AI feature functions.

        In this example, based on the foregoing snapshot s1@1.0, 2 rows are inserted and 1 row is deleted, generating the new snapshot s1@1.1. Retrieve s1@1.1:

        SELECT * FROM s1@1.1;
        id |   name
        ----+----------
          2 | lisi
          3 | wangwu
          4 | lisa
          5 | jack
          6 | john
          7 | tim
        (6 rows)
        
    • Delete a table snapshot

      purge snapshot s1@1.1;
      schema |  name
      --------+--------
       public | s1@1.1
      (1 row)
      

      At this point, data can no longer be retrieved from s1@1.1, and the snapshot's record in the db4ai.snapshot view is cleared as well. Deleting this version of the table snapshot does not affect snapshots of other versions.

    • Sample from a table snapshot

      Example: extract data from snapshot s1@2.0 with a sampling rate of 0.5.

      sample snapshot s1@2.0 stratify by name as nick at ratio .5;
      schema |    name
      --------+------------
       public | s1nick@2.0
      (1 row)
      

      This function can be used to create training and test sets, for example:

      SAMPLE SNAPSHOT s1@2.0 STRATIFY BY name AS _test AT RATIO .2, AS _train AT RATIO .8 COMMENT IS 'training';
      schema |      name
      --------+----------------
       public | s1_test@2.0
       public | s1_train@2.0
      (2 rows)
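A rough Python sketch of what stratified splitting at ratios .2/.8 does — each stratum (here, each distinct name value) contributes proportionally to every split. The function and data are illustrative; openGauss performs the sampling inside the database:

```python
# Illustrative stratified split: group rows by a key, then cut each group
# according to the requested ratios so every split preserves the stratum mix.
import random
from collections import defaultdict

def stratified_split(rows, key, ratios):
    """Split rows into len(ratios) groups, stratum by stratum."""
    strata = defaultdict(list)
    for r in rows:
        strata[key(r)].append(r)
    splits = [[] for _ in ratios]
    rng = random.Random(42)  # fixed seed so the split is reproducible
    for group in strata.values():
        rng.shuffle(group)
        start = 0.0
        for i, ratio in enumerate(ratios):
            end = start + ratio * len(group)
            splits[i].extend(group[int(round(start)):int(round(end))])
            start = end
    return splits

# 100 rows, two equally sized strata ("a" and "b")
rows = [{"id": i, "name": "a" if i % 2 else "b"} for i in range(100)]
test, train = stratified_split(rows, key=lambda r: r["name"], ratios=[0.2, 0.8])
print(len(test), len(train))  # 20 80
```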
      
    • Publish data table snapshot

      Use the following SQL statement to mark the data table snapshot s1@2.0 as published:

      publish snapshot s1@2.0;
      schema |  name
      --------+--------
       public | s1@2.0
      (1 row)
      
    • Archive data table snapshot

      The following statement marks a data table snapshot as archived:

      archive snapshot s1@2.0;
      schema |  name
      --------+--------
       public | s1@2.0
      (1 row)
      

      The db4ai.snapshot view shows the current state of data table snapshots along with other information:

      select * from db4ai.snapshot;
      id | parent_id | matrix_id | root_id | schema |    name    | owner |                 commands                 | comment | published | archived |          created           | row_count
      ----+-----------+-----------+---------+--------+------------+-------+------------------------------------------+---------+-----------+----------+----------------------------+-----------
        1 |           |           |       1 | public | t1@1.0     | omm   | {"select *","from t1 where id > 3",NULL} |         | t         | f        | 2021-04-17 09:24:11.139868 |         2
        2 |         1 |           |       1 | public | t1nick@1.0 | omm   | {"SAMPLE nick .5 {name}"}                |         | f         | f        | 2021-04-17 10:02:31.73923  |         0
      
  3. Exception scenarios

    • When the data table snapshot, or its record under db4ai.snapshot, does not exist:

      purge snapshot s1@2.0;
      publish snapshot s1@2.0;
      ---------
      ERROR:  snapshot public."s1@2.0" does not exist
      CONTEXT:  PL/pgSQL function db4ai.publish_snapshot(name,name) line 11 at assignment
               
      archive snapshot s1@2.0;
      ----------
      ERROR:  snapshot public."s1@2.0" does not exist
      CONTEXT:  PL/pgSQL function db4ai.archive_snapshot(name,name) line 11 at assignment
      
    • When purging a snapshot that other snapshots depend on, the dependent snapshots must be purged first.

      purge snapshot t1@1.0;
      ERROR:  cannot purge root snapshot 'public."t1@1.0"' having dependent snapshots
      HINT:  purge all dependent snapshots first
      CONTEXT:  referenced column: purge_snapshot_internal
      SQL statement "SELECT db4ai.purge_snapshot_internal(i_schema, i_name)"
      PL/pgSQL function db4ai.purge_snapshot(name,name) line 71 at PERFORM
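The dependency rule behind this error can be sketched as a simple parent-child check over the snapshot metadata. The structures below mirror the `id`/`parent_id` columns of the db4ai.snapshot view but are illustrative, not the internal implementation:

```python
# Illustrative purge check: a snapshot is purgeable only if no other
# snapshot lists it as its parent. Field names follow the db4ai.snapshot
# view; the check itself is a sketch, not openGauss code.
snapshots = [
    {"id": 1, "parent_id": None, "name": "t1@1.0"},
    {"id": 2, "parent_id": 1,    "name": "t1nick@1.0"},
]

def can_purge(snap_id, snapshots):
    """Return True when no snapshot depends on snap_id as its parent."""
    return not any(s["parent_id"] == snap_id for s in snapshots)

print(can_purge(1, snapshots))  # False: t1nick@1.0 depends on t1@1.0
print(can_purge(2, snapshots))  # True: the leaf snapshot can be purged first
```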
      
  4. Related GUC parameters

    • db4ai_snapshot_mode:

      Snapshots support two modes: MSS (materialized mode, which stores full data entities) and CSS (computed mode, which stores only incremental information). A snapshot can be switched between the MSS and CSS modes; the default is MSS.
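The difference between the two modes can be illustrated with a small sketch: MSS materializes the child snapshot's full row set, while CSS stores only the delta against the parent and computes the rows on read. This shows the idea only, not the actual openGauss storage format:

```python
# Illustrative contrast between MSS (materialized copy) and CSS (stored
# delta). The dictionaries and helper are hypothetical, chosen to make the
# two modes' equivalence easy to verify.
parent = {1: "zhangsan", 2: "lisi", 3: "wangwu"}  # made-up parent rows

# MSS: the child snapshot materializes the full row set after modification.
mss_child = dict(parent)
mss_child.pop(1)            # DELETE WHERE id = 1
mss_child[4] = "lisa"       # INSERT

# CSS: the child stores only incremental commands; rows are derived on read.
css_delta = {"deleted": {1}, "inserted": {4: "lisa"}}

def css_read(parent, delta):
    """Reconstruct the child's rows from the parent plus the stored delta."""
    rows = {k: v for k, v in parent.items() if k not in delta["deleted"]}
    rows.update(delta["inserted"])
    return rows

# Both modes expose the same logical snapshot; they differ only in storage.
print(sorted(css_read(parent, css_delta)))  # [2, 3, 4]
```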

    • db4ai_snapshot_version_delimiter:

      This parameter sets the version delimiter in data table snapshot names. The default is "@".

    • db4ai_snapshot_version_separator:

      This parameter sets the subversion separator in data table snapshot names. The default is ".".
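Taken together, the two delimiters determine how a snapshot name such as s1@2.0 decomposes into table name, version, and subversion. A sketch of that parsing (illustrative only; the function is not part of the openGauss API):

```python
# Hypothetical parser for snapshot names, using the default delimiters:
# '@' (db4ai_snapshot_version_delimiter) separates table from version,
# '.' (db4ai_snapshot_version_separator) separates version from subversion.

def parse_snapshot_name(name, version_delim="@", subversion_sep="."):
    table, _, version = name.partition(version_delim)
    major, _, minor = version.partition(subversion_sep)
    return table, major, minor

print(parse_snapshot_name("s1@2.0"))  # ('s1', '2', '0')
```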

  5. Details of data table snapshots under the DB4AI schema are recorded in db4ai.snapshot.

    openGauss=# \d db4ai.snapshot
                           Table "db4ai.snapshot"
      Column   |            Type             |         Modifiers
    -----------+-----------------------------+---------------------------
     id        | bigint                      |
     parent_id | bigint                      |
     matrix_id | bigint                      |
     root_id   | bigint                      |
     schema    | name                        | not null
     name      | name                        | not null
     owner     | name                        | not null
     commands  | text[]                      | not null
     comment   | text                        |
     published | boolean                     | not null default false
     archived  | boolean                     | not null default false
     created   | timestamp without time zone | default pg_systimestamp()
     row_count | bigint                      | not null
    Indexes:
        "snapshot_pkey" PRIMARY KEY, btree (schema, name) TABLESPACE pg_default
        "snapshot_id_key" UNIQUE CONSTRAINT, btree (id) TABLESPACE pg_default
    

  Note: The DB4AI namespace is the private domain of this feature. Creating functional indexes in the DB4AI namespace is not supported.



Copyright notice: this article was created by [Gauss squirrel Club]. Please include the original link when reposting:
https://yzsam.com/2022/162/202206111535103670.html