Paper Introduction | Applications of Machine Learning in Database Cardinality Estimation
2022-07-02 19:04:00 【PKUMOD】
The definition and significance of cardinality estimation
Cardinality estimation is the problem of estimating how many tuples a query will return, without actually executing it. Beyond single-table queries, we often also need to estimate the cardinality of complex queries over more than one table (involving multi-table joins, range queries, and so on); all of these fall within the research scope of database cardinality estimation.
The role of cardinality estimation can be summarized in the following two aspects:
(1) Cardinality estimation is a key component of database query optimization. Because the cardinality of a subquery represents the size of an intermediate result, most existing cost-based query optimization techniques rely heavily on subquery cardinality estimates when costing candidate plans. A good cardinality estimation model helps the optimizer choose better execution plans.
(2) Cardinality estimation can answer queries that only need the number of results. For example, a user may want to know how many college students in China study computer science; such a query cares only about the count, not about the concrete value of each result. For these queries, a cardinality estimation method can quickly return an approximation of the result size.
Classification of current methods
Cardinality estimation is an important and difficult problem. For decades, researchers have tried various methods and techniques to improve the accuracy and robustness of estimates. Broadly, existing methods can be divided into traditional methods and machine-learning-based methods. Below we briefly introduce some representative works in each category.
Traditional cardinality estimation methods
Traditional cardinality estimation methods generally fall into two main families: synopsis-based methods and sampling-based methods.
Synopsis-based methods collect statistics about the database in advance and, relying on simplifying assumptions such as independence, can answer cardinality queries easily and quickly.
For example, histogram-based methods summarize each column of a table into an equal-width or equal-depth histogram; then, given the query conditions and under the assumption that columns are mutually independent, they estimate the size of the query result. Of course, this independence assumption sometimes works poorly, which multi-dimensional histogram methods address.
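To make the idea concrete, here is a minimal sketch of equal-width histogram estimation under the independence assumption; the table, columns, and bucket counts are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
table_size = 100_000

# Hypothetical table with two columns, "age" and "salary"
age = rng.integers(18, 70, table_size)
salary = rng.normal(50_000, 15_000, table_size)

# Summarize each column into an equal-width histogram
age_hist, age_edges = np.histogram(age, bins=20)
sal_hist, sal_edges = np.histogram(salary, bins=20)

def range_selectivity(hist, edges, lo, hi):
    # Fraction of tuples falling in [lo, hi), assuming values are
    # spread uniformly inside each bucket
    overlap = np.clip(np.minimum(edges[1:], hi) - np.maximum(edges[:-1], lo),
                      0.0, None)
    bucket_width = edges[1:] - edges[:-1]
    return float(((overlap / bucket_width) * hist).sum() / hist.sum())

# Independence assumption: multiply the per-column selectivities
selectivity = (range_selectivity(age_hist, age_edges, 20, 30)
               * range_selectivity(sal_hist, sal_edges, 40_000, 60_000))
estimated_cardinality = selectivity * table_size
```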
Sketching also belongs to the synopsis-based family: sketches use bitmaps or hashing to estimate the number of distinct elements in the data and are especially suited to streaming scenarios. Sketch methods such as LogLog and HyperLogLog focus on saving memory and improving the robustness of the estimate.
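For illustration, a bare-bones HyperLogLog in this spirit; this sketch omits the small- and large-range corrections used in production implementations, and the register count and hash function are arbitrary choices:

```python
import hashlib

class HyperLogLog:
    def __init__(self, p: int = 10):
        self.p = p
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        # 64-bit hash: low p bits pick a register, the rest give the "rank"
        h = int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")
        idx = h & (self.m - 1)
        rest = h >> self.p
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self) -> float:
        # Raw HLL estimate: alpha * m^2 / sum(2^-register)
        alpha = 0.7213 / (1 + 1.079 / self.m)
        return alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)

hll = HyperLogLog()
for i in range(50_000):
    hll.add(f"user-{i % 20_000}")        # 20,000 distinct values
print(round(hll.estimate()))             # roughly 20,000, within a few percent
```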
Sampling-based methods randomly draw a certain fraction or number of tuples from the original table; the cardinality of a query over the original database is then estimated by evaluating the query on the sample and dividing the result size by the sampling ratio. Sampling works well when the sampling strategy reflects the distribution of most of the data, but because of its randomness it can easily fail to produce any matching results on the sample. To mitigate this, researchers have proposed index-based sampling [1]: when multiple tables are joined, the ranges to be sampled are chosen according to the query's join conditions, which reduces the occurrence of empty sample results. [Figure: index-based join sampling, omitted]
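A minimal sketch of plain (non-index-based) sampling estimation on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(1)
table = rng.integers(0, 1000, size=1_000_000)   # hypothetical single-column table

sample_ratio = 0.01
sample = rng.choice(table, size=int(len(table) * sample_ratio), replace=False)

# Evaluate the predicate on the sample, then scale up by the sampling ratio
matches = np.count_nonzero((sample >= 100) & (sample < 120))
estimated_cardinality = matches / sample_ratio   # true answer is about 20,000
```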
Query-driven learned cardinality estimation methods
Since cardinality estimation can be viewed as a regression problem from "query" to "cardinality" (as long as the database is not updated), we can transfer supervised learning techniques to this problem and directly learn the query-to-cardinality mapping. The general workflow of query-driven methods is shown below. [Figure: general workflow of query-driven methods, omitted]
The main difficulties lie in obtaining training data and in representing query statements.
Generally, once the database is given, we can assume the schema is known (for example, which columns are used to join a given pair of tables). We can then randomly generate query statements, actually execute them, and record their true cardinalities; these pairs constitute the training data for the model. Alternatively, queries can be taken directly from the database's query log. Note that the training data must be representative: otherwise the model cannot learn the correspondence between common query patterns and their cardinalities in the production workload, and prediction accuracy suffers.
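As a minimal sketch of this labeling loop (the database file, the "users" table, and the "age" column are all hypothetical):

```python
import random
import sqlite3

conn = sqlite3.connect("example.db")   # hypothetical database with a "users" table

def random_range_query() -> str:
    # Randomly instantiate a range predicate on the "age" column
    lo = random.randint(18, 60)
    return f"SELECT COUNT(*) FROM users WHERE age >= {lo} AND age < {lo + 5}"

training_data = []
for _ in range(1000):
    sql = random_range_query()
    true_cardinality = conn.execute(sql).fetchone()[0]  # label: actually run it
    training_data.append((sql, true_cardinality))
```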
Query statements are represented mainly by the tables they query, their predicate conditions, their join conditions, and so on; more complex cases include "LIKE" queries with regular-expression-style patterns.
Work [2] applies regression models to estimate query cardinality. It groups user inputs into semantically related classes called templates; each template has adjustable parameters that instantiate concrete query statements, so different parameter settings yield different queries. The models in this work are relatively simple, mainly testing linear regression, model trees, and locally weighted regression.
In 2019, in work [3], researchers proposed the multi-set convolutional network (MSCN). It represents a query as a triple (T, J, P), whose elements are the set of tables queried, the set of join conditions, and the set of predicates, each with a corresponding encoding. The model fuses the table, join, and predicate information and passes it through several network layers to obtain the final cardinality estimate. Note that the randomly generated training queries contain at most two join conditions; the researchers hoped the model would generalize automatically to more joins after learning from a small number of patterns. [Figure: overall MSCN architecture, omitted]
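An untrained toy sketch of the multi-set idea: each set (tables, joins, predicates) is encoded by a shared per-element MLP followed by average pooling, and the pooled vectors are concatenated for the final prediction. All feature dimensions and weights below are hypothetical stand-ins for the learned model:

```python
import numpy as np

rng = np.random.default_rng(2)
D_T, D_J, D_P, H = 8, 6, 12, 16   # hypothetical feature/hidden dimensions

def mlp(x, w1, b1, w2, b2):
    # Two-layer MLP with ReLU, applied to one set element
    h = np.maximum(x @ w1 + b1, 0.0)
    return np.maximum(h @ w2 + b2, 0.0)

params = {name: (rng.normal(0, 0.1, (d, H)), np.zeros(H),
                 rng.normal(0, 0.1, (H, H)), np.zeros(H))
          for name, d in [("T", D_T), ("J", D_J), ("P", D_P)]}

def encode_set(elems, name):
    # Shared MLP per element, then average pooling over the set
    if not elems:
        return np.zeros(H)
    return np.mean([mlp(e, *params[name]) for e in elems], axis=0)

def estimate_log_cardinality(tables, joins, predicates, w_out):
    z = np.concatenate([encode_set(tables, "T"),
                        encode_set(joins, "J"),
                        encode_set(predicates, "P")])
    return float(z @ w_out)   # MSCN-style models predict log(cardinality)

w_out = rng.normal(0, 0.1, 3 * H)
y = estimate_log_cardinality([rng.random(D_T)], [rng.random(D_J)],
                             [rng.random(D_P), rng.random(D_P)], w_out)
```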
In work [4], the authors combine cardinality estimation and query-plan cost estimation in one end-to-end learned model. It relies on the database optimizer to generate a physical plan for the input query, and the training data consists of triples of the form (query plan, exact cardinality, exact plan cost). Because a query plan is tree-structured, the model uses a tree LSTM: in short, each node in the plan tree receives information from its children, integrates it with its own information, and the output at the root is finally fed into two heads that predict the cardinality and the cost. The work also proposes pooling predicate conditions separately with MIN or MAX according to their logical connective (AND, OR, etc.); see the paper for details. [Figure: workflow of the end-to-end model, omitted]
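A toy stand-in for the tree-structured encoder, using a plain recursive cell instead of the paper's tree LSTM; the node features, shapes, and weights are hypothetical:

```python
import numpy as np

class PlanNode:
    def __init__(self, features, children=()):
        self.features = features          # encoded operator/predicate vector
        self.children = list(children)

def encode_plan(node, w_self, w_child):
    # Aggregate the children's representations, then mix in the node's own features
    child_sum = sum((encode_plan(c, w_self, w_child) for c in node.children),
                    start=np.zeros(w_self.shape[1]))
    return np.tanh(node.features @ w_self + child_sum @ w_child)

rng = np.random.default_rng(3)
D, H = 8, 16
w_self = rng.normal(0, 0.1, (D, H))
w_child = rng.normal(0, 0.1, (H, H))
w_card = rng.normal(0, 0.1, H)
w_cost = rng.normal(0, 0.1, H)

# A two-level plan: one join node over two scan nodes
plan = PlanNode(rng.random(D),
                [PlanNode(rng.random(D)), PlanNode(rng.random(D))])
root = encode_plan(plan, w_self, w_child)
log_card, log_cost = root @ w_card, root @ w_cost   # the two prediction heads
```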
Data-driven learned cardinality estimation methods
Unlike query-driven methods, data-driven cardinality estimation methods are unsupervised: they learn the data distribution directly from the tables themselves, so preparing training data is simple.
In work [5], the authors use kernel density estimation (KDE) to estimate query cardinality. The idea is to randomly select several tuples from the table; the closer a query point lies to the sampled tuples, the more likely the query is to have matching results. The authors use a Gaussian kernel whose spread (bandwidth) is a learnable parameter. To handle database updates, they use reservoir sampling for insertions and handle deletions by monitoring how removing a sampled tuple affects estimation quality (if deleting a tuple improves the estimate, it is removed). Since estimation only requires evaluating the kernel at each sample tuple and summing, the computation can be accelerated on a GPU. In follow-up work, the authors extended KDE to cardinality estimation over multi-table joins [6]. [Figure: main idea of kernel density estimation, omitted]
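A one-dimensional sketch of the KDE estimate; in [5] the bandwidth is learned (self-tuned), whereas here it is fixed, and the sampled column is hypothetical:

```python
import numpy as np
from scipy.stats import norm

def kde_selectivity(samples, lo, hi, bandwidth):
    # Each sampled tuple contributes a Gaussian; the mass each one places
    # inside [lo, hi] is averaged to approximate the matching fraction
    mass = (norm.cdf(hi, loc=samples, scale=bandwidth)
            - norm.cdf(lo, loc=samples, scale=bandwidth))
    return float(np.mean(mass))

table_size = 1_000_000
samples = np.random.default_rng(4).normal(40, 10, size=512)  # sampled "age" column
estimated_cardinality = table_size * kde_selectivity(samples, 20, 30, bandwidth=2.0)
```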
There are also methods that estimate cardinality with autoregressive models. In work [7], the authors propose the Naru model, which applies sampling and masking techniques during training and inference to improve accuracy. The authors later extended the method to multi-table joins [8].
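The core idea is the chain-rule factorization p(x1, ..., xn) = p(x1) p(x2 | x1) ... p(xn | x1, ..., xn-1). Below is a toy stand-in with explicit conditional tables; Naru instead parameterizes these conditionals with a deep autoregressive network and handles range predicates by sampling:

```python
import numpy as np

# Toy conditional tables standing in for the learned model.
# p_x1[a] = P(x1 = a); p_x2_given_x1[a, b] = P(x2 = b | x1 = a)
p_x1 = np.array([0.5, 0.3, 0.2])
p_x2_given_x1 = np.array([[0.7, 0.3],
                          [0.4, 0.6],
                          [0.1, 0.9]])

def point_selectivity(a, b):
    # Chain rule: P(x1 = a, x2 = b) = P(x1 = a) * P(x2 = b | x1 = a)
    return p_x1[a] * p_x2_given_x1[a, b]

table_size = 10_000
estimate = table_size * point_selectivity(1, 0)   # cardinality of x1=1 AND x2=0
```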
Another class of data-driven methods uses sum-product networks (SPNs). The main idea is to partition the table horizontally (for example with a clustering method such as k-means) so that, within each partition, the columns are approximately independent; the cardinality can then be estimated by building a histogram or a linear regression model on each column. DeepDB [9] is a cardinality estimation method based on SPNs, and recent work has improved on it: FLAT [10], for instance, separates strongly correlated from weakly correlated attribute groups and uses the SPN machinery to partition the table further. [Figure: how DeepDB computes "the number of European customers younger than 30", omitted]
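A toy evaluation of such a structure for a query like "age < 30 AND region = 'EU'"; the cluster weights and per-cluster leaf probabilities below are made up:

```python
# Leaves hold per-cluster, per-column statistics (here: hard-coded probabilities)
p_age_lt_30 = {"c1": 0.6, "c2": 0.1}   # P(age < 30) within each cluster
p_region_eu = {"c1": 0.3, "c2": 0.8}   # P(region = 'EU') within each cluster
weights = {"c1": 0.7, "c2": 0.3}       # sum node: fraction of rows per cluster

# Product nodes multiply the columns (independent *within* a cluster);
# the sum node mixes the clusters back together.
prob = sum(w * p_age_lt_30[c] * p_region_eu[c] for c, w in weights.items())
estimated_cardinality = 1_000_000 * prob   # for a hypothetical 1M-row table
```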
Current difficulties and future research directions
First, although current learned methods can be more accurate than traditional ones, training the models also takes considerable time. How to trade accuracy against efficiency is a question that database users applying these methods must weigh.
Second, most models have difficulty adapting to updates. When large amounts of data are inserted or deleted, these models struggle to track the changes and must be retrained, which needs improvement for OLTP environments.
Third, these models still estimate poorly in multi-table join scenarios. When many tables are joined, the dependencies between different tables are hard to capture, which makes cardinality estimation very difficult. The 2021 experimental paper [11] also discusses this.
Finally, interested readers are encouraged to read the experimental paper [11], which compares the accuracy and efficiency of multiple learned methods across various scenarios and offers useful guidance for research on cardinality estimation.
References
[1] Viktor Leis, Bernhard Radke, Andrey Gubichev, Alfons Kemper, and Thomas Neumann. 2017. Cardinality Estimation Done Right: Index-Based Join Sampling. In CIDR.
[2] Tanu Malik, Randal C. Burns, and Nitesh V. Chawla. 2007. A Black-Box Approach to Query Cardinality Estimation. In CIDR. 56–67.
[3] Andreas Kipf, Thomas Kipf, Bernhard Radke, Viktor Leis, Peter A. Boncz, and Alfons Kemper. 2019. Learned Cardinalities: Estimating Correlated Joins with Deep Learning. In CIDR.
[4] Ji Sun and Guoliang Li. 2019. An end-to-end learning-based cost estimator. Proceedings of the VLDB Endowment 13, 3 (2019), 307–319.
[5] Max Heimel, Martin Kiefer, and Volker Markl. 2015. Self-tuning, GPU-accelerated kernel density models for multidimensional selectivity estimation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. 1477–1492.
[6] Martin Kiefer, Max Heimel, Sebastian Breß, and Volker Markl. 2017. Estimating join selectivities using bandwidth-optimized kernel density models. Proceedings of the VLDB Endowment 10, 13 (2017), 2085–2096.
[7] Zongheng Yang, Eric Liang, Amog Kamsetty, Chenggang Wu, Yan Duan, Xi Chen, Pieter Abbeel, Joseph M Hellerstein, Sanjay Krishnan, and Ion Stoica. 2019. Deep unsupervised cardinality estimation. Proceedings of the VLDB Endowment 13, 3 (2019), 279–292.
[8] Zongheng Yang, Amog Kamsetty, Sifei Luan, Eric Liang, Yan Duan, Xi Chen, and Ion Stoica. 2020. NeuroCard: one cardinality estimator for all tables. Proceedings of the VLDB Endowment 14, 1 (2020), 61–73.
[9] Benjamin Hilprecht, Andreas Schmidt, Moritz Kulessa, Alejandro Molina, Kristian Kersting, and Carsten Binnig. 2020. DeepDB: learn from data, not from queries! Proceedings of the VLDB Endowment 13, 7 (2020), 992–1005.
[10] Rong Zhu, Ziniu Wu, Yuxing Han, Kai Zeng, Andreas Pfadler, Zhengping Qian, Jingren Zhou, and Bin Cui. 2021. FLAT: fast, lightweight and accurate method for cardinality estimation. Proceedings of the VLDB Endowment 14, 9 (2021), 1489–1502.
[11] Xiaoying Wang et al. 2021. Are We Ready for Learned Cardinality Estimation? Proceedings of the VLDB Endowment 14, 9 (2021), 1640–1654.