当前位置：网站首页>How to implement complex SQL such as distributed database sub query and join?

How to implement complex SQL such as distributed database sub query and join?

2022-07-05 09:43:00 【Tdengine】

author | Liu Yao

edit | Er Yue

Small T Reading guide ： Using or implementing distributed databases （Distributed Database） when , The data of a table will be distributed to each database node according to a certain strategy , What follows is the complexity of multi node data query , for example Join And subquery . This article will explain distributed database sub queries and Join Etc SQL How to achieve , To help you better solve the above problems .

First of all, let's briefly talk about SQL Implementation process of ：

SQL ==> Parser ==> Translate & Semantic Check ==> Optimizer ==> Coordinator ==> Executer

Parser The result is a syntax tree , namely Abstract Syntax Tree;
Translate & Semantic Check, This step will start from Catalog Read metadata , Improve the syntax tree with metadata , Easy Optimizer Use . for example ： common select * from tableA, In this step, I usually put “*” Switch to tableA The column of ;
Optimizer The optimized logical execution plan is generated , namely Optimized Logical Plan, The execution plan is a directed acyclic graph , namely DAG;
Coordinator Be responsible for distributing the logical execution plan to each node for calculation ;
Executer Will convert the logical execution plan into the physical execution plan , namely Physical Plan.

There are many open source databases , We can combine the source code of some mainstream databases to understand subqueries and Join How to implement , For example, relational database ：Impala、Presto、ClickHouse, Time series database （Time- Series Database）： TDengine etc. . The following sub queries and Join Two parts are analyzed .

Sub query part

There are many logical execution plans Node, Corresponding to SQL Various calculations in , Include Scan Node、Join Node、Aggregate Node、Sort Node、Project Node wait , The operator of the corresponding physical execution plan is Scan Operator 、Join Operator、Aggregate Operator、Sort Operator、Project Operator wait . The database generally does not have an operator to calculate sub queries , This is because after converting the abstract syntax tree into a logical execution plan , There is no concept of subquery , Its operation logic is that data operators are transferred layer by layer from bottom to top , And calculate layer by layer , Subqueries are not specifically calculated . Let's talk about some related processing of sub query in distributed database .

First , The optimizer of distributed database will flatten the subquery , This method is generally divided into two types , One is directly in the syntax tree （AST） Flatten subqueries on （Subquery Flatten）, The other is to flatten the logic execution plan . The two approaches are essentially the same , Ensure semantic equivalence . But not all subqueries can be flattened , There are several special cases as follows ：

Both the child query and the parent query have aggregate functions
Subqueries have aggregate functions , And the parent query has group calculation （Group By）
Subqueries have aggregate functions , And the results of the sub query aggregation function are used to correlate （Join） Table of parent query
The parent query has an aggregate function , And the sub query has group calculation （Group By）
Subquery has Limit（ Limit the number of rows that return results ）, And the parent query has filter conditions （Where） Or group calculation 、 Sort （Order By）
other

be based on AST When the subquery is flattened , You need to traverse the syntax data first , And judge according to the rules , And then eliminate unnecessary subqueries . Flatten the subquery when generating the logical execution plan , It's generating Plan Node You need to remove the redundant first Node, for instance ,SQL：select colA from (select * from tA) group by colA;

Generally speaking , The logical execution plan will have multiple sub plans , Sub plans are usually generated when network transmission is required , It should be noted that there is no necessary connection between sub plans and sub queries , That is, a sub query does not necessarily correspond to a sub plan .

Join part

First , The distributed database will be used to Join To optimize , Include Join eliminate （ For example, based on the primary key and foreign key, unnecessary Join）、 External connection elimination （Outer Join Turn into Inner Join）、Join Order Optimize （ Statistics based on data , Dynamic programming algorithm 、 Greedy algorithm or genetic algorithm, etc Table Of Join The order ） wait .

Tell me more Join Three basic algorithms ：Hash Join（ There must be equivalent connection conditions , for example t1.colA = t2.colB）、Merge Join（ The data of the left table and the right table are ordered , Ordered by the columns in the join condition ）、Nestloop Join（ Contains non equivalent connection conditions and the data is out of order ）. In practice , Will mix the three algorithms , This is because Join Conditions can contain both equivalent and non equivalent connections , for example t1.colA = t2.colB AND t1.colC > t2.colC

Hash Join

It's going on Join Order To optimize the , The optimizer adjusts the order of the left and right tables , Usually put the small watch on the right , The big watch is on the left , And choose Join Pattern ：Shuffle Join（ According to the relevant conditions , meanwhile shuffle Left table and right table , And then calculate Join） or Boradcast Join（ Broadcast the right table to the node where the left table is located , Note that the left watch does not move , And then calculate Join）. Generally, the choice is based on the cost Join Order Optimize , But considering that the statistical information may have errors , So many databases can be accessed through Hint、Query Option Methods such as , It's up to the user to specify Join The order 、Join Patterns, etc .

Hash Join Is the most commonly used Join Algorithm , Most databases implement Hash Join. This algorithm will read the right table first , And put the data of the right table into Hash Map in , If you can't save it, it will be put into external storage . Usually , Each database will implement its own Hash Map, Rarely used directly STL or Boost Etc. in the third-party library Hash Map, There are two main reasons ：

Customized Hash Map Will improve Join Calculation speed .
Customized Hash Map More accurate control of memory usage , When out of memory , Can use external memory , Customized Hash Map According to Join Algorithm , Optimize Swap Mechanism , Reduce Swap The amount of data .Hash Map The structure is as follows ：

The right table may contain duplicate data , So there will be Duplicate Node. The duplicate data here refers to Join Key（Join The column corresponding to the condition ） Duplicate data for , And the other columns do not repeat , So cache them separately . Notice in the picture above , It's through Hash Algorithmic solution Hash The question of conflict , That is, it will not put different Join Key In the same bucket . Of course , In practice, there are also different Join Key In the same barrel , That requires traversal List To determine what to look for Join Key Whether there is .

Merge Join

Merge Join It is generally used when the data of the left table and the right table are in order . For example, temporal database TDengine, The data is sorted by timestamp , Then use the timestamp column to do Join when ,TDengine database Will use Merge Join To calculate , One advantage of this is that the processing speed is very fast , And the memory consumption is very small .

Nestloop Join

such Join The algorithm is very slow , But it is indispensable for the full function database . When using this algorithm , It can be combined with indexing to speed up .

In conclusion ,Hash Join Most widely used , It is applicable to many data analysis scenarios , And most databases support ;Merge Join It is generally used when the data in the left and right tables are in order , No need to cache data , So very little memory is used , There are three calculation speeds Join The fastest algorithm ;Nestloop Join Poor performance , Distributed databases are rarely used , Some distributed databases do not support , Can be accelerated by indexing Nestloop Join.

At the end

Above, we pair subquery and Join Two kinds of complexity SQL The implementation method of , You can combine the source code of some open source databases to understand , image TDengine The source code can be found in GitHub See above , If you have a complex time series database SQL Interested in implementing , This is a good object to observe . You are also welcome to communicate in the comment area below .

Want to know more TDengine Database Specific details of , Welcome to GitHub View the relevant source code on .

原网站

版权声明
本文为[Tdengine]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/186/202207050927513652.html