当前位置:网站首页>How to implement complex SQL such as distributed database sub query and join?
How to implement complex SQL such as distributed database sub query and join?
2022-07-05 09:43:00 【Tdengine】
author | Liu Yao
edit | Er Yue
Small T Reading guide : Using or implementing distributed databases (Distributed Database) when , The data of a table will be distributed to each database node according to a certain strategy , What follows is the complexity of multi node data query , for example Join And subquery . This article will explain distributed database sub queries and Join Etc SQL How to achieve , To help you better solve the above problems .
First of all, let's briefly talk about SQL Implementation process of :
SQL ==> Parser ==> Translate & Semantic Check ==> Optimizer ==> Coordinator ==> Executer
- Parser The result is a syntax tree , namely Abstract Syntax Tree;
- Translate & Semantic Check, This step will start from Catalog Read metadata , Improve the syntax tree with metadata , Easy Optimizer Use . for example : common select * from tableA, In this step, I usually put “*” Switch to tableA The column of ;
- Optimizer The optimized logical execution plan is generated , namely Optimized Logical Plan, The execution plan is a directed acyclic graph , namely DAG;
- Coordinator Be responsible for distributing the logical execution plan to each node for calculation ;
- Executer Will convert the logical execution plan into the physical execution plan , namely Physical Plan.
There are many open source databases , We can combine the source code of some mainstream databases to understand subqueries and Join How to implement , For example, relational database :Impala、Presto、ClickHouse, Time series database (Time- Series Database): TDengine etc. . The following sub queries and Join Two parts are analyzed .
Sub query part
There are many logical execution plans Node, Corresponding to SQL Various calculations in , Include Scan Node、Join Node、Aggregate Node、Sort Node、Project Node wait , The operator of the corresponding physical execution plan is Scan Operator 、Join Operator、Aggregate Operator、Sort Operator、Project Operator wait . The database generally does not have an operator to calculate sub queries , This is because after converting the abstract syntax tree into a logical execution plan , There is no concept of subquery , Its operation logic is that data operators are transferred layer by layer from bottom to top , And calculate layer by layer , Subqueries are not specifically calculated . Let's talk about some related processing of sub query in distributed database .
First , The optimizer of distributed database will flatten the subquery , This method is generally divided into two types , One is directly in the syntax tree (AST) Flatten subqueries on (Subquery Flatten), The other is to flatten the logic execution plan . The two approaches are essentially the same , Ensure semantic equivalence . But not all subqueries can be flattened , There are several special cases as follows :
- Both the child query and the parent query have aggregate functions
- Subqueries have aggregate functions , And the parent query has group calculation (Group By)
- Subqueries have aggregate functions , And the results of the sub query aggregation function are used to correlate (Join) Table of parent query
- The parent query has an aggregate function , And the sub query has group calculation (Group By)
- Subquery has Limit( Limit the number of rows that return results ), And the parent query has filter conditions (Where) Or group calculation 、 Sort (Order By)
- other
be based on AST When the subquery is flattened , You need to traverse the syntax data first , And judge according to the rules , And then eliminate unnecessary subqueries . Flatten the subquery when generating the logical execution plan , It's generating Plan Node You need to remove the redundant first Node, for instance ,SQL:select colA from (select * from tA) group by colA;
Generally speaking , The logical execution plan will have multiple sub plans , Sub plans are usually generated when network transmission is required , It should be noted that there is no necessary connection between sub plans and sub queries , That is, a sub query does not necessarily correspond to a sub plan .
Join part
First , The distributed database will be used to Join To optimize , Include Join eliminate ( For example, based on the primary key and foreign key, unnecessary Join)、 External connection elimination (Outer Join Turn into Inner Join)、Join Order Optimize ( Statistics based on data , Dynamic programming algorithm 、 Greedy algorithm or genetic algorithm, etc Table Of Join The order ) wait .
Tell me more Join Three basic algorithms :Hash Join( There must be equivalent connection conditions , for example t1.colA = t2.colB)、Merge Join( The data of the left table and the right table are ordered , Ordered by the columns in the join condition )、Nestloop Join( Contains non equivalent connection conditions and the data is out of order ). In practice , Will mix the three algorithms , This is because Join Conditions can contain both equivalent and non equivalent connections , for example t1.colA = t2.colB AND t1.colC > t2.colC
Hash Join
It's going on Join Order To optimize the , The optimizer adjusts the order of the left and right tables , Usually put the small watch on the right , The big watch is on the left , And choose Join Pattern :Shuffle Join( According to the relevant conditions , meanwhile shuffle Left table and right table , And then calculate Join) or Boradcast Join( Broadcast the right table to the node where the left table is located , Note that the left watch does not move , And then calculate Join). Generally, the choice is based on the cost Join Order Optimize , But considering that the statistical information may have errors , So many databases can be accessed through Hint、Query Option Methods such as , It's up to the user to specify Join The order 、Join Patterns, etc .
Hash Join Is the most commonly used Join Algorithm , Most databases implement Hash Join. This algorithm will read the right table first , And put the data of the right table into Hash Map in , If you can't save it, it will be put into external storage . Usually , Each database will implement its own Hash Map, Rarely used directly STL or Boost Etc. in the third-party library Hash Map, There are two main reasons :
- Customized Hash Map Will improve Join Calculation speed .
- Customized Hash Map More accurate control of memory usage , When out of memory , Can use external memory , Customized Hash Map According to Join Algorithm , Optimize Swap Mechanism , Reduce Swap The amount of data .Hash Map The structure is as follows :
The right table may contain duplicate data , So there will be Duplicate Node. The duplicate data here refers to Join Key(Join The column corresponding to the condition ) Duplicate data for , And the other columns do not repeat , So cache them separately . Notice in the picture above , It's through Hash Algorithmic solution Hash The question of conflict , That is, it will not put different Join Key In the same bucket . Of course , In practice, there are also different Join Key In the same barrel , That requires traversal List To determine what to look for Join Key Whether there is .
Merge Join
Merge Join It is generally used when the data of the left table and the right table are in order . For example, temporal database TDengine, The data is sorted by timestamp , Then use the timestamp column to do Join when ,TDengine database Will use Merge Join To calculate , One advantage of this is that the processing speed is very fast , And the memory consumption is very small .
Nestloop Join
such Join The algorithm is very slow , But it is indispensable for the full function database . When using this algorithm , It can be combined with indexing to speed up .
In conclusion ,Hash Join Most widely used , It is applicable to many data analysis scenarios , And most databases support ;Merge Join It is generally used when the data in the left and right tables are in order , No need to cache data , So very little memory is used , There are three calculation speeds Join The fastest algorithm ;Nestloop Join Poor performance , Distributed databases are rarely used , Some distributed databases do not support , Can be accelerated by indexing Nestloop Join.
At the end
Above, we pair subquery and Join Two kinds of complexity SQL The implementation method of , You can combine the source code of some open source databases to understand , image TDengine The source code can be found in GitHub See above , If you have a complex time series database SQL Interested in implementing , This is a good object to observe . You are also welcome to communicate in the comment area below .
Want to know more TDengine Database Specific details of , Welcome to GitHub View the relevant source code on .
边栏推荐
- Why do offline stores need cashier software?
- 观测云与 TDengine 达成深度合作,优化企业上云体验
- Android 隐私沙盒开发者预览版 3: 隐私安全和个性化体验全都要
- Principle and performance analysis of lepton lossless compression
- The popularity of B2B2C continues to rise. What are the benefits of enterprises doing multi-user mall system?
- [how to disable El table]
- 【sourceTree配置SSH及使用】
- 项目实战 | Excel导出功能
- Kotlin introductory notes (I) kotlin variables and non variables
- TDengine 连接器上线 Google Data Studio 应用商店
猜你喜欢
项目实战 | Excel导出功能
Understanding of smt32h7 series DMA and DMAMUX
How to improve the operation efficiency of intra city distribution
High performance spark_ Transformation performance
LeetCode 556. Next bigger element III
C语言-从键盘输入数组二维数组a,将a中3×5矩阵中第3列的元素左移到第0列,第3列以后的每列元素行依次左移,原来左边的各列依次绕到右边
SMT32H7系列DMA和DMAMUX的一点理解
TDengine 连接器上线 Google Data Studio 应用商店
[ManageEngine] how to make good use of the report function of OpManager
VS Code问题:长行的长度可通过 “editor.maxTokenizationLineLength“ 进行配置
随机推荐
Analysis of eventbus source code
[object array A and object array B take out different elements of ID and assign them to the new array]
Dry goods sorting! How about the development trend of ERP in the manufacturing industry? It's enough to read this article
NIPS2021 | 超越GraphCL,GNN+对比学习的节点分类新SOTA
移动端异构运算技术-GPU OpenCL编程(进阶篇)
初识结构体
Kotlin introductory notes (VI) interface and function visibility modifiers
云计算技术热点
一文读懂TDengine的窗口查询功能
Applet global style configuration window
Viewpager pageradapter notifydatasetchanged invalid problem
高性能Spark_transformation性能
Kotlin introductory notes (I) kotlin variables and non variables
Why do offline stores need cashier software?
How to improve the operation efficiency of intra city distribution
TDengine 连接器上线 Google Data Studio 应用商店
[reading notes] Figure comparative learning gnn+cl
Go 语言使用 MySQL 的常见故障分析和应对方法
Deep understanding of C language pointer
[team PK competition] the task of this week has been opened | question answering challenge to consolidate the knowledge of commodity details