当前位置:网站首页>How to implement complex SQL such as distributed database sub query and join?
How to implement complex SQL such as distributed database sub query and join?
2022-07-05 09:43:00 【Tdengine】
author | Liu Yao
edit | Er Yue
Small T Reading guide : Using or implementing distributed databases (Distributed Database) when , The data of a table will be distributed to each database node according to a certain strategy , What follows is the complexity of multi node data query , for example Join And subquery . This article will explain distributed database sub queries and Join Etc SQL How to achieve , To help you better solve the above problems .
First of all, let's briefly talk about SQL Implementation process of :
SQL ==> Parser ==> Translate & Semantic Check ==> Optimizer ==> Coordinator ==> Executer
- Parser The result is a syntax tree , namely Abstract Syntax Tree;
- Translate & Semantic Check, This step will start from Catalog Read metadata , Improve the syntax tree with metadata , Easy Optimizer Use . for example : common select * from tableA, In this step, I usually put “*” Switch to tableA The column of ;
- Optimizer The optimized logical execution plan is generated , namely Optimized Logical Plan, The execution plan is a directed acyclic graph , namely DAG;
- Coordinator Be responsible for distributing the logical execution plan to each node for calculation ;
- Executer Will convert the logical execution plan into the physical execution plan , namely Physical Plan.
There are many open source databases , We can combine the source code of some mainstream databases to understand subqueries and Join How to implement , For example, relational database :Impala、Presto、ClickHouse, Time series database (Time- Series Database): TDengine etc. . The following sub queries and Join Two parts are analyzed .
Sub query part
There are many logical execution plans Node, Corresponding to SQL Various calculations in , Include Scan Node、Join Node、Aggregate Node、Sort Node、Project Node wait , The operator of the corresponding physical execution plan is Scan Operator 、Join Operator、Aggregate Operator、Sort Operator、Project Operator wait . The database generally does not have an operator to calculate sub queries , This is because after converting the abstract syntax tree into a logical execution plan , There is no concept of subquery , Its operation logic is that data operators are transferred layer by layer from bottom to top , And calculate layer by layer , Subqueries are not specifically calculated . Let's talk about some related processing of sub query in distributed database .
First , The optimizer of distributed database will flatten the subquery , This method is generally divided into two types , One is directly in the syntax tree (AST) Flatten subqueries on (Subquery Flatten), The other is to flatten the logic execution plan . The two approaches are essentially the same , Ensure semantic equivalence . But not all subqueries can be flattened , There are several special cases as follows :
- Both the child query and the parent query have aggregate functions
- Subqueries have aggregate functions , And the parent query has group calculation (Group By)
- Subqueries have aggregate functions , And the results of the sub query aggregation function are used to correlate (Join) Table of parent query
- The parent query has an aggregate function , And the sub query has group calculation (Group By)
- Subquery has Limit( Limit the number of rows that return results ), And the parent query has filter conditions (Where) Or group calculation 、 Sort (Order By)
- other
be based on AST When the subquery is flattened , You need to traverse the syntax data first , And judge according to the rules , And then eliminate unnecessary subqueries . Flatten the subquery when generating the logical execution plan , It's generating Plan Node You need to remove the redundant first Node, for instance ,SQL:select colA from (select * from tA) group by colA;

Generally speaking , The logical execution plan will have multiple sub plans , Sub plans are usually generated when network transmission is required , It should be noted that there is no necessary connection between sub plans and sub queries , That is, a sub query does not necessarily correspond to a sub plan .
Join part
First , The distributed database will be used to Join To optimize , Include Join eliminate ( For example, based on the primary key and foreign key, unnecessary Join)、 External connection elimination (Outer Join Turn into Inner Join)、Join Order Optimize ( Statistics based on data , Dynamic programming algorithm 、 Greedy algorithm or genetic algorithm, etc Table Of Join The order ) wait .
Tell me more Join Three basic algorithms :Hash Join( There must be equivalent connection conditions , for example t1.colA = t2.colB)、Merge Join( The data of the left table and the right table are ordered , Ordered by the columns in the join condition )、Nestloop Join( Contains non equivalent connection conditions and the data is out of order ). In practice , Will mix the three algorithms , This is because Join Conditions can contain both equivalent and non equivalent connections , for example t1.colA = t2.colB AND t1.colC > t2.colC
Hash Join
It's going on Join Order To optimize the , The optimizer adjusts the order of the left and right tables , Usually put the small watch on the right , The big watch is on the left , And choose Join Pattern :Shuffle Join( According to the relevant conditions , meanwhile shuffle Left table and right table , And then calculate Join) or Boradcast Join( Broadcast the right table to the node where the left table is located , Note that the left watch does not move , And then calculate Join). Generally, the choice is based on the cost Join Order Optimize , But considering that the statistical information may have errors , So many databases can be accessed through Hint、Query Option Methods such as , It's up to the user to specify Join The order 、Join Patterns, etc .
Hash Join Is the most commonly used Join Algorithm , Most databases implement Hash Join. This algorithm will read the right table first , And put the data of the right table into Hash Map in , If you can't save it, it will be put into external storage . Usually , Each database will implement its own Hash Map, Rarely used directly STL or Boost Etc. in the third-party library Hash Map, There are two main reasons :
- Customized Hash Map Will improve Join Calculation speed .
- Customized Hash Map More accurate control of memory usage , When out of memory , Can use external memory , Customized Hash Map According to Join Algorithm , Optimize Swap Mechanism , Reduce Swap The amount of data .Hash Map The structure is as follows :

The right table may contain duplicate data , So there will be Duplicate Node. The duplicate data here refers to Join Key(Join The column corresponding to the condition ) Duplicate data for , And the other columns do not repeat , So cache them separately . Notice in the picture above , It's through Hash Algorithmic solution Hash The question of conflict , That is, it will not put different Join Key In the same bucket . Of course , In practice, there are also different Join Key In the same barrel , That requires traversal List To determine what to look for Join Key Whether there is .
Merge Join
Merge Join It is generally used when the data of the left table and the right table are in order . For example, temporal database TDengine, The data is sorted by timestamp , Then use the timestamp column to do Join when ,TDengine database Will use Merge Join To calculate , One advantage of this is that the processing speed is very fast , And the memory consumption is very small .
Nestloop Join
such Join The algorithm is very slow , But it is indispensable for the full function database . When using this algorithm , It can be combined with indexing to speed up .
In conclusion ,Hash Join Most widely used , It is applicable to many data analysis scenarios , And most databases support ;Merge Join It is generally used when the data in the left and right tables are in order , No need to cache data , So very little memory is used , There are three calculation speeds Join The fastest algorithm ;Nestloop Join Poor performance , Distributed databases are rarely used , Some distributed databases do not support , Can be accelerated by indexing Nestloop Join.
At the end
Above, we pair subquery and Join Two kinds of complexity SQL The implementation method of , You can combine the source code of some open source databases to understand , image TDengine The source code can be found in GitHub See above , If you have a complex time series database SQL Interested in implementing , This is a good object to observe . You are also welcome to communicate in the comment area below .
Want to know more TDengine Database Specific details of , Welcome to GitHub View the relevant source code on .
边栏推荐
- 微信小程序获取住户地区信息
- 从“化学家”到开发者,从甲骨文到 TDengine,我人生的两次重要抉择
- E-commerce apps are becoming more and more popular. What are the advantages of being an app?
- Android 隐私沙盒开发者预览版 3: 隐私安全和个性化体验全都要
- Kotlin introductory notes (III) kotlin program logic control (if, when)
- Why do offline stores need cashier software?
- SQL learning group by multi table grouping scenario
- Wxml template syntax
- 测试老鸟浅谈unittest和pytest的区别
- C form click event did not respond
猜你喜欢

What should we pay attention to when developing B2C websites?

How to choose the right chain management software?

Svg optimization by svgo

Using request headers to develop multi terminal applications

Understanding of smt32h7 series DMA and DMAMUX

What should we pay attention to when entering the community e-commerce business?

解决Navicat激活、注册时候出现No All Pattern Found的问题

LeetCode 556. 下一个更大元素 III
![[sourcetree configure SSH and use]](/img/9a/1cd4ca29e5b7a3016ed6d5dc1abbef.png)
[sourcetree configure SSH and use]

OpenGL - Lighting
随机推荐
小程序启动性能优化实践
微信小程序获取住户地区信息
干货整理!ERP在制造业的发展趋势如何,看这一篇就够了
Unity skframework framework (XXIII), minimap small map tool
SQL learning - case when then else
【el-table如何禁用】
STM32 simple multi-level menu (array table lookup method)
Unity skframework framework (XXII), runtime console runtime debugging tool
Wechat applet obtains household area information
A detailed explanation of the general process and the latest research trends of map comparative learning (gnn+cl)
【js 根据对象数组中的属性进行排序】
Wxml template syntax
How to improve the operation efficiency of intra city distribution
Global configuration tabbar
揭秘百度智能测试在测试自动执行领域实践
What should we pay attention to when entering the community e-commerce business?
TDengine可通过数据同步工具 DataX读写
【组队 PK 赛】本周任务已开启 | 答题挑战,夯实商品详情知识
如何正确的评测视频画质
LeetCode 496. Next larger element I