当前位置:网站首页>How to implement complex SQL such as distributed database sub query and join?
How to implement complex SQL such as distributed database sub query and join?
2022-07-05 09:43:00 【Tdengine】
author | Liu Yao
edit | Er Yue
Small T Reading guide : Using or implementing distributed databases (Distributed Database) when , The data of a table will be distributed to each database node according to a certain strategy , What follows is the complexity of multi node data query , for example Join And subquery . This article will explain distributed database sub queries and Join Etc SQL How to achieve , To help you better solve the above problems .
First of all, let's briefly talk about SQL Implementation process of :
SQL ==> Parser ==> Translate & Semantic Check ==> Optimizer ==> Coordinator ==> Executer
- Parser The result is a syntax tree , namely Abstract Syntax Tree;
- Translate & Semantic Check, This step will start from Catalog Read metadata , Improve the syntax tree with metadata , Easy Optimizer Use . for example : common select * from tableA, In this step, I usually put “*” Switch to tableA The column of ;
- Optimizer The optimized logical execution plan is generated , namely Optimized Logical Plan, The execution plan is a directed acyclic graph , namely DAG;
- Coordinator Be responsible for distributing the logical execution plan to each node for calculation ;
- Executer Will convert the logical execution plan into the physical execution plan , namely Physical Plan.
There are many open source databases , We can combine the source code of some mainstream databases to understand subqueries and Join How to implement , For example, relational database :Impala、Presto、ClickHouse, Time series database (Time- Series Database): TDengine etc. . The following sub queries and Join Two parts are analyzed .
Sub query part
There are many logical execution plans Node, Corresponding to SQL Various calculations in , Include Scan Node、Join Node、Aggregate Node、Sort Node、Project Node wait , The operator of the corresponding physical execution plan is Scan Operator 、Join Operator、Aggregate Operator、Sort Operator、Project Operator wait . The database generally does not have an operator to calculate sub queries , This is because after converting the abstract syntax tree into a logical execution plan , There is no concept of subquery , Its operation logic is that data operators are transferred layer by layer from bottom to top , And calculate layer by layer , Subqueries are not specifically calculated . Let's talk about some related processing of sub query in distributed database .
First , The optimizer of distributed database will flatten the subquery , This method is generally divided into two types , One is directly in the syntax tree (AST) Flatten subqueries on (Subquery Flatten), The other is to flatten the logic execution plan . The two approaches are essentially the same , Ensure semantic equivalence . But not all subqueries can be flattened , There are several special cases as follows :
- Both the child query and the parent query have aggregate functions
- Subqueries have aggregate functions , And the parent query has group calculation (Group By)
- Subqueries have aggregate functions , And the results of the sub query aggregation function are used to correlate (Join) Table of parent query
- The parent query has an aggregate function , And the sub query has group calculation (Group By)
- Subquery has Limit( Limit the number of rows that return results ), And the parent query has filter conditions (Where) Or group calculation 、 Sort (Order By)
- other
be based on AST When the subquery is flattened , You need to traverse the syntax data first , And judge according to the rules , And then eliminate unnecessary subqueries . Flatten the subquery when generating the logical execution plan , It's generating Plan Node You need to remove the redundant first Node, for instance ,SQL:select colA from (select * from tA) group by colA;
Generally speaking , The logical execution plan will have multiple sub plans , Sub plans are usually generated when network transmission is required , It should be noted that there is no necessary connection between sub plans and sub queries , That is, a sub query does not necessarily correspond to a sub plan .
Join part
First , The distributed database will be used to Join To optimize , Include Join eliminate ( For example, based on the primary key and foreign key, unnecessary Join)、 External connection elimination (Outer Join Turn into Inner Join)、Join Order Optimize ( Statistics based on data , Dynamic programming algorithm 、 Greedy algorithm or genetic algorithm, etc Table Of Join The order ) wait .
Tell me more Join Three basic algorithms :Hash Join( There must be equivalent connection conditions , for example t1.colA = t2.colB)、Merge Join( The data of the left table and the right table are ordered , Ordered by the columns in the join condition )、Nestloop Join( Contains non equivalent connection conditions and the data is out of order ). In practice , Will mix the three algorithms , This is because Join Conditions can contain both equivalent and non equivalent connections , for example t1.colA = t2.colB AND t1.colC > t2.colC
Hash Join
It's going on Join Order To optimize the , The optimizer adjusts the order of the left and right tables , Usually put the small watch on the right , The big watch is on the left , And choose Join Pattern :Shuffle Join( According to the relevant conditions , meanwhile shuffle Left table and right table , And then calculate Join) or Boradcast Join( Broadcast the right table to the node where the left table is located , Note that the left watch does not move , And then calculate Join). Generally, the choice is based on the cost Join Order Optimize , But considering that the statistical information may have errors , So many databases can be accessed through Hint、Query Option Methods such as , It's up to the user to specify Join The order 、Join Patterns, etc .
Hash Join Is the most commonly used Join Algorithm , Most databases implement Hash Join. This algorithm will read the right table first , And put the data of the right table into Hash Map in , If you can't save it, it will be put into external storage . Usually , Each database will implement its own Hash Map, Rarely used directly STL or Boost Etc. in the third-party library Hash Map, There are two main reasons :
- Customized Hash Map Will improve Join Calculation speed .
- Customized Hash Map More accurate control of memory usage , When out of memory , Can use external memory , Customized Hash Map According to Join Algorithm , Optimize Swap Mechanism , Reduce Swap The amount of data .Hash Map The structure is as follows :
The right table may contain duplicate data , So there will be Duplicate Node. The duplicate data here refers to Join Key(Join The column corresponding to the condition ) Duplicate data for , And the other columns do not repeat , So cache them separately . Notice in the picture above , It's through Hash Algorithmic solution Hash The question of conflict , That is, it will not put different Join Key In the same bucket . Of course , In practice, there are also different Join Key In the same barrel , That requires traversal List To determine what to look for Join Key Whether there is .
Merge Join
Merge Join It is generally used when the data of the left table and the right table are in order . For example, temporal database TDengine, The data is sorted by timestamp , Then use the timestamp column to do Join when ,TDengine database Will use Merge Join To calculate , One advantage of this is that the processing speed is very fast , And the memory consumption is very small .
Nestloop Join
such Join The algorithm is very slow , But it is indispensable for the full function database . When using this algorithm , It can be combined with indexing to speed up .
In conclusion ,Hash Join Most widely used , It is applicable to many data analysis scenarios , And most databases support ;Merge Join It is generally used when the data in the left and right tables are in order , No need to cache data , So very little memory is used , There are three calculation speeds Join The fastest algorithm ;Nestloop Join Poor performance , Distributed databases are rarely used , Some distributed databases do not support , Can be accelerated by indexing Nestloop Join.
At the end
Above, we pair subquery and Join Two kinds of complexity SQL The implementation method of , You can combine the source code of some open source databases to understand , image TDengine The source code can be found in GitHub See above , If you have a complex time series database SQL Interested in implementing , This is a good object to observe . You are also welcome to communicate in the comment area below .
Want to know more TDengine Database Specific details of , Welcome to GitHub View the relevant source code on .
边栏推荐
- Community group buying has triggered heated discussion. How does this model work?
- LeetCode 31. Next spread
- uni-app---uni. Navigateto jump parameter use
- Talking about the difference between unittest and pytest
- C language - input array two-dimensional array a from the keyboard, and put 3 in a × 5. The elements in the third column of the matrix are moved to the left to the 0 column, and the element rows in ea
- Kotlin introductory notes (I) kotlin variables and non variables
- Resolve the horizontal (vertical) sliding conflict between viewpager and WebView
- Viewpager pageradapter notifydatasetchanged invalid problem
- 【两个对象合并成一个对象】
- E-commerce apps are becoming more and more popular. What are the advantages of being an app?
猜你喜欢
Unity skframework framework (XXII), runtime console runtime debugging tool
An article takes you into the world of cookies, sessions, and tokens
Android 隐私沙盒开发者预览版 3: 隐私安全和个性化体验全都要
Using request headers to develop multi terminal applications
How to empty uploaded attachments with components encapsulated by El upload
一篇文章带你走进cookie,session,Token的世界
TDengine可通过数据同步工具 DataX读写
How do enterprises choose the appropriate three-level distribution system?
The research trend of map based comparative learning (gnn+cl) in the top paper
植物大战僵尸Scratch
随机推荐
正式上架!TDengine 插件入驻 Grafana 官网
Shutter uses overlay to realize global pop-up
[JS sort according to the attributes in the object array]
【el-table如何禁用】
【对象数组的排序】
【两个对象合并成一个对象】
LeetCode 503. 下一个更大元素 II
Lepton 无损压缩原理及性能分析
【对象数组a与对象数组b取出id不同元素赋值给新的数组】
Lepton 无损压缩原理及性能分析
C语言-从键盘输入数组二维数组a,将a中3×5矩阵中第3列的元素左移到第0列,第3列以后的每列元素行依次左移,原来左边的各列依次绕到右边
Applet customization component
TDengine ×英特尔边缘洞见软件包 加速传统行业的数字化转型
Principle and performance analysis of lepton lossless compression
一文读懂TDengine的窗口查询功能
Wxml template syntax
从“化学家”到开发者,从甲骨文到TDengine,我人生的两次重要抉择
解决Navicat激活、注册时候出现No All Pattern Found的问题
Node の MongoDB Driver
Viewpager pageradapter notifydatasetchanged invalid problem