SQL performance optimization often leaves you staring helplessly
2022-06-29 03:37:00 【Not just Chen】
Many big data computations are implemented in SQL. When they run slowly, the SQL has to be optimized, but we often run into situations where all we can do is stare helplessly.
For example, a stored procedure contains three statements roughly like these, and they execute very slowly:
select a,b,sum(x) from T where … group by a,b;
select c,d,max(y) from T where … group by c,d;
select a,c,avg(y),min(z) from T where … group by a,c;
Here T is a huge table with hundreds of millions of rows. It is grouped in three different ways, and none of the grouped result sets is large.
A grouping operation has to traverse the data table, so these three SQL statements traverse the big table three times. Traversing hundreds of millions of rows once already takes a long time, let alone three times.
In this kind of grouping operation, CPU time is almost negligible compared with the time spent reading the hard disk. If several grouped summaries could be computed in a single traversal, the amount of CPU computation would not decrease, but the amount of data read from disk would drop sharply, and the speed could improve several times over.
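The single-traversal idea can be sketched in Python. This is only an illustration, not how any database or SPL implements it: the function name `multi_group_one_pass` and the dict-based row format are assumptions, and the three WHERE conditions (elided in the text) are left out.

```python
from collections import defaultdict

def multi_group_one_pass(rows):
    """Compute three grouped aggregates in a single traversal of `rows`.

    Each row is a dict with the columns a, b, c, d, x, y, z used in the text.
    (Hypothetical helper; the WHERE filters from the text are omitted.)
    """
    sum_x = defaultdict(int)   # (a, b) -> sum(x)
    max_y = {}                 # (c, d) -> max(y)
    stats = {}                 # (a, c) -> [sum(y), count(y), min(z)]
    for r in rows:             # the only pass over the data
        sum_x[(r["a"], r["b"])] += r["x"]

        key = (r["c"], r["d"])
        if key not in max_y or r["y"] > max_y[key]:
            max_y[key] = r["y"]

        key = (r["a"], r["c"])
        s = stats.get(key)
        if s is None:
            stats[key] = [r["y"], 1, r["z"]]
        else:
            s[0] += r["y"]
            s[1] += 1
            s[2] = min(s[2], r["z"])
    # avg(y) is derived from the running sum and count at the end
    avg_min = {k: (s[0] / s[1], s[2]) for k, s in stats.items()}
    return dict(sum_x), max_y, avg_min
```

All three result dictionaries stay small because the grouped result sets are small, so the cost is dominated by reading the rows once.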
If SQL supported syntax like this:
from T -- the data comes from table T
select a,b,sum(x) where … group by a,b -- the first grouping in the traversal
select c,d,max(y) where … group by c,d -- the second grouping in the traversal
select a,c,avg(y),min(z) where … group by a,c; -- the third grouping in the traversal
and could return multiple result sets at once, performance would improve greatly.
Unfortunately, SQL has no such syntax, so a statement like this cannot be written. The only workaround is to first compute a finer-grained grouped result set with group by a,b,c,d, save it as a temporary table, and then compute the target results from it with further SQL. The SQL is roughly as follows:
create table T_temp as select a,b,c,d,
sum(case when … then x else 0 end) sumx,
max(case when … then y else null end) maxy,
sum(case when … then y else 0 end) sumy,
count(case when … then 1 else null end) county,
min(case when … then z else null end) minz
from T
group by a,b,c,d;
select a,b,sum(sumx) from T_temp where … group by a,b;
select c,d,max(maxy) from T_temp where … group by c,d;
select a,c,sum(sumy)/sum(county),min(minz) from T_temp where … group by a,c;
This way T is traversed only once, but the different WHERE conditions have to be moved into the case when expressions, which makes the code much more complex and increases the amount of computation. Moreover, the temporary table is grouped by more fields, so its result set can be very large, and it is then traversed several times, so the computation is still not fast. Grouping into a large result set also needs the hard disk for buffering, and that performs poorly as well.
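The workaround can be tried on a toy table with Python's built-in `sqlite3` module. This is a minimal sketch under assumptions: the sample rows are invented, the function name `rollup_vs_direct` is hypothetical, and the elided WHERE conditions (and therefore the case when expressions) are dropped, so only the roll-up of avg and min is shown.

```python
import sqlite3

def rollup_vs_direct():
    """Compare avg/min computed directly on T with the same values
    rolled up from a finer-grained temporary table (toy data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table T (a, b, c, d, x, y, z)")
    conn.executemany(
        "insert into T values (?,?,?,?,?,?,?)",
        [
            (1, 1, 1, 1, 10, 5, 3),
            (1, 1, 2, 1, 20, 7, 1),
            (2, 1, 1, 1, 5, 9, 4),
            (1, 2, 1, 1, 7, 2, 8),
        ],
    )
    # One traversal of T builds the finer-grained summary.
    conn.execute(
        """
        create table T_temp as
        select a, b, c, d,
               sum(x) as sumx, max(y) as maxy,
               sum(y) as sumy, count(y) as county, min(z) as minz
        from T
        group by a, b, c, d
        """
    )
    direct = conn.execute(
        "select a, c, avg(y), min(z) from T group by a, c order by a, c"
    ).fetchall()
    # avg cannot be re-averaged, so it is rebuilt from sum and count.
    rolled = conn.execute(
        "select a, c, sum(sumy) * 1.0 / sum(county), min(minz) "
        "from T_temp group by a, c order by a, c"
    ).fetchall()
    conn.close()
    return direct, rolled
```

The sketch also shows why the temporary table needs both sumy and county: an average of averages would be wrong when the finer groups have different sizes.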
You can also use a database cursor in the stored procedure to fetch the rows one by one and compute on them, but then all the WHERE and GROUP logic has to be implemented by hand. That is far too cumbersome to write, and traversing data through a database cursor only performs worse!
All we can do is stare!
TopN operations run into the same helplessness. For example, a top 5 written in Oracle SQL looks roughly like this:
select * from (select x from T order by x desc) where rownum<=5
Table T has 1 billion rows. Judging from the SQL statement, all the data is sorted and then the first 5 rows are taken, and the rest of the sorted result is useless! A big sort is expensive, and the data is too large to fit in memory, so data is swapped to and from the hard disk many times, and computing performance will be very poor!
Avoiding the big sort is not hard. Keep a small set of 5 records in memory, and while traversing the data, keep the top 5 values seen so far in this small set. If a new value is larger than the current 5th place, insert it and discard the current 5th place; if it is smaller than the current 5th place, do nothing. This way the 1 billion rows only need to be traversed once, memory use is tiny, and computing performance improves greatly.
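The small-set method described above can be sketched with a min-heap from Python's standard `heapq` module. The function name `top_n` is an assumption for illustration; the heap root plays the role of the "current 5th place".

```python
import heapq

def top_n(values, n=5):
    """Return the n largest values in one pass, using O(n) memory."""
    heap = []  # min-heap of the n largest values seen so far
    for v in values:
        if len(heap) < n:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # v beats the current n-th place: replace it
            heapq.heapreplace(heap, v)
        # otherwise v is not in the top n, do nothing
    return sorted(heap, reverse=True)
```

Each row costs at most one comparison plus an O(log n) heap update, so no full sort of the data is ever needed.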
In essence this algorithm treats TopN as an aggregation like sum and count, except that it returns a set rather than a single value. If SQL could be written like this: select top(x,5) from T, the big sort could be avoided.
Unfortunately, SQL has no explicit set data type, and aggregate functions can only return a single value, so such a statement cannot be written!
Fortunately, TopN over the whole set is relatively simple. Even though the SQL is written that way, databases usually optimize it in practice and use the method above to avoid the big sort. So Oracle does not run that SQL slowly.
However, when the TopN situation is more complex, for example when it is used in a subquery or mixed with JOIN, the optimization engine usually stops working. For example, computing the TopN of each group after grouping is somewhat awkward to write in SQL. In Oracle SQL it is written like this:
select * from
(select y,x,row_number() over (partition by y order by x desc) rn from T)
where rn<=5
Now the database optimization engine gets confused and no longer uses the method above that treats TopN as an aggregation. It has to sort, and as a result the operation slows down sharply!
If grouped TopN in SQL could be written like this:
select y,top(x,5) from T group by y
with top treated as an aggregate function just like sum, it would not only be easier to read but also easy to compute at high speed.
Unfortunately, it cannot.
Still, all we can do is stare!
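Treating top as an aggregate inside a grouping, as wished for above, can be sketched in Python by keeping one small heap per group. The function name `group_top_n` and the `(y, x)` tuple row format are assumptions for illustration only.

```python
import heapq
from collections import defaultdict

def group_top_n(rows, n=5):
    """For rows of (y, x), return {y: the n largest x values} in one pass."""
    heaps = defaultdict(list)  # y -> min-heap of that group's n largest x
    for y, x in rows:
        heap = heaps[y]
        if len(heap) < n:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)
    return {y: sorted(h, reverse=True) for y, h in heaps.items()}
```

Memory use is proportional to the number of groups times n, not to the number of rows, which is why no per-partition sort is required.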
Join computation is also very common. Take filtering after an orders table is joined with several tables as an example. The SQL is roughly like this:
select o.oid,o.orderdate,o.amount
from orders o
left join city ci on o.cityid = ci.cityid
left join shipper sh on o.shid=sh.shid
left join employee e on o.eid=e.eid
left join supplier su on o.suid=su.suid
where ci.state='New York'
and e.title = 'manager'
and ...
The orders table has tens of millions of rows, while the city, shipper, employee, supplier and other tables are small. The filter fields may come from any of these tables, and the front end passes them to the back end as parameters, so they are dynamic.
SQL generally implements these joins with the HASH JOIN algorithm, which has to compute HASH values and compare them. Only one JOIN can be resolved at a time, so N JOINs take N passes. After each join, the intermediate result has to be kept for the next round. The process is complex, the data is traversed many times, and computing performance is poor.
Usually these joined code tables are small and can be read into memory first. Suppose each join field in the orders table is converted to sequence numbers in advance, for example the employee number field is converted to the sequence number of the corresponding record in the employee table. Then, during computation, the employee number field value (that is, the employee table sequence number) can be used to directly fetch the record at that position of the in-memory employee table. This is much faster than HASH JOIN, and the orders table only needs to be traversed once, so the speedup will be very noticeable!
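Joining by sequence number can be sketched in Python as plain list indexing. This is an illustration under assumptions: the function name `filter_orders`, the tuple layout `(oid, cid, eid, amount)`, the sample field names, and the 1-based sequence numbers are all hypothetical, and only two of the dimension tables are shown.

```python
def filter_orders(orders, city, employee, state, title):
    """Filter orders in one pass, resolving dimension rows by position.

    `orders` holds (oid, cid, eid, amount) tuples where cid and eid are
    1-based sequence numbers into the in-memory lists `city` and `employee`.
    """
    result = []
    for oid, cid, eid, amount in orders:
        ci = city[cid - 1]        # direct positional access, no hashing
        e = employee[eid - 1]
        if ci["state"] == state and e["title"] == title:
            result.append((oid, amount))
    return result
```

Every join becomes a constant-time index lookup, and all joins and filters are evaluated during a single traversal of the orders data.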
That is, the SQL could be written as follows:
select o.oid,o.orderdate,o.amount
from orders o
left join city ci on o.cid = ci.# -- the city number of the orders table is joined to the city table by sequence number #
left join shipper sh on o.shid=sh.# -- the shipper number of the orders table is joined to the shipper table by sequence number #
left join employee e on o.eid=e.# -- the employee number of the orders table is joined to the employee table by sequence number #
left join supplier su on o.suid=su.# -- the supplier number of the orders table is joined to the supplier table by sequence number #
where ci.state='New York'
and e.title = 'manager'
and ...
Unfortunately, SQL is built on the concept of unordered sets. Even if these numbers have been converted to sequence numbers, the database cannot exploit this. Positioning by sequence number cannot be used on the unordered sets of the joined tables, so only index lookup is available. Moreover, the database does not know that the numbers have been converted to sequence numbers, so it still computes HASH values and compares them, and performance remains poor!
There are good methods that cannot be implemented, so once again we can only stare!
Then there are high-concurrency account queries. The operation itself is very simple:
select id,amt,tdate,… from T
where id='10100'
and tdate>= to_date('2021-01-10', 'yyyy-MM-dd')
and tdate<to_date('2021-01-25', 'yyyy-MM-dd')
and …
From the hundreds of millions of historical rows in table T, the query has to quickly find the few to few thousand detail rows of one account. The SQL is not complicated to write; the difficulty is that the response time has to reach the second level or even faster under high concurrency. To improve query response speed, an index is usually built on the id field of table T:
create index index_T_1 on T(id)
In a database, finding a single account through an index is fast, but under high concurrency it slows down noticeably. The reason is again the unordered theoretical basis of SQL mentioned above. The total data volume is very large and cannot all be read into memory, and the database cannot guarantee that the data of one account is stored physically contiguously. The hard disk has a minimum read unit, so reading non-contiguous data pulls in a lot of irrelevant content, and the query slows down. When every query under high-concurrency access is a little slower, overall performance becomes very poor. At a time when user experience matters so much, who dares make users wait more than ten seconds?!
An easy idea is to sort the hundreds of millions of rows by account in advance so that the data of one account is stored contiguously. Then almost all the data blocks read from the hard disk during a query are target values, and performance improves greatly.
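The effect of pre-sorting can be sketched in memory with Python's standard `bisect` module. This is only an analogy for contiguous storage on disk: the function names `build_sorted` and `query`, the `(id, tdate, amt)` tuple layout, and sorting by date within each account (the text only requires sorting by account) are assumptions.

```python
from bisect import bisect_left

def build_sorted(records):
    """Sort (id, tdate, amt) records once so each account is contiguous."""
    data = sorted(records, key=lambda r: (r[0], r[1]))
    keys = [(r[0], r[1]) for r in data]  # search keys aligned with data
    return data, keys

def query(data, keys, acct, start, end):
    """Return one account's rows with start <= tdate < end as one slice."""
    lo = bisect_left(keys, (acct, start))
    hi = bisect_left(keys, (acct, end))
    return data[lo:hi]  # a single contiguous range, no scattered reads
```

Dates are assumed to be ISO strings such as "2021-01-10", which compare correctly as text. The query result is one contiguous slice, which mirrors reading only adjacent blocks from disk.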
However, relational databases based on SQL have no such awareness and do not enforce the physical order of data storage! This problem is not caused by SQL syntax; it also comes from the theoretical basis of SQL, and there is still no way to implement these algorithms in a relational database.
So what can be done? Just stare?
Stop using SQL and relational databases, and use a different computing engine.
The open-source esProc SPL is built on an innovative theoretical basis. It supports more data types and operations and can describe the new algorithms for the scenarios above. Code written in the simple and convenient SPL can greatly improve computing performance in a short time!
SPL code examples for the problems above are as follows:
Computing multiple groupings in one traversal
| | A | B |
|---|---|---|
| 1 | A1=file("T.ctx").open().cursor(a,b,c,d,x,y,z) | |
| 2 | cursor A1 | =A2.select(…).groups(a,b;sum(x)) |
| 3 | // Define the first filter and grouping in the traversal | |
| 4 | cursor | =A4.select(…).groups(c,d;max(y)) |
| 5 | // Define the second filter and grouping in the traversal | |
| 6 | cursor | =A6.select(…).groupx(a,c;avg(y),min(z)) |
| 7 | // Define the third filter and grouping in the traversal | |
| 8 | … | // End of definitions; start computing the three filters and groupings |
Computing Top5 as an aggregation
Top5 of the whole set (multithreaded parallel computing)
| | A |
|---|---|
| 1 | =file("T.ctx").open() |
| 2 | =A1.cursor@m(x).total(top(-5,x), top(5,x)) |
| 3 | // top(-5,x) computes the 5 largest values of x; top(5,x) computes the 5 smallest values of x. |
Grouped Top5 (multithreaded parallel computing)
| | A |
|---|---|
| 1 | =file("T.ctx").open() |
| 2 | =A1.cursor@m(x,y).groups(y;top(-5,x), top(5,x)) |
SPL code that joins by sequence number:
System initialization
| | A |
|---|---|
| 2 | >env(city,file("city.btx").import@b()),env(employee,file("employee.btx").import@b()),... |
| 3 | // At system initialization, the small tables are read into memory |
Query
| | A |
|---|---|
| 1 | =file("orders.ctx").open().cursor(cid,eid,…).switch(cid,city:#;eid,employee:#;…) |
| 2 | =A1.select(cid.state=="New York" && eid.title=="manager"…) |
| 3 | // First join by sequence number, then reference fields of the joined tables in the filter condition |
SPL code for high-concurrency account queries:
Data preprocessing, ordered storage
| | A | B |
|---|---|---|
| 1 | =file("T-original.ctx").open().cursor(id,tdate,amt,…) | |
| 2 | =A1.sortx(id) | =file("T.ctx") |
| 3 | =B2.create@r(#id,tdate,amt,…).append@i(A2) | |
| 4 | =B2.open().index(index_id;id) | |
| 5 | // Sort the original data, save it as a new table, and build an index on the account number | |
Account query
| | A | B |
|---|---|---|
| 1 | =T.icursor(;id==10100 && tdate>=date("2021-01-10") && tdate<date("2021-01-25") && …,index_id).fetch() | |
| 2 | // The query code is very simple | |
Apart from these simple examples, SPL can implement many more high-performance algorithms, for example ordered merge for joins between orders and order details, pre-association for multi-level dimension table joins in multidimensional analysis, bit-wise storage for statistics over thousands of tags, Boolean set techniques to speed up queries with filters on multiple enumeration values, and time-sequence grouping for complex funnel analysis.