SQL performance optimization can really leave you staring helplessly
2022-07-05 06:07:00 [Unknown architect]
Many big data computations are implemented in SQL, and when they run slowly you have to optimize the SQL. Yet we often run into situations that leave us staring helplessly.
For example, a stored procedure contains three statements roughly like the following, and they execute very slowly:
select a,b,sum(x) from T where … group by a,b;
select c,d,max(y) from T where … group by c,d;
select a,c,avg(y),min(z) from T where … group by a,c;
Here T is a huge table with hundreds of millions of rows that must be grouped in three different ways, even though none of the grouped result sets is large.
A grouping operation has to traverse the whole table, so these three SQL statements traverse the big table three times. Traversing hundreds of millions of rows once already takes a long time, let alone three times.
In this kind of grouping operation, CPU time is almost negligible compared with the time spent reading the disk. If the aggregates of all three groupings could be computed in a single traversal, the CPU workload would not shrink, but the amount of data read from disk would drop sharply, speeding things up severalfold.
If SQL supported syntax like this:
from T                                          -- the data comes from table T
select a,b,sum(x) where … group by a,b          -- first grouping during the traversal
select c,d,max(y) where … group by c,d          -- second grouping during the traversal
select a,c,avg(y),min(z) where … group by a,c;  -- third grouping during the traversal
and could return multiple result sets at once, performance could be improved dramatically.
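Outside SQL, the one-traversal idea is easy to express. Below is a minimal Python sketch, with hypothetical column names, rows as plain dicts, and the three per-statement WHERE filters omitted for brevity:

```python
from collections import defaultdict

def multi_group_one_pass(rows):
    """Compute three grouped aggregates over a single traversal of rows.

    Each row is a dict with keys a, b, c, d, x, y, z. Returns three
    structures, one per grouping, mirroring the three SELECT statements.
    """
    g1 = defaultdict(float)                    # (a, b) -> sum(x)
    g2 = {}                                    # (c, d) -> max(y)
    g3 = defaultdict(lambda: [0.0, 0, None])   # (a, c) -> [sum(y), count, min(z)]

    for r in rows:                             # the big table is read only once
        g1[(r["a"], r["b"])] += r["x"]

        k2 = (r["c"], r["d"])
        if k2 not in g2 or r["y"] > g2[k2]:
            g2[k2] = r["y"]

        acc = g3[(r["a"], r["c"])]
        acc[0] += r["y"]
        acc[1] += 1
        acc[2] = r["z"] if acc[2] is None else min(acc[2], r["z"])

    # finalize avg(y) and min(z) per (a, c) group
    avg_min = {k: (s / n, mn) for k, (s, n, mn) in g3.items()}
    return dict(g1), g2, avg_min
```

Each row is touched exactly once and only three small accumulators live in memory, which is why the disk traffic, not the CPU work, is what drops.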
Unfortunately, SQL has no such syntax and such a statement cannot be written. The only alternative is a workaround: first compute a more detailed grouping over a,b,c,d and save it as a temporary table, then compute the target results from it with further SQL, as follows:
create table T_temp as select a,b,c,d,
  sum(case when … then x else 0 end) sumx,
  max(case when … then y else null end) maxy,
  sum(case when … then y else 0 end) sumy,
  count(case when … then 1 else null end) county,
  min(case when … then z else null end) minz
from T group by a,b,c,d;
select a,b,sum(sumx) from T_temp where … group by a,b;
select c,d,max(maxy) from T_temp where … group by c,d;
select a,c,sum(sumy)/sum(county),min(minz) from T_temp where … group by a,c;
This traverses T only once, but the different WHERE conditions have to be pushed into the CASE WHEN expressions, making the code much more complex and the computation heavier. Worse, because the temporary table is grouped by more fields, its result set can be very large; that table is then traversed several more times, and grouping a large result set also needs disk buffering, so performance is still poor.
You could instead use a stored-procedure cursor to fetch the rows one by one and compute everything yourself, but then all of the WHERE and GROUP BY logic has to be hand-written, which is far too cumbersome, and traversing data through a database cursor performs even worse!
All you can do is stare!
TopN operations run into the same helplessness. For example, a top-5 query written in Oracle SQL looks roughly like this:
select * from (select x from T order by x desc) where rownum<=5;
Suppose T holds a billion rows. Judging from the statement, all of them are sorted before the top 5 are taken, and the rest of the sort output is thrown away! A big sort is very expensive: the data cannot fit in memory, so there are multiple rounds of swapping to disk, and performance is terrible.
Avoiding the big sort is actually not hard. Keep a small set of five records in memory; while traversing the data, maintain the five largest values seen so far in this small set. When a new value beats the current fifth place, insert it and discard the old fifth; otherwise do nothing. This way the billion rows are traversed only once, memory consumption is tiny, and performance improves greatly.
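That small in-memory set is exactly a bounded min-heap. A minimal Python sketch of the idea (generic comparable values, n defaulting to 5):

```python
import heapq

def top_n(values, n=5):
    """Keep only the n largest values while streaming through the data once.

    A min-heap of size n replaces the full sort: each incoming value is
    compared against the current n-th largest and either inserted or skipped.
    """
    heap = []                      # min-heap holding the current top n
    for v in values:
        if len(heap) < n:
            heapq.heappush(heap, v)
        elif v > heap[0]:          # beats the current n-th place
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)
```

The cost is O(N log n) comparisons with no disk activity beyond the single scan, versus O(N log N) plus disk spills for the full ordering.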
The essence of this algorithm is to treat TopN as an aggregation, just like sum or count, except that it returns a set rather than a single value. If SQL could be written like this, the big sort could be avoided:
select top(x,5) from T
Unfortunately, SQL has no explicit set data type and aggregate functions can only return single values, so such a statement cannot be written!
Fortunately, whole-set TopN is relatively simple. Even though the SQL is written the sort-first way, databases usually optimize it in engineering practice and use the method above to avoid the big sort, so Oracle actually runs that statement reasonably fast.
However, when the TopN situation is more complicated, inside a subquery or mixed with a JOIN, the optimization engine usually gives up. For example, computing each group's TopN after grouping is already a little hard to express in SQL; in Oracle it is written like this:
select * from (select y,x,row_number() over (partition by y order by x desc) rn from T) where rn<=5;
At this point the database optimization engine gets dizzy and no longer applies the treat-TopN-as-aggregation trick. It has to do the sort, and execution speed plummets!
If grouped TopN could be written in SQL like this:
select y,top(x,5) from T group by y
treating top as an aggregate function just like sum, it would not only be easier to read but also easy to compute at high speed.
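Outside SQL, top-n-as-aggregate extends naturally to groups. A Python sketch with one bounded min-heap per group and hypothetical (y, x) pair rows:

```python
import heapq
from collections import defaultdict

def grouped_top_n(rows, n=5):
    """Treat top-n as an aggregate: one bounded min-heap per group key y.

    rows is an iterable of (y, x) pairs; returns {y: [n largest x, descending]}.
    The data is traversed once and no global sort is performed.
    """
    heaps = defaultdict(list)
    for y, x in rows:
        h = heaps[y]
        if len(h) < n:
            heapq.heappush(h, x)
        elif x > h[0]:             # beats this group's current n-th place
            heapq.heapreplace(h, x)
    return {y: sorted(h, reverse=True) for y, h in heaps.items()}
```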
Unfortunately, it cannot.
Still staring!
Join calculations are also very common. Take filtering an orders table after joining it with several small tables; the SQL looks roughly like this:
select o.oid,o.orderdate,o.amount
from orders o
left join city ci on o.cityid = ci.cityid
left join shipper sh on o.shid=sh.shid
left join employee e on o.eid=e.eid
left join supplier su on o.suid=su.suid
where ci.state='New York'
and e.title='manager'
and ...
The orders table has tens of millions of rows, while the city, shipper, employee, supplier and similar tables are all small. The filter-condition fields may come from any of these tables, and they arrive as dynamic parameters from the front end.
SQL generally implements these joins with the HASH JOIN algorithm, computing and comparing hash values. Only one JOIN can be resolved at a time, so N joins take N passes, and after each join the intermediate result must be kept for the next round. The process is complex, the data is traversed many times, and performance is poor.
These associated code tables are usually small and can be read into memory first. If every association field in the orders table were converted to a sequence number in advance, for example turning the employee id into the row number of the corresponding record in the employee table, then at query time that value could directly locate the in-memory employee record by position. This is much faster than HASH JOIN, the orders table is traversed only once, and the speedup is obvious!
In other words, you would like to write the SQL as:
select o.oid,o.orderdate,o.amount
from orders o
left join city ci on o.cityid = ci.#   -- the order's city id joins the city table by row number #
left join shipper sh on o.shid=sh.#    -- the order's shipper id joins the shipper table by row number #
left join employee e on o.eid=e.#      -- the order's employee id joins the employee table by row number #
left join supplier su on o.suid=su.#   -- the order's supplier id joins the supplier table by row number #
where ci.state='New York'
and e.title='manager'
and ...
Unfortunately, SQL is founded on the concept of unordered sets. Even when the keys have already been turned into sequence numbers, the database cannot exploit that fact: the fast positional-access mechanism cannot be used on the unordered sets behind the associated tables, so it can only search with an index, and because it does not know the numbers are sequence numbers it still computes and compares hash values. Performance stays very poor!
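The positional-access idea is trivial to state outside SQL. A Python sketch with hypothetical table layouts: lists of dicts for the in-memory code tables, foreign keys already converted to row positions, and two of the filter conditions hard-coded:

```python
def seq_join_filter(orders, city, employee):
    """Filter orders by joining through precomputed sequence numbers.

    Each order's cityid/eid is already the row position of the matching
    record in the in-memory city/employee lists, so every 'join' is a
    direct list index instead of a hash lookup.
    """
    result = []
    for o in orders:                   # single traversal of the fact table
        ci = city[o["cityid"]]         # O(1) positional access, no hashing
        e = employee[o["eid"]]
        if ci["state"] == "New York" and e["title"] == "manager":
            result.append((o["oid"], o["orderdate"], o["amount"]))
    return result
```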
A good method exists but cannot be implemented. All you can do is stare again!
Then there are highly concurrent account queries. The operation itself is very simple:
select id,amt,tdate,… from T
where id='10100'
and tdate>=to_date('2021-01-10','yyyy-mm-dd')
and tdate<to_date('2021-01-25','yyyy-mm-dd')
and …
The task is to quickly find the few to few-thousand detail rows of one account among the hundreds of millions of historical rows in T. The SQL is not hard to write; the difficulty is that under heavy concurrency the response time must reach the second level or better. To speed up the query, an index is usually built on T's id field:
create index index_T_1 on T(id)
In the database, finding a single account through the index is fast in isolation, but under heavy concurrency it slows down noticeably. The reason is again the unordered-set foundation of SQL mentioned above: the data volume is too large to fit in memory, and the database cannot guarantee that one account's rows are physically stored contiguously. Since the disk has a minimum read unit, reading discontinuous data drags in a lot of irrelevant content and slows the query. With every one of many concurrent queries a little slow, overall performance becomes very poor. In an era when user experience matters so much, who dares to make users wait more than ten seconds?!
The obvious remedy is to sort the hundreds of millions of rows by account in advance, guaranteeing that each account's data is stored contiguously. Then nearly every disk block read during a query contains target data, and performance improves greatly.
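The effect of contiguous storage can be sketched in Python: sort once by account, then binary-search the account's contiguous block instead of probing scattered rows. Hypothetical (id, tdate, amt) tuples; ISO date strings compare correctly as text:

```python
from bisect import bisect_left, bisect_right

def build(records):
    """Sort account detail records by id once, so each account's rows end up
    stored contiguously; return the sorted list plus a parallel key list."""
    data = sorted(records, key=lambda r: r[0])   # (id, tdate, amt) tuples
    keys = [r[0] for r in data]
    return data, keys

def query(data, keys, acct, lo_date, hi_date):
    """Locate one account's contiguous block by binary search, then filter
    by date inside that small block only."""
    i, j = bisect_left(keys, acct), bisect_right(keys, acct)
    return [r for r in data[i:j] if lo_date <= r[1] < hi_date]
```

The expensive sort happens once, offline; every query then touches only the one small contiguous slice that belongs to the account.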
However, relational databases built on SQL have no such awareness and do not enforce the physical order of storage. The problem is caused not by SQL's syntax but by its theoretical foundations, and these algorithms still cannot be implemented in a relational database.
So what can be done? Just keep staring?
Stop using SQL and relational databases, and use another computing engine instead.
Open-source esProc SPL, built on an innovative theoretical foundation, supports richer data types and operations and can describe the new algorithms in the scenarios above. Writing the code in simple, convenient SPL can greatly improve computing performance in a short time.
SPL code examples for the problems above follow.
- Computing multiple groupings in one traversal

| | A | B |
|---|---|---|
| 1 | =file("T.ctx").open().cursor(a,b,c,d,x,y,z) | |
| 2 | cursor A1 | =A2.select(…).groups(a,b;sum(x)) |
| 3 | // define the first filter and grouping in the traversal | |
| 4 | cursor | =A4.select(…).groups(c,d;max(y)) |
| 5 | // define the second filter and grouping in the traversal | |
| 6 | cursor | =A6.select(…).groupx(a,c;avg(y),min(z)) |
| 7 | // define the third filter and grouping in the traversal | |
| 8 | … | // definitions finished; now compute the three filters and groupings |
- Computing Top5 by aggregation

Whole-set Top5 (multithreaded parallel computing):

| | A |
|---|---|
| 1 | =file("T.ctx").open() |
| 2 | =A1.cursor@m(x).total(top(-5,x),top(5,x)) |
| 3 | // top(-5,x) computes the 5 largest values of x; top(5,x) the 5 smallest |
Grouped Top5 (multithreaded parallel computing):

| | A |
|---|---|
| 1 | =file("T.ctx").open() |
| 2 | =A1.cursor@m(x,y).groups(y;top(-5,x),top(5,x)) |
- SPL code for joining by sequence number:

System initialization:

| | A |
|---|---|
| 1 | >env(city,file("city.btx").import@b()),env(employee,file("employee.btx").import@b()),… |
| 2 | // at system initialization, read the several small tables into memory |

Query:

| | A |
|---|---|
| 1 | =file("orders.ctx").open().cursor(cid,eid,…).switch(cid,city:#;eid,employee:#;…) |
| 2 | =A1.select(cid.state=="New York" && eid.title=="manager"…) |
| 3 | // join by sequence number first, then reference the joined tables' fields in the filter conditions |
- SPL code for highly concurrent account queries:

Data preprocessing, ordered storage:

| | A | B |
|---|---|---|
| 1 | =file("T-original.ctx").open().cursor(id,tdate,amt,…) | |
| 2 | =A1.sortx(id) | =file("T.ctx") |
| 3 | =B2.create@r(#id,tdate,amt,…).append@i(A2) | |
| 4 | =B2.open().index(index_id;id) | |
| 5 | // sort the original data, save it as a new table, and build an index on the account field |

Account query:

| | A |
|---|---|
| 1 | =T.icursor(;id==10100 && tdate>=date("2021-01-10") && tdate<date("2021-01-25") && …,index_id).fetch() |
| 2 | // the query code is very simple |
Beyond these simple examples, SPL implements many more high-performance algorithms: ordered merge for joining orders with their details, pre-association for multi-layer dimension-table joins in multidimensional analysis, bit storage for statistics over thousands of tags, boolean sets to accelerate filters on many enumerated values, and sequence-based grouping for complex funnel analysis, among others.
Partners troubled by SQL performance optimization are welcome to discuss with us:
http://www.raqsoft.com.cn/wx/Query-run-batch-ad.html