Hands-On: How to Solve the MySQL Deep Pagination Problem
2022-07-29 06:51:00 【Ma Xiaoxiao】
When we implement paging requirements in daily work, we tend to reach for limit. But when the offset gets very large, query efficiency drops sharply. This article walks through four schemes for optimizing deep pagination over millions of rows of MySQL data, and ends with a hands-on SQL case from a recent production optimization.
Why does limit deep pagination get slow?
First, look at the structure of the table below:
CREATE TABLE account (
  id int(11) NOT NULL AUTO_INCREMENT COMMENT 'primary key id',
  name varchar(255) DEFAULT NULL COMMENT 'account name',
  balance int(11) DEFAULT NULL COMMENT 'balance',
  create_time datetime NOT NULL COMMENT 'creation time',
  update_time datetime NOT NULL ON UPDATE CURRENT_TIMESTAMP COMMENT 'update time',
  PRIMARY KEY (id),
  KEY idx_name (name),
  KEY idx_update_time (update_time) -- secondary index
) ENGINE=InnoDB AUTO_INCREMENT=1570068 DEFAULT CHARSET=utf8 ROW_FORMAT=REDUNDANT COMMENT='account table';
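If you want to reproduce the timings that follow, you need data at a comparable scale. The original does not show how its rows were generated; a throwaway stored procedure along these lines will do (a sketch — fill_account and its values are made up for illustration; batch the inserts or wrap them in a transaction if a million rows takes too long):

-- Disposable data generator for the account table above (illustrative only).
DELIMITER //
CREATE PROCEDURE fill_account(IN n INT)
BEGIN
    DECLARE i INT DEFAULT 0;
    WHILE i < n DO
        INSERT INTO account (name, balance, create_time, update_time)
        VALUES (CONCAT('user_', i), FLOOR(RAND() * 10000), NOW(), NOW());
        SET i = i + 1;
    END WHILE;
END //
DELIMITER ;
CALL fill_account(1000000);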
Suppose the deep-pagination SQL to be executed is as follows:
select id,name,balance from account where update_time> '2020-09-19' limit 100000,10;
This SQL takes 0.742 seconds to run. Why does deep pagination get slow? If we change it to limit 0,10, it takes only 0.006 seconds.
Let's look at how this SQL executes:
1. Through the ordinary secondary index tree idx_update_time, filter on the update_time condition and find the primary key IDs of the matching records.
2. With those IDs, go back to the primary key index tree, locate the full rows, and fetch the selected columns (the table lookup, or "back to the table").
3. Scan the 100010 rows that satisfy the condition, throw away the first 100000, and return the last 10.
The execution plan is as follows:
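You can check the plan on your own copy of the table; both the deep and the shallow page typically show the same range scan on idx_update_time, because the rows discarded by limit are a pure execution-time cost that never appears in the plan:

EXPLAIN select id, name, balance
from account
where update_time > '2020-09-19'
limit 100000, 10;
-- Typically: type=range, key=idx_update_time -- identical to limit 0,10.
-- The 100000 discarded rows only show up in the execution time.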
There are two reasons why this SQL is slow:
A limit statement scans offset+n rows, discards the first offset rows, and returns the last n rows. In other words, limit 100000,10 scans 100010 rows, while limit 0,10 scans only 10. And limit 100000,10 not only scans more rows, it also performs more table lookups.
Optimizing with a subquery
The SQL above goes back to the table 100010 times, yet we only need 10 rows of data, which means only 10 table lookups are actually necessary. So we can optimize by reducing the number of table lookups.
A review of the B+ tree structure
So, how do we reduce the number of table lookups? Let's first review the B+ tree index structure:
In InnoDB, indexes are divided into the primary key index (clustered index) and secondary indexes.
Primary key index: the leaf nodes store the entire row.
Secondary index: the leaf nodes store the primary key value.
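This difference is exactly what the optimization below exploits: because a secondary index leaf already carries the primary key, a query that selects only id can be answered from idx_update_time alone, with no table lookups at all (EXPLAIN shows "Using index"):

select id
from account
where update_time >= '2020-09-19'
limit 100000, 10;
-- Index-only scan: all 100010 scanned entries stay inside idx_update_time,
-- and none of them triggers a lookup on the primary key tree.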
Moving the condition onto the primary key index tree
If we move the query condition back onto the primary key index tree, we can cut down the number of table lookups. But on the primary key index tree, the condition has to become the primary key id. So what happens to the original update_time condition? It moves into a subquery. How? Since the leaf nodes of the secondary index contain the primary key ID, we can look up the primary key IDs directly through update_time, and move the limit 100000 condition into the subquery as well. The complete SQL is:
select id,name,balance FROM account where id >= (select a.id from account a where a.update_time >= '2020-09-19' limit 100000, 1) LIMIT 10;
(The update_time condition is dropped from the outer query here; you can add it back outside.)
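For completeness, a version with the time condition repeated outside (my reading of that note, not shown in the original) would look like this:

-- Sketch: the same subquery trick, with the update_time filter
-- also applied in the outer query.
select id, name, balance
FROM account
where id >= (select a.id
             from account a
             where a.update_time >= '2020-09-19'
             limit 100000, 1)
and update_time >= '2020-09-19'
LIMIT 10;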
The result is the same, and the execution time drops to just 0.038 seconds!
Let's look at the execution plan:
From the execution plan: the subquery on table a uses the idx_update_time index, obtaining the primary key IDs directly on the secondary index, which skips the table lookups; the outer query then simply reads 10 rows onward from the ID found by the subquery.
So this scheme works.
Delayed association with INNER JOIN
The idea behind delayed association is the same as that of the subquery: move the condition onto the primary key index tree, then reduce the table lookups. The difference is that delayed association uses an inner join instead of a subquery.
The optimized SQL is as follows:
SELECT acct1.id, acct1.name, acct1.balance
FROM account acct1
INNER JOIN (SELECT a.id FROM account a
            WHERE a.update_time >= '2020-09-19'
            ORDER BY a.update_time
            LIMIT 100000, 10) AS acct2
ON acct1.id = acct2.id;
This query is just as fast, taking only 0.034 seconds.
The execution plan is as follows:
The idea: first query the qualifying primary key IDs through the idx_update_time secondary index tree, then inner-join those IDs back to the original table on the primary key, so the outer query goes straight through the primary key index, again reducing the table lookups.
The tag-recording method
The root cause of the limit deep-pagination problem is that the larger the offset, the more rows MySQL scans and then abandons, which degrades query performance.
Instead, we can use the tag-recording method: record where the last query ended, and on the next query, continue scanning from that mark. It's like reading a book: instead of starting over, you fold the corner or leave a bookmark where you stopped, and next time you open straight to that spot.
Suppose the last record fetched had id 100000; the SQL can then be changed to:
select id,name,balance FROM account where id > 100000 order by id limit 10;
This way, no matter how many pages deep we go, performance stays good, because the query hits the id index. But this approach has a limitation: it requires a column similar to a continuous auto-increment, one that is unique and ordered so we can seek forward on it.
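If no such column exists, or you need to page by a non-unique column such as update_time, a common variant (my sketch, not from the original article) is to seek on the composite (update_time, id), with id as the tiebreaker; the boundary values below are hypothetical, remembered from the last row of the previous page:

select id, name, balance
from account
where update_time >= '2020-09-19 00:00:00'
and (update_time > '2020-09-19 00:00:00'
     or (update_time = '2020-09-19 00:00:00' and id > 100000))
order by update_time, id
limit 10;
-- The redundant update_time >= guard keeps the query inside the
-- idx_update_time range; the OR handles ties on update_time.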
Using between...and...
In many cases, a limit query can be converted into a query over known positions, so that MySQL can do a between...and range scan and return the corresponding result directly.
If we know the boundary values are 100000 and 100010, we can optimize it as:
select id,name,balance FROM account where id between 100000 and 100010 order by id;
A hands-on production case
Let's look at a practical case. Suppose a table with the following structure holds 2 million rows of data.
CREATE TABLE account (
  id varchar(32) COLLATE utf8_bin NOT NULL COMMENT 'primary key',
  account_no varchar(64) COLLATE utf8_bin NOT NULL DEFAULT '' COMMENT 'account number',
  amount decimal(20,2) DEFAULT NULL COMMENT 'amount',
  type varchar(10) COLLATE utf8_bin DEFAULT NULL COMMENT 'type: A or B',
  create_time datetime DEFAULT NULL COMMENT 'creation time',
  update_time datetime DEFAULT NULL COMMENT 'update time',
  PRIMARY KEY (id),
  KEY idx_account_no (account_no),
  KEY idx_create_time (create_time)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin COMMENT='account table';
The business requirement: fetch all type-A account data from 2021 and report it to the big data platform.
The straightforward implementation
Many developers, on receiving such a requirement, will implement it directly like this:
// Query the total number of records to report
Integer total = accountDAO.countAccount();

// SQL behind countAccount
<select id='countAccount' resultType="java.lang.Integer">
    select count(1)
    from account
    where create_time >= '2021-01-01 00:00:00'
    and type = 'A'
</select>

// Compute the number of pages
int pageNo = total % pageSize == 0 ? total / pageSize : (total / pageSize + 1);

// Query page by page and report
for (int i = 0; i < pageNo; i++) {
    int startRow = i * pageSize;
    List<AccountPO> list = accountDAO.listAccountByPage(startRow, pageSize);
    // Report to the big data platform
    postBigData(list);
}

// Paging SQL (a potential limit deep-pagination problem, because the account table holds millions of rows)
<select id='listAccountByPage'>
    select *
    from account
    where create_time >= '2021-01-01 00:00:00'
    and type = 'A'
    limit #{startRow},#{pageSize}
</select>
The optimized solution in practice
The implementation above suffers from the limit deep-pagination problem, because the account table holds millions of rows. How do we optimize it?
We can use the tag-recording method. Some readers may have doubts: the id primary key here is not continuous, can tag recording really be used?
It can. id does not need to be continuous, only ordered, and order by gives us that order. The optimized scheme is as follows:
// Query the minimum ID
String lastId = accountDAO.queryMinId();

// SQL behind queryMinId
<select id="queryMinId" resultType="java.lang.String">
    select MIN(id)
    from account
    where create_time >= '2021-01-01 00:00:00'
    and type = 'A'
</select>

// Page size
Integer pageSize = 100;
List<AccountPO> list;
do {
    list = listAccountByPage(lastId, pageSize);
    if (CollectionUtils.isEmpty(list)) {
        break;
    }
    // Tag recording: remember the last ID returned by this page
    lastId = list.get(list.size() - 1).getId();
    // Report to the big data platform
    postBigData(list);
} while (CollectionUtils.isNotEmpty(list));

<select id="listAccountByPage">
    select *
    from account
    where create_time >= '2021-01-01 00:00:00'
    and id > #{lastId}
    and type = 'A'
    order by id asc
    limit #{pageSize}
</select>
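One caveat (my observation, not from the original): the page query filters with id > #{lastId}, and lastId is seeded with MIN(id), so the row holding the minimum id itself is never reported. A minimal fix, as a sketch, is to make only the first page inclusive:

-- Sketch of a first-page query that includes the seeded minimum id;
-- subsequent pages keep the exclusive 'id > #{lastId}' form above.
select *
from account
where create_time >= '2021-01-01 00:00:00'
and id >= #{lastId}
and type = 'A'
order by id asc
limit #{pageSize}

Alternatively, since id is a varchar under utf8_bin, seeding lastId with an empty string (which sorts before every real id) lets you keep the exclusive form throughout and drop the queryMinId step.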