当前位置：网站首页>JD Sanmian: I want to query a table with tens of millions of data. How can I operate it?

JD Sanmian: I want to query a table with tens of millions of data. How can I operate it?

2022-07-26 16:44:00 【chenzixia】

Preface

interviewer ： Say it , Ten million data , How did you inquire ？
I ： Direct paging query , Use limit Pagination .
interviewer ： Have you ever practiced ？
I ： There must be

Here's a song 《 be doomed 》

Maybe some people haven't met a table with tens of millions of data , It's not clear what happens when you query tens of millions of data .

Today, let's take you to practice , This time it's based on MySQL 5.7.26 Do a test

Prepare the data

What to do without 10 million data ？

Create it

Code to create 10 million ？ That's impossible , Too slow , Maybe you really have to run all day . You can use database scripts to execute much faster .

Create table

CREATE TABLE `user_operation_log` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` varchar(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
`ip` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
`op_data` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
`attr1` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
`attr2` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
`attr3` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
`attr4` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
`attr5` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
`attr6` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
`attr7` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
`attr8` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
`attr9` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
`attr10` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
`attr11` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
`attr12` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NULL DEFAULT NULL,
PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 1 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;

Create data script

Using batch insert , It will be much faster , And every 1000 The number is commit, Too much data , It will also lead to slow batch insertion efficiency

DELIMITER ;
;
CREATE PROCEDURE batch_insert_log()BEGIN DECLARE i iNT DEFAULT 1;
DECLARE userId iNT DEFAULT 10000000;
set @execSql = 'INSERT INTO `test`.`user_operation_log`(`user_id`, `ip`, `op_data`, `attr1`, `attr2`, `attr3`, `attr4`, `attr5`, `attr6`, `attr7`, `attr8`, `attr9`, `attr10`, `attr11`, `attr12`) VALUES';
set @execData = '';
WHILE i<=10000000 DO set @attr = "' Test long, long, long, long, long, long, long, long, long, long, long, long, long, long, long, long, long properties '";
set @execData = concat(@execData, "(", userId + i, ", '10.0.69.175', ' User login operation '", ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ",", @attr, ")");
if i % 1000 = 0 then set @stmtSql = concat(@execSql, @execData,";");
prepare stmt from @stmtSql;
execute stmt;
DEALLOCATE prepare stmt;
commit;
set @execData = ""; else set @execData = concat(@execData, ",");
end if;
SET i=i+1;
END WHILE;
END;
;
DELIMITER ;

Start testing

Brother's computer configuration is relatively low ：win10 Standard pressure slag i5 About reading and writing 500MB Of SSD

Due to low configuration , For this test, only 3148000 Data , Disk occupied 5G( Without indexing ), ran 38min, Students with computer configuration , You can insert multipoint data test

SELECT count(1) FROM `user_operation_log`

Return results ：3148000

The time of three queries is respectively ：

14060 ms
13755 ms
13447 ms

General paging query

MySQL Support LIMIT Statement to select the specified number of data , Oracle have access to ROWNUM To select .

MySQL The syntax of paging query is as follows ：

SELECT * FROM table LIMIT [offset,] rows | rows OFFSET offset

The first parameter specifies the offset of the first return record line
The second parameter specifies the maximum number of rows to return records

Now let's start testing the query results ：

SELECT * FROM `user_operation_log` LIMIT 10000, 10

Inquire about 3 The times are ：

59 ms
49 ms
50 ms

It seems that the speed is OK , It's just a local database , Speed naturally faster .

Test from another angle

Same offset , Different amount of data

SELECT * FROM `user_operation_log` LIMIT 10000, 10
SELECT * FROM `user_operation_log` LIMIT 10000, 100
SELECT * FROM `user_operation_log` LIMIT 10000, 1000
SELECT * FROM `user_operation_log` LIMIT 10000, 10000
SELECT * FROM `user_operation_log` LIMIT 10000, 100000
SELECT * FROM `user_operation_log` LIMIT 10000, 1000000

The query time is as follows ：

Quantity first, second, third 10 strip 53ms52ms47ms100 strip 50ms60ms55ms1000 strip 61ms74ms60ms10000 strip 164ms180ms217ms100000 strip 1609ms1741ms1764ms1000000 strip 16219ms16889ms17081ms

From the above results, we can conclude that ： More data , The longer it takes

The same amount of data , Different offset

SELECT * FROM `user_operation_log` LIMIT 100, 100
SELECT * FROM `user_operation_log` LIMIT 1000, 100
SELECT * FROM `user_operation_log` LIMIT 10000, 100
SELECT * FROM `user_operation_log` LIMIT 100000, 100
SELECT * FROM `user_operation_log` LIMIT 1000000, 100

Offset the first time, the second time, the third time 10036ms40ms36ms100031ms38ms32ms1000053ms48ms51ms100000622ms576ms627ms10000004891ms5076ms4856ms

From the above results, we can conclude that ： The greater the offset , The longer it takes

SELECT * FROM `user_operation_log` LIMIT 100, 100
SELECT id, attr FROM `user_operation_log` LIMIT 100, 100

How to optimize

Now that we have gone through the above toss , The conclusion is also drawn , In view of the above two problems ： Large offset 、 Large amount of data , Let's optimize separately

Optimization of large offset

Adopt sub query method

We can locate the offset position first id, Then query the data

SELECT * FROM `user_operation_log` LIMIT 1000000,
10SELECT id FROM `user_operation_log` LIMIT 1000000,
1SELECT * FROM `user_operation_log` WHERE id >= (
SELECT id FROM `user_operation_log` LIMIT 1000000,
1
) LIMIT 10

The query results are as follows ：

sql Take time first 4818ms Second ( Without index )4329ms Second ( With index )199ms Article 3 the ( Without index )4319ms Article 3 the ( With index )201ms

Draw a conclusion from the above results ：

The first one takes the most time , The third is a little better than the first
Subqueries use indexes faster

shortcoming ： Only applicable to id Incremental situation

id In the case of non increment, the following expression can be used , But this disadvantage is that paging queries can only be placed in sub queries

Be careful ： some mysql Version not supported in in Used in clauses limit, Therefore, multiple nested select

SELECT * FROM `user_operation_log` WHERE id IN (
SELECT t.id FROM (
SELECT id FROM `user_operation_log` LIMIT 1000000,
10
) AS t
)

use id Limiting mode

This method is more demanding ,id Must be continuously increasing , And it has to be calculated id The scope of the , And then use between,sql as follows

SELECT * FROM `user_operation_log` WHERE id between 1000000 AND 1000100 LIMIT 100SELECT * FROM `user_operation_log` WHERE id >= 1000000 LIMIT 100

The query results are as follows ：

sql Take time first 22ms Second 21ms

From the results, we can see that this method is very fast

Be careful ： there LIMIT Yes, the number is limited , No offset is used

Optimize the problem of large amount of data

The amount of data returned will also directly affect the speed

SELECT * FROM `user_operation_log` LIMIT 1, 1000000
SELECT id FROM `user_operation_log` LIMIT 1, 1000000
SELECT id, user_id, ip, op_data, attr1, attr2, attr3, attr4, attr5, attr6, attr7, attr8, attr9, attr10, attr11, attr12 FROM `user_operation_log` LIMIT 1, 1000000

The query results are as follows ：

sql Take time first 15676ms Second 7298ms Article 3 the 15960ms

From the results, we can see that we can reduce the unnecessary Columns , Query efficiency can also be significantly improved

The first and third queries are about the same speed , You will definitely make complaints about it. , Then why do I write so many fields , direct * No, it's over

Pay attention to my MySQL The server and client are in The same machine On , So the query data is not much different , Conditional students can test the client and MySQL Separate

SELECT * Doesn't it smell good ？

By the way, I would like to add why SELECT *. Is it simple and brainless , Doesn't it smell good ？

Two main points ：

use "SELECT * " The database needs to parse more objects 、 Field 、 jurisdiction 、 Properties and other related content , stay SQL Complex sentences , In the case of more hard parsing , It's a huge burden on the database .
Increase network overhead ,* Sometimes it will be mistakenly taken with log、IconMD5 Such useless and large text fields , The data transfer size Will grow geometrically . especially MySQL Not on the same machine as the application , This kind of expense is very obvious .