The recommended maximum for a single MySQL table is 20 million rows. Is that reliable?

2022-07-28 09:32:00 【JD technology developer】

1 Background

As a veteran who has worked in backend development for years, you have probably often heard: "a single MySQL table should not exceed 20 million rows", "once a single table exceeds 20 million rows, it's time to consider data migration", and "your table is almost at 20 million rows, no wonder queries are slow".

These well-known sayings are a bit like the group-chat rule "this group is for technical discussion only, no joyriding, and nobody drives faster than 120 or they get kicked automatically": everyone has heard it, nobody has actually tried it, haha.

So today let's floor the accelerator, push it to 180, and see what happens.

2 Experiment

Let's run an experiment and see.
Create a table:

CREATE TABLE person(
    id int NOT NULL AUTO_INCREMENT PRIMARY KEY comment 'Primary key',
    person_id tinyint not null comment 'User id',
    person_name VARCHAR(200) comment 'User name',
    gmt_create datetime comment 'Creation time',
    gmt_modified datetime comment 'Modification time'
) comment 'Personnel information table';

Insert one row:

insert into person values(1,1,'user_1', NOW(), now());

Use a MySQL user variable as a pseudo row-number column (rownum), then set its starting value to 1:

select (@i:=@i+1) as rownum, person_name from person, (select @i:=100) as init;
set @i=1;

Run the following SQL repeatedly. Every run doubles the number of rows, so 20 runs give 2^20, about 1 million rows, and 23 runs give 2^23, about 8 million. This way you can insert tens of millions of test rows. If you would rather grow the table by a smaller amount than doubling it, there is a trick: add a WHERE condition to the SQL, such as id > some value, to control how many rows are added.

insert into person(id, person_id, person_name, gmt_create, gmt_modified)
select @i:=@i+1,
    left(rand()*10,10) as person_id,
    concat('user_',@i%2048),
    date_add(gmt_create,interval + @i*cast(rand()*100 as signed) SECOND),
    date_add(date_add(gmt_modified,interval + @i*cast(rand()*100 as signed) SECOND), interval + cast(rand()*1000000 as signed) SECOND)
from person;
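
Because each run of the statement above doubles the table, the number of runs needed for a target size is easy to work out. A minimal Python sketch, assuming the table starts with one row and no WHERE clause limits the insert (the function name `runs_needed` is made up for illustration):

```python
def runs_needed(target_rows, start_rows=1):
    """Count how many doubling runs it takes to reach at least target_rows."""
    runs, rows = 0, start_rows
    while rows < target_rows:
        rows *= 2      # one INSERT ... SELECT doubles the table
        runs += 1
    return runs


for target in (1_000_000, 8_000_000, 20_000_000):
    n = runs_needed(target)
    print(f"{target:>12,} rows -> {n} runs ({2 ** n:,} rows)")
```

Under these assumptions, reaching 20 million rows takes 25 runs, and the last run overshoots to about 33.5 million, which is where the WHERE trick becomes useful.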

Note that somewhere around 8 million or 10 million rows you may hit the error "The total number of locks exceeds the lock table size". InnoDB keeps its lock structures in the buffer pool, so this means the memory settings are too small for such a large INSERT ... SELECT; enlarging them fixes it.

SET GLOBAL tmp_table_size = 512*1024*1024;              -- 512M
SET GLOBAL innodb_buffer_pool_size = 1*1024*1024*1024;  -- 1G

Let's first look at a set of test data. It was collected on MySQL 8.0 on my own computer, which was also running IDEA, a browser, and other tools, so neither the machine nor the database was configured for benchmarking. Treat the numbers as a rough reference only.

At first glance the data really does match the title: once the table passes 20 million rows, query time rises sharply. Is this an iron rule?

Now let's look at where the recommended value of 20 million comes from.

3 Single-table row limit

First, let's think about the maximum number of rows a single table can hold.

CREATE TABLE person(
    id int(10) NOT NULL AUTO_INCREMENT PRIMARY KEY comment 'Primary key',
    person_id tinyint not null comment 'User id',
    person_name VARCHAR(200) comment 'User name',
    gmt_create datetime comment 'Creation time',
    gmt_modified datetime comment 'Modification time'
) comment 'Personnel information table';

Look at the table definition above. id is the primary key and is unique, so the range of the primary key caps the number of rows in the table. If the primary key is declared as int, which is 32 bits, a signed column supports up to 2^31-1, about 2.1 billion rows (2^32-1, about 4.3 billion, if unsigned). If it is bigint, the limit is 2^63-1 (9223372036854775807). It is hard to imagine how big that is; in practice the database will probably be full long before this limit is reached!
Someone did the math: if you create a table whose auto-increment column is unsigned bigint, the maximum auto-increment value is 18446744073709551615. Adding one new record per second, how long would it take to use it up?
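
The question above is simple arithmetic. A small Python sketch, assuming exactly one insert per second and 365-day years (the function name `years_to_exhaust` is made up for illustration):

```python
UNSIGNED_BIGINT_MAX = 2 ** 64 - 1        # 18446744073709551615
SIGNED_INT_MAX = 2 ** 31 - 1             # 2147483647
SECONDS_PER_YEAR = 365 * 24 * 60 * 60    # leap years ignored


def years_to_exhaust(max_id, inserts_per_second=1):
    """Years until an auto-increment column reaches max_id."""
    return max_id / (inserts_per_second * SECONDS_PER_YEAR)


print(f"signed int:      {years_to_exhaust(SIGNED_INT_MAX):,.0f} years")
print(f"unsigned bigint: {years_to_exhaust(UNSIGNED_BIGINT_MAX):,.0f} years")
```

At that rate a signed int lasts about 68 years, while an unsigned bigint lasts on the order of 585 billion years, so the key range is not what limits a table in practice.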

4 Tablespace

Now let's look at the structure of the index. By the way, everything that follows is based on the InnoDB engine, and as everyone knows, InnoDB indexes are B+ trees.

The data of this table is stored on disk in a similar way. It lives in a file called person.ibd (InnoDB data), also known as the tablespace. Although the rows of a table look as if they were stored one after another, the file is actually divided into many small data pages of 16K each. Of course, this is just an abstraction; the tablespace also has concepts such as segments, extents, and groups, but we will step back from those here.

5 Data structure of a page

Each page is only 16K, so a large amount of data will not fit in one page and spills over into other pages. To link these pages together, each page records the addresses of the previous and next pages, which makes the neighbouring pages easy to find. Every page is also unique, so it needs a unique identifier: the page number. Rows are read from and written to the page, and a read or write may be interrupted or fail in some other way and leave incomplete data behind, so a verification mechanism is needed; the page therefore also carries a checksum. For reads, efficiency matters most. Scanning the records one by one would be laborious, so a page directory (Page Directory) is generated for the data. The internal structure of an actual page therefore looks like the following.

As you can see from the diagram, the storage space of an InnoDB data page is roughly divided into 7 parts (File Header, Page Header, Infimum + Supremum, User Records, Free Space, Page Directory, File Trailer). Some parts occupy a fixed number of bytes; others vary in size.

Of these 7 parts, the records we store ourselves are placed in the User Records part, in the row format we specify.

When a page is first created, however, it has no User Records part. Each time we insert a record, space the size of one record is carved out of the Free Space part (the unused storage space) and assigned to User Records. Once the Free Space part has been entirely taken over by User Records, the page is full, and inserting further records requires allocating a new page. The process is illustrated as follows.
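
The allocation described above can be imitated with a toy model. This is only a sketch, not how InnoDB is implemented: it assumes about 1K of fixed overhead per 16K page and fixed-size records, and the `Page` class and `insert_all` function are made up for illustration.

```python
PAGE_SIZE = 16 * 1024
ASSUMED_OVERHEAD = 1 * 1024   # assumption: headers, trailer, directory


class Page:
    def __init__(self, page_no):
        self.page_no = page_no
        self.free_space = PAGE_SIZE - ASSUMED_OVERHEAD
        self.user_records = []          # empty when the page is created

    def try_insert(self, record, size):
        """Move `size` bytes from Free Space to User Records if they fit."""
        if size > self.free_space:
            return False
        self.free_space -= size
        self.user_records.append(record)
        return True


def insert_all(records, record_size):
    """Insert records in order, allocating a new page whenever one fills up."""
    if record_size > PAGE_SIZE - ASSUMED_OVERHEAD:
        raise ValueError("record does not fit in an empty page")
    pages = [Page(0)]
    for record in records:
        if not pages[-1].try_insert(record, record_size):
            pages.append(Page(len(pages)))
            pages[-1].try_insert(record, record_size)
    return pages


pages = insert_all(range(40), record_size=1024)
print([len(p.user_records) for p in pages])   # rows stored per page
```

With 1K records, each page in this model takes 15 records before a new page is needed, so 40 records end up spread over three pages.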

That was the process of adding data.

Now let's talk about looking data up. Suppose we need to find one record. We could load every page of the tablespace into memory and check the records one by one. When there is little data that is no problem, and memory can cope. But reality is cruel and rarely gives you that situation. To solve this problem, MySQL has the concept of an index. We all know that an index speeds up queries, but how exactly? Let's take a look.

6 Data structure of the index

In MySQL, an index page has almost the same structure as the page just described, and it is also 16K in size. What an index page records, however, is the smallest primary key id and the page number of each page it points to (a data page or another index page). Index pages also carry level information, counted upward from 0, so there is a hierarchy between pages.

After seeing this picture, does it look familiar? Is it like a binary tree? Yes, you're right, it is a tree. Here we have drawn only three nodes in a 2-level structure; with more data it may grow to 3 levels. This is the B+ tree we so often talk about. The bottom level has page level = 0 and holds the leaf nodes; everything above it is a non-leaf node.

Look at the picture and take a single node. A non-leaf node (index page) stores two things in its content area: an id and a page number (address). The id is the smallest id recorded in the corresponding page, and the page number is a pointer to that page. Data pages are almost the same, except that a data page records the actual row data rather than page addresses, and its ids are also in order.

7 Recommended single-table size

Let's use a tree with 3 levels and 2 branches per node (in reality each node has M branches) to illustrate how one row is found.

For example, suppose we need to find the row with id=6. Non-leaf nodes store page numbers together with the smallest id of each page, so we start from the top and look in the directory of page 10, which contains [id=1, page number=20] and [id=5, page number=30]. The smallest id of the left node is 1 and the smallest id of the right node is 5. Since 6>5, binary search tells us to continue into the right node. Page 30 turns out to have child nodes (it is a non-leaf node), so we keep comparing. In the same way, 6>5 && 6<7, which leads us to page 60. Page 60 is a leaf node (data node), so we load its data into memory and compare row by row until we find the row with id=6.

In this process, finding the row with id=6 touched three pages in total. If none of the three pages is in memory (not loaded in advance), the lookup costs up to three disk I/Os.
Note that the page numbers in the figure are only examples. In reality they are not contiguous, and pages are not necessarily stored sequentially on disk.
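
The walk from page 10 to page 60 can be sketched in Python. This is a toy model of the lookup, not InnoDB's code: only pages 10, 20, 30 and 60 and the ids 1, 5 and 7 come from the walkthrough; the other page numbers, the row contents, and the names `INDEX_PAGES`, `LEAF_PAGES` and `find_row` are made up for illustration.

```python
import bisect

# Non-leaf pages: sorted list of (smallest id in child page, child page number).
INDEX_PAGES = {
    10: [(1, 20), (5, 30)],
    20: [(1, 40), (3, 50)],   # hypothetical children
    30: [(5, 60), (7, 70)],   # page 70 is hypothetical
}
# Leaf pages: id -> row data (hypothetical rows).
LEAF_PAGES = {
    40: {1: "row 1", 2: "row 2"},
    50: {3: "row 3", 4: "row 4"},
    60: {5: "row 5", 6: "row 6"},
    70: {7: "row 7", 8: "row 8"},
}


def find_row(root_page, target_id):
    """Return (row or None, list of pages visited) for target_id."""
    page_no, visited = root_page, []
    while True:
        visited.append(page_no)           # each visit is a potential disk I/O
        if page_no in LEAF_PAGES:
            return LEAF_PAGES[page_no].get(target_id), visited
        entries = INDEX_PAGES[page_no]
        min_ids = [min_id for min_id, _ in entries]
        # Binary search for the last entry whose smallest id is <= target_id.
        slot = max(bisect.bisect_right(min_ids, target_id) - 1, 0)
        page_no = entries[slot][1]


row, visited = find_row(10, 6)
print(row, visited)
```

For id=6 the function visits pages 10, 30 and 60, which matches the three page reads counted above.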

By now we roughly know what the table's data structure looks like and how a query works, so we can roughly estimate how much data such a structure can hold.

From the diagram above we know that in a B+ tree only the leaf nodes hold row data; the non-leaf nodes hold index data.

So, for the same 16K page, every entry in a non-leaf node points to another page, and there are two possibilities for that page:

If it is a leaf node, it holds rows of data.
If it is a non-leaf node, it in turn points to further pages.

Assume:

The number of pointers to other pages in a non-leaf node is x.
The number of rows a leaf node can hold is y.
The number of levels in the B+ tree is z.

This is shown in the following figure:
Total = x^(z-1) * y, that is, the total equals x to the power z-1, multiplied by y.

x = ?

The structure of a page was introduced earlier in the article, and index pages are no exception. Each has a File Header (38 bytes), a Page Header (56 bytes), Infimum + Supremum (26 bytes), and a File Trailer (8 bytes); add the page directory and the overhead is roughly 1K, so let's call it 1K. The whole page is 16K, which leaves 15K for data. An index page mainly records primary keys and page numbers. Assume the primary key is bigint (8 bytes); the page number is a fixed 4 bytes, so one entry in an index page takes 12 bytes. Therefore x = 15*1024/12 = 1280 entries.
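
The estimate of x can be reproduced in a few lines of Python. This follows the article's simplified model only (about 1K of overhead per page, and nothing per entry beyond the key and the page number); the 4-byte int case and the names `FIXED_PARTS` and `fan_out` are added here for illustration.

```python
PAGE_SIZE = 16 * 1024
FIXED_PARTS = {                 # sizes in bytes, as listed above
    "File Header": 38,
    "Page Header": 56,
    "Infimum + Supremum": 26,
    "File Trailer": 8,
}
ASSUMED_OVERHEAD = 1024         # fixed parts plus page directory, rounded to 1K
PAGE_NO_BYTES = 4


def fan_out(key_bytes):
    """Entries per index page under the simplified model."""
    usable = PAGE_SIZE - ASSUMED_OVERHEAD
    return usable // (key_bytes + PAGE_NO_BYTES)


print("fixed parts:", sum(FIXED_PARTS.values()), "bytes")
print("bigint key (8 bytes):", fan_out(8))
print("int key (4 bytes):", fan_out(4))
```

The fixed parts add up to only 128 bytes, so the 1K figure is a generous rounding. Under the same model a 4-byte int key would raise the fan-out from 1280 to 1920.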

y = ?

Leaf nodes have the same structure as non-leaf nodes, so the space available for data is likewise 15K. But a leaf node stores actual row data, and many more factors come into play, such as the types and the number of columns. The more space each row takes, the fewer rows fit in a page. For now, let's assume one row takes 1K; a page can then hold 15 rows, so y ≈ 15.

With that, you probably have a rough idea already.
According to the formula above, Total = x^(z-1) * y, with x=1280 and y=15:
If the B+ tree has two levels, z=2, then Total = (1280^1) * 15 = 19200.
If the B+ tree has three levels, z=3, then Total = (1280^2) * 15 = 24576000 (about 24.5 million).
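
The formula is easy to check for several tree depths. A minimal Python sketch using the x and y estimated above (the function name `max_rows` is made up for illustration):

```python
def max_rows(x, y, z):
    """Total = x^(z-1) * y: fan-out x, rows per leaf page y, tree levels z."""
    return x ** (z - 1) * y


for z in (2, 3, 4):
    print(f"z={z}: {max_rows(1280, 15, z):,} rows")
```

Each extra level multiplies the capacity by 1280, which is why the estimate jumps from tens of millions at three levels to tens of billions at four.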

Well, look at that! Isn't this exactly the recommended maximum of 20 million rows from the beginning of the article? Right: a B+ tree generally has at most 3 levels. Think about it: with 4 levels, besides the extra disk I/O per query, what would Total be? More than 30 billion, which is not very realistic. So 3 levels is a reasonable value.

Is that the end of the story?

No.
The value of y above assumed that one row takes 1K. Suppose the rows in my business are not 1K but 5K; then a single data page can hold at most 3 rows.
Still calculating with z=3, Total = (1280^2) * 3 = 4915200 (nearly 5 million).
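
To see how strongly row size drives the result, the same estimate can be repeated for a few row sizes. A short Python sketch under the article's assumptions (15K usable per page, fan-out 1280, 3 levels); the 256-byte case and the function name `rows_for_row_size` are added here for illustration.

```python
USABLE_BYTES = 15 * 1024   # per leaf page, after the assumed 1K overhead
FAN_OUT = 1280             # x, for a bigint primary key


def rows_for_row_size(row_bytes, levels=3):
    """Return (rows per leaf page, estimated total rows) for a row size."""
    y = USABLE_BYTES // row_bytes
    return y, FAN_OUT ** (levels - 1) * y


for row_bytes in (256, 1024, 5 * 1024):
    y, total = rows_for_row_size(row_bytes)
    print(f"{row_bytes:>5} bytes/row -> y={y:>2}, about {total:,} rows")
```

Under these assumptions a three-level tree holds anywhere from about 5 million rows (5K rows) to about 98 million rows (256-byte rows), so a single fixed threshold cannot fit every table.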

So, at the same tree depth (and therefore similar query performance), the recommended maximum differs when the row size differs. Many other factors also affect query performance, such as the database version, the server configuration, and how the SQL is written. To improve performance, MySQL loads a table's index into memory. When the InnoDB buffer pool is large enough, the index can be loaded into memory in full and queries are not a problem. But once a single table grows past a certain magnitude, memory can no longer hold its index, subsequent SQL queries incur disk I/O, and performance degrades. That is why upgrading the hardware (for example, adding enough memory that it effectively stands in for the disk) may bring an immediate performance improvement.

8 Summary

MySQL table data is stored in pages, and pages are not necessarily contiguous on disk.
A page is 16K, but not all of that space stores data; some of it holds fixed information such as the header, the trailer, the page number, and the checksum.
In a B+ tree, leaf nodes and non-leaf nodes have the same data structure. The difference is that leaf nodes store the actual row data, while non-leaf nodes store primary keys and page numbers.
The index structure does not limit the maximum number of rows in a single table. 20 million is only a recommended value; exceeding it may make the B+ tree deeper and hurt query performance.