当前位置:网站首页>Why do you say that the maximum single table of MySQL database is 20million? Based on what?

Why do you say that the maximum single table of MySQL database is 20million? Based on what?

2022-07-04 16:46:00 Xiaobai debug

I am participating in the recruitment of nuggets technology community creator signing program , Click the link to sign up for submission .


The story began many years ago .

You must have heard of database single table It is suggested that 2kw The term data . If you exceed , The performance will decline greatly .

As it happens .

I've also heard of .

But I don't accept its advice , Just a single watch 1 Billion data .

Now , When the new interns in our group saw it , Innocently asked me :" Isn't the maximum recommended for a single watch 20 million ? Why does this watch Let go of 1 A hundred million yuan is not divided into banks and tables "?

I can say I am Because of laziness Do you ? When I first designed it, I didn't think this watch could rise so fast ...

I can't .

Admit yourself The cancer in the development team , Although I do , But I Can't admit .

I'm on pins and needles , disturbed , Like a fishbone getting stuck in the throat -- necessary to give vent to one's pent-up feelings .

Started a wave of operation .

" I do this for a reason "

" Although this watch is big , But have you found that its query is still very fast "

" This 2kw It's a recommended value , Let's take a look at this 2kw How did you get it "

What is the maximum number of rows in a single database table ?

Let's first look at the theoretical maximum number of rows in a single table .

Tabular SQL It's written like this ,

CREATE TABLE `user` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT COMMENT ' Primary key ',
  `name` varchar(100) NOT NULL DEFAULT '' COMMENT ' name ',
  `age` int(11) NOT NULL DEFAULT '0' COMMENT ' Age ',
  PRIMARY KEY (`id`),
  KEY `idx_age` (`age`)
) ENGINE=InnoDB AUTO_INCREMENT=100037 DEFAULT CHARSET=utf8;

among id It's the primary key . The primary key itself is unique , In other words, the size of the primary key can limit the upper limit of the table .

If the primary key is declared as int size , That is to say 32 position , Then you can support 2^32-1, That is to say 21 $ about .

If it is bigint, That's it 2^64-1, But this The number is too big , Usually before this limit , The disk can't stand .

Go crazy .

If I declare the primary key as tinyint, A byte ,8 position , Maximum 2^8-1, That is to say 255.

CREATE TABLE `user` (
  `id` tinyint(2) unsigned NOT NULL AUTO_INCREMENT COMMENT ' Primary key ',
  `name` varchar(100) NOT NULL DEFAULT '' COMMENT ' name ',
  `age` int(11) NOT NULL DEFAULT '0' COMMENT ' Age ',
  PRIMARY KEY (`id`),
  KEY `idx_age` (`age`)
) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8;

If I want to insert a id=256 The data of , that You're going to report a mistake .

mysql> INSERT INTO `tmp` (`id`, `name`, `age`) VALUES (256, '', 60);
ERROR 1264 (22003): Out of range value for column 'id' at row 1

in other words ,tinyint The primary key limits the maximum number of in the table 255 Data .

Except , What other factors will affect the number of rows ?

The structure of the index

The internal index is used B+ Trees , This is also an old eight part essay , It's estimated that everyone knows it very well .

In order not to let everyone have too strong fatigue of judging ugliness , Today I try to tell you about this thing from another angle .

The structure of the page

Suppose we have such a picture user Data sheet .

user surface

among id yes Unique primary key .

This looks like a row of data , For convenience , We'll call them later record Well .

This watch looks like excel The form is the same .excel The data on the hard disk is a xx.excel The file of .

And up there user Table data , In fact, it is similar on the hard disk , On the user.ibd Under the document . The meaning is user Tabular innodb data file , Be professional , Also called Table space .

Although in the data sheet , They seem to be next to each other . But it's actually user.ibd They are divided into many small parts in the Data pages , Each size 16k.

Similar to the following .

ibd There are a lot of pages inside the file

Let's focus on , Put it on the page .

The whole page 16k, not big , but record So much , There must be no room for one page , So it will be separated into many pages . And this 16k, It can't be all used for record Right .

because record The were divided into several parts , Put it on many pages , To uniquely identify which page , Then we need to introduce Page number ( In fact, it is the address offset of a table space ). At the same time, in order to associate these data pages , So I introduced Front and back pointer , Used to point to front and back pages . These are added to Header in .

Pages need to be read and written ,16k Small is not small , It's possible that half of the power cord is unplugged , So in order to ensure the correctness of the data page , A check code is also introduced . This is added to footer .

The rest of the space , It's for us record Of . and record If there are too many lines , When entering the page, traverse one by one , Efficiency is not very good , So for this data, a Page directory , The specific implementation details are not important . Just need to know , It can go through Two points search The way will find efficiency from O(n) become O(lgn) .

 Page structure

From page to index

If you want to check one record, We can get every page out of the table space , Then put the record Take it out and judge whether it's what we're looking for .

When the number of rows is small , There's no problem with this operation .

The number of rows is large , The performance is slow , So in order to speed up the search , We can choose from each data page Primary key id Minimum Of record, And only need their Primary key id And the page number of the page . form new record, Put it into a newly generated data page , This The new data page is no different from the previous page structure , And the size is still 16k.

But in order to distinguish it from the previous data page . In the data page Page level (page level) Information about , from 0 Start counting up . So there is... Between pages Upper and lower levels The concept of , It looks like this .

 Two layers of B+ Tree structure

Suddenly, it looks like an inverted tree between pages . That's what we often say B+ Trees Indexes .

The bottom floor ,page level by 0, It's called leaf node , The rest are called Non leaf node .

It shows Two layers of The tree of , If there's more data , We can also use a similar method , Build another layer up . became Three layers The tree of .

B+ Tree structure

Now we can pass through such a tree B+ Tree acceleration query . for instance .

Let's say we want to find row data 5. I'll start with the top page record Let's start .record Contains the primary key id And page number ( Page address ) . Look at the yellow arrow below , Minimum left id yes 1, Minimum right id yes 7. that id=5 If the data exists , That must be the arrow on the left . So follow the record Your page address is here 6 Number In the data page , To determine id=5>4, So it must be in the data page on the right , So load 105 Number Data pages . Find... In the data page id=5 The data line , Complete the query .

B+ Tree query process

Another thing to note , The page numbers of the above pages are not continuous , They don't have to be next to each other in the disk .

In this process, three pages were queried , If all three pages are on disk ( Not loaded into memory in advance ), that most It takes three times disk IO Inquire about , They can be loaded into memory .

B+ Number of records carried by the tree

As can be seen from the above structure B+ Treelike The last leaf node There's... In it record data . and Non leaf node There are index data used to speed up the query .

in other words

The same one 16k Page of , Every piece of data in a non leaf node points to a new page , There are two possibilities for a new page .

  • If it is the last level leaf node , Then there are lines in it record data .
  • If it is a non leaf node , Then the loop will continue to point to the new data page .

hypothesis

  • The number of pointers to other memory pages in non leaf nodes is x
  • What can be accommodated in the leaf node record The number of y
  • B+ The number of layers of the tree is z

 How to calculate the total number of rows

Then this tree B+ The tree is Total row data be equal to (x ^ (z-1)) * y.

x How to calculate

Let's go back to the structure of the data page .

 Page structure

Non leaf nodes mainly contain data related to index queries , Put the primary key and the page number .

The primary key assumption is bigint(8Byte), The page number is called... In the source code FIL_PAGE_OFFSET(4Byte), So a piece of data in the non leaf node is 12Byte about .

The whole data page 16k, The data in the header and footer add up to about 128Byte, Add the page directory, which is estimated to account for 1k Well . The rest of the 15k Divide 12Byte, be equal to 1280, That is, it can point to x=1280 page .

We often say that a binary tree refers to a node that can emit two new nodes .m A node of a fork tree can point to m A new node . This operation pointing to the new node is called Fan out (fanout) .

While the above B+ Trees , It can point to 1280 A new node , Terror is like this , so to speak The fan out is very high 了 .

y The calculation of

The data structures of leaf nodes and non leaf nodes are the same , So let's also assume that there are 15kb Can play .

The leaf node contains real row data . Suppose a row of data 1kb, So you can put... On one page y=15 That's ok .

The total number of rows is calculated

go back to (x ^ (z-1)) * y This formula. .

It is known that x=1280,y=15.

hypothesis B+ Trees are Two layers of , that z=2. It is (1280 ^ (2-1)) * 15 ≈ 2w

hypothesis B+ Trees are Three layers , that z=3. It is (1280 ^ (3-1)) * 15 ≈ 2.5kw

This 2.5kw, This is what we often call the recommended maximum number of rows in a single table 2kw The origin of . After all, add another layer , The data is a little outrageous . The three-tier data page corresponds to the disk at most three times IO, It's also more reasonable .

Is it slow when the number of rows exceeds 100 million ?

The above assumes that single line data uses 1kb, So a data page can put 15 Row data .

If I can't use so much single line data , For example, only 250byte. Then a single data page can hold 60 Row data .

That's also the third floor B+ Trees , The number of rows supported by a single table is (1280 ^ (3-1)) * 60 ≈ 1 $ .

Look at my 100 million data , In fact, it's only three floors B+ Trees , In this B+ To find a row of data in the tree , At most three times IO. So it's not slow .

This explains the beginning of the article , Why do I have a single watch 1 $ , But there's nothing wrong with query performance .

B Number of records carried by the tree

Now that we're all here , Let's talk more about this topic .

We all know , Now? mysql The index of all is B+ Trees , And there is a tree , Follow B+ The trees are very similar , It's called B Trees , Also called B- Trees .

It goes with B+ The biggest difference between trees is ,B+ The tree only places the row data of the data table at the last level leaf node , and B The tree will put... On both leaf and non leaf nodes .

therefore ,B The structure of the tree is similar to this

B Tree structure

B The tree will store all row data on non leaf nodes , Suppose each data page is still 16kb, Start and finish, leaving... On each page 15kb, And the row data of a data table still accounts for 1kb, Even without considering various page pointers , You can only put 15 Data . The fan out of data pages is obviously less .

The formula for calculating the total number of rows that can be carried has also become a Geometric series .

15 + 15^2 +15^3 + ... + 15^z

among z Or the number of layers It means .

In order to put 2kw Left and right data , need z>=6. That is, trees need to have 6 layer , Check an interview 6 A page . Assume that this 6 Pages are not continuous , In order to query one of the data , The worst case needs to be corrected 6 Secondary disk IO.

and B+ The same goes for trees 2kw Data left and right , Check once, at most 3 Secondary disk IO.

disk IO The more, the slower , The performance gap between the two is slightly larger .

So ,B+ Tree ratio B Trees are better suited to be mysql The index of .

summary

  • B+ The data pages of tree leaf and non leaf nodes are 16k, And the data structure is consistent , The difference is that the leaf node puts real row data , Instead of leaf nodes, the primary key and the address of the next page are placed .
  • B+ Trees usually have two to three floors , Because of its high fan out , Three layers can support 2kw The above data , And one query at most 1~3 Secondary disk IO, The performance is also OK .
  • Store data of the same magnitude ,B Tree ratio B+ The tree level is higher , So disk IO And more , therefore B+ Trees are better suited to be mysql Indexes .
  • The index structure will not affect the maximum number of rows in a single table ,2kw It's just the recommended value , Exceeding this value may result in B+ The tree level is higher , Affect query performance .
  • The maximum value of a single table is also limited by the primary key size and disk size .

Reference material

《MYSQL kernel :INNODB Storage engine volume 1》

Last

Although I put... In the list 1 Billion data , But the premise of this operation is , I know it won't affect performance much .

This wave explanation , There is no flaw , With no chink in one's armour .

Come here , Even I was convinced by myself . So must interns .

Abominable , This damn tumor has some " Knowledgeable ".

Recently, the reading volume of original articles has fallen steadily , focused , Toss and turn at night .

I have an immature request .

I've been away from Guangdong for a long time , I haven't been called pretty for a long time .

You can Comment area in , Call me a pretty boy ?

I have such a kind and simple wish , Can you be satisfied ?

If you really can't say it , Can you help me point the one in the lower right corner Dianzanhe is watching Do you ?

Don't mention it. , Let's choke together in the ocean of knowledge

I am participating in the recruitment of nuggets technology community creator signing program , Click the link to sign up for submission .

原网站

版权声明
本文为[Xiaobai debug]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/185/202207041451084840.html

随机推荐