04 | Indexes in Simple Terms (Part 1)

2022-06-11 15:12:00 【cjh-Java】

When it comes to database indexes, I'm sure they are nothing new to you; you run into them often in day-to-day work. For example, when an SQL query is slow, after analyzing the cause you might propose a fix like "add an index to such-and-such field". But what exactly is an index, and how does it work? Let's talk about this topic today.

There is a lot to cover about database indexes, so I have split it into two articles. The index is one of the most important concepts in a database system, so I hope you can be patient with it. In the later hands-on articles I will often refer back to the points covered in these two articles, to deepen your understanding of database indexes.

Simply put, an index exists to make data queries more efficient, just like the table of contents of a book. Take a 500-page book: if you want to quickly find a particular topic in it without the help of a table of contents, I expect it will take you a while. In the same way, for a table in a database, the index is in effect its "table of contents".

Common models of indexes

Indexes exist to improve query efficiency, but there are many ways to implement one, so the concept of an index model is introduced here. Many data structures can be used to improve read and write efficiency. Here I will introduce three that are common and relatively simple: hash tables, ordered arrays, and search trees.

Below, mainly from a usage perspective, I will briefly analyze the differences between these three models.

A hash table is a structure that stores data as key-value pairs. Given the key we want to find, we can get the corresponding value. The idea behind hashing is simple: put the values in an array, use a hash function to convert each key into a definite position, and store the value at that position in the array.

Inevitably, multiple keys will be mapped to the same position by the hash function. One way to handle this is to hang a linked list off that position.

Suppose you maintain a table of ID card numbers and names, and need to look up the name for a given ID card number. The corresponding hash index is illustrated below:

Figure 1: Schematic of a hash table

In the figure, the ID card numbers of User2 and User4 both hash to N, but that is fine, because a linked list follows. Suppose you now need to find the name for ID_card_n2. The steps are: first, run ID_card_n2 through the hash function to get N; then, traverse the list in order until you find User2.

Note that the four ID_card_n values in the figure are not in increasing order. The advantage is that adding a new User is fast: you just append it. The disadvantage is that, because the entries are not ordered, range queries on a hash index are very slow.

As you can imagine, if you now want to find all users whose ID card number falls in the range [ID_card_X, ID_card_Y], you have to scan everything.
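
To make the trade-off concrete, here is a minimal sketch of a hash index with chaining. The class name, the bucket count, and the integer ID card numbers are all made up for illustration, and keys are assumed to be unique; it is not how any particular engine implements hashing.

```python
# Hypothetical sketch of a hash index with chaining; all names are illustrative.
class HashIndex:
    def __init__(self, n_buckets=8):
        # Each array slot holds a chain (list) of (key, value) pairs.
        self.buckets = [[] for _ in range(n_buckets)]

    def _slot(self, key):
        # Hash function: convert the key into a position in the array.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        # Adding an entry is cheap: append it to the chain of its slot.
        self.buckets[self._slot(key)].append((key, value))

    def get(self, key):
        # Equality lookup: hash to one slot, then walk its chain in order.
        for k, v in self.buckets[self._slot(key)]:
            if k == key:
                return v
        return None

    def range_scan(self, low, high):
        # Range query: keys are unordered, so every chain has to be scanned.
        hits = [(k, v) for chain in self.buckets for k, v in chain if low <= k <= high]
        return sorted(hits)


index = HashIndex()
for card, name in [(1004, "User4"), (1001, "User1"), (1003, "User3"), (1002, "User2")]:
    index.put(card, name)
```

Here `index.get(1002)` touches a single chain, while `index.range_scan(1002, 1003)` has to visit every bucket no matter how narrow the range is.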

So the hash table structure suits scenarios that only involve equality queries, such as Memcached and some other NoSQL engines.

An ordered array, by contrast, performs very well for both equality queries and range queries. Taking the same example of looking up names by ID card number, an implementation with an ordered array looks like this:

Figure 2: Schematic of an ordered array

Here we assume ID card numbers are unique, and the array is kept in increasing order of ID card number. If you want to find the name for ID_card_n2, binary search gets you there quickly, with time complexity O(log(N)).

This index structure clearly also supports range queries. To find the Users whose ID card numbers fall in [ID_card_X, ID_card_Y], first use binary search to find ID_card_X (if ID_card_X does not exist, find the first User greater than ID_card_X), then walk to the right until you reach the first ID card number greater than ID_card_Y, and exit the loop.
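
The two lookups described above can be sketched with Python's standard `bisect` module. The sample ID card numbers and names are invented, and the two parallel lists simply stand in for the ordered array.

```python
import bisect

# Hypothetical data, kept sorted by ID card number (assumed unique).
ids = [1001, 1002, 1003, 1005, 1008]
names = ["User1", "User2", "User3", "User5", "User8"]

def lookup(card):
    # Equality query: binary search, O(log N).
    i = bisect.bisect_left(ids, card)
    if i < len(ids) and ids[i] == card:
        return names[i]
    return None

def range_query(low, high):
    # Binary search for the first ID >= low, then walk right
    # until the first ID greater than high and stop.
    i = bisect.bisect_left(ids, low)
    result = []
    while i < len(ids) and ids[i] <= high:
        result.append(names[i])
        i += 1
    return result
```

For example, `range_query(1002, 1006)` starts at the position of 1002 and stops as soon as it sees 1008, without touching the rest of the array.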

Looking only at query efficiency, the ordered array is the best data structure. But updating the data is troublesome: inserting a record in the middle means moving every record after it, which is too costly.
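
As a rough illustration of that cost, the sketch below counts how many existing elements have to shift when a key is inserted into a sorted Python list. The helper name `insert_sorted` and the sample keys are assumptions made for the example.

```python
import bisect

def insert_sorted(arr, key):
    """Insert key into a sorted list and return how many elements had to shift."""
    pos = bisect.bisect_left(arr, key)
    shifted = len(arr) - pos   # every element from pos onward moves one slot right
    arr.insert(pos, key)
    return shifted

ids = [100, 200, 300, 500, 600]
moved_tail = insert_sorted(ids, 700)   # appended at the end
moved_mid = insert_sorted(ids, 400)    # inserted in the middle
```

Appending at the end moves nothing, while the insert in the middle shifts every later element; on a large array that shift is what makes updates expensive.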

So ordered-array indexes are suitable only for static storage engines. For example, if what you want to store is all the population information of a city for 2017, that kind of data will not be modified again.

The binary search tree is also a classic textbook data structure. Taking the same example of looking up names by ID card number, an implementation with a binary search tree looks like this:

Figure 3: Schematic of a binary search tree

The defining property of a binary search tree is that every node in a parent's left subtree has a smaller value than the parent, and every node in its right subtree has a larger value. So to find ID_card_n2, the search order in the figure follows the path UserA -> UserC -> UserF -> User2. The time complexity is O(log(N)).
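
A minimal sketch of such a search follows. The keys and the `User…` values are invented, and the tree does no rebalancing: the keys are simply inserted in an order that happens to produce a balanced tree.

```python
class Node:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None

def insert(root, key, value):
    # Plain BST insert without rebalancing (a simplification for this sketch).
    if root is None:
        return Node(key, value)
    if key < root.key:
        root.left = insert(root.left, key, value)
    else:
        root.right = insert(root.right, key, value)
    return root

def search(root, key):
    """Return (value, list of keys visited on the way down)."""
    path = []
    node = root
    while node is not None:
        path.append(node.key)
        if key == node.key:
            return node.value, path
        # Smaller keys live in the left subtree, larger ones in the right.
        node = node.left if key < node.key else node.right
    return None, path

root = None
for key in [50, 30, 70, 20, 40, 60, 80]:
    root = insert(root, key, f"User{key}")
```

Searching for 60 visits 50, then 70, then 60: one node per level, which is where the O(log(N)) bound comes from as long as the tree stays balanced.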

Of course, to keep query complexity at O(log(N)), you need to keep the tree balanced. And to guarantee that, the time complexity of an update is also O(log(N)).

A tree can be binary or have more branches. A multiway tree is one where each node has multiple children, with the children ordered from left to right in increasing order. A binary tree has the highest search efficiency, yet in practice most database storage does not use binary trees. The reason is that indexes do not live only in memory; they also have to be written to disk.

Imagine a balanced binary tree with 1 million nodes; its height is 20. A single query may need to access 20 data blocks. In the era of mechanical hard disks, randomly reading one data block from disk takes about 10 ms of seek time. In other words, for a table with 1 million rows stored as a binary tree, accessing a single row may take 20 × 10 ms, which is a really slow query.
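
The arithmetic behind these numbers can be checked in a few lines. The variable names are illustrative; the node count and seek time are the figures quoted in the text.

```python
import math

nodes = 1_000_000   # rows in the table, from the text
seek_ms = 10        # approximate random read cost per block on a mechanical disk

# Height of a balanced binary tree holding that many nodes.
height = math.ceil(math.log2(nodes + 1))
# Worst case: one random disk read per level.
worst_case_ms = height * seek_ms
```

This gives a height of 20 and a worst case of about 200 ms for a single-row lookup.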

To make a query read the disk as little as possible, the query must access as few data blocks as possible. So we should use an "N-ary" tree rather than a binary tree. Here, the "N" in "N-ary" depends on the size of the data block.

Take an index on an integer field in InnoDB as an example: N is roughly 1200. When the tree has height 4, it can store 1200 to the power of 3 values, which is already about 1.7 billion. Given that the data block of the root is always in memory, looking up a value in an index on an integer field of a table with 1 billion rows needs at most 3 disk accesses. In fact, the second level of the tree is also very likely to be in memory, so the average number of disk accesses is even lower.
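
The same back-of-the-envelope calculation for the N-ary tree, using the fanout of about 1200 quoted in the text (the variable names and the binary-tree comparison are added for illustration):

```python
import math

fanout = 1200            # approximate N for an InnoDB integer index, from the text
rows = 1_000_000_000     # table with 1 billion rows

# A tree of height 4 has three levels of branching below the root.
capacity_height_4 = fanout ** 3

# Smallest number of branching levels whose capacity covers the table.
levels = 0
capacity = 1
while capacity < rows:
    capacity *= fanout
    levels += 1

# For comparison: levels a balanced binary tree would need for the same rows.
binary_levels = math.ceil(math.log2(rows))
```

With these inputs `capacity_height_4` is 1,728,000,000 and `levels` is 3, against 30 levels for a binary tree, which is why the wide tree needs so few disk reads.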

Because of its read and write performance and its fit with the access pattern of disks, the N-ary tree is widely used in database engines.

Whether it is the hash table, the ordered array, or the N-ary tree, each is the product of continual iteration and optimization. As database technology has developed, data structures such as skip lists and LSM trees have also been adopted in engine design, and I will not go through them one by one here.

Keep one idea in mind: the core of a database's underlying storage is built on these data models. Whenever you meet a new database, look at its data model first; only then can you analyze, in theory, which scenarios that database suits.

So far I have spent half an article introducing different data structures and the scenarios they suit, and you may find it a bit dry. Still, I suggest you spend some more time understanding this part, since it is one of the core concepts of how databases handle data and comes up often when analyzing problems. Once you understand index models, you will find you have a clearer perspective when analyzing problems, and you will appreciate the subtlety of engine design.

Now let's move on to more practical content.

In MySQL, indexes are implemented at the storage engine layer, so there is no unified index standard; indexes work differently in different storage engines. Even when several storage engines support the same type of index, the underlying implementation may differ. Since the InnoDB storage engine is the most widely used in MySQL, I will use InnoDB as the example to analyze its index model with you.

The InnoDB index model

In InnoDB, tables are stored as indexes in primary key order; a table stored this way is called an index-organized table. As mentioned earlier, InnoDB uses the B+ tree index model, so the data is stored in B+ trees.

Each index corresponds to one B+ tree in InnoDB.

Suppose we have a table whose primary key column is ID, with a field k that has an index on it.

The table creation statement is:

mysql> create table T(
    id int primary key,
    k int not null,
    name varchar(16),
    index (k)) engine=InnoDB;

Rows R1 to R5 of the table have (ID,k) values of (100,1), (200,2), (300,3), (500,5), and (600,6). The two resulting trees are illustrated below.

Figure 4: InnoDB index structure

As the figure shows, indexes are divided by the content of their leaf nodes into primary key indexes and non-primary-key indexes.

The leaf nodes of a primary key index store the entire row. In InnoDB, a primary key index is also called a clustered index.

The leaf nodes of a non-primary-key index store the value of the primary key. In InnoDB, a non-primary-key index is also called a secondary index.

Based on the index structure above, let's discuss a question: what is the difference between a query that uses the primary key index and one that uses an ordinary index?

If the statement is select * from T where ID=500, that is, a primary key query, only the B+ tree of ID needs to be searched;
If the statement is select * from T where k=5, that is, an ordinary index query, the k index tree is searched first to get the ID value 500, and then the ID index tree is searched once more. This process is called going back to the table.

In other words, a query based on a non-primary-key index has to scan one more index tree. So we should use primary key queries in our applications wherever possible.
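
The difference between the two query paths can be mimicked with two Python dictionaries standing in for the two B+ trees. This is only a toy model: the `name` values are made up, the dictionaries ignore ordering, and k is assumed to be unique here even though the index on k in table T is not declared unique.

```python
# Rows of table T from the text as (id, k, name); the name values are invented.
rows = [(100, 1, "a"), (200, 2, "b"), (300, 3, "c"), (500, 5, "e"), (600, 6, "f")]

# Clustered (primary key) index: id -> whole row.
primary_index = {row_id: (row_id, k, name) for row_id, k, name in rows}
# Secondary index on k: k -> primary key value only.
k_index = {k: row_id for row_id, k, _ in rows}

def select_by_id(row_id):
    """Models: select * from T where ID = ?  Returns (row, trees searched)."""
    return primary_index.get(row_id), 1

def select_by_k(k):
    """Models: select * from T where k = ?  Returns (row, trees searched)."""
    row_id = k_index.get(k)            # first search: the k index tree
    if row_id is None:
        return None, 1
    return primary_index[row_id], 2    # second search: back to the table by ID
```

Both `select_by_id(500)` and `select_by_k(5)` return the same row, but the second one reports two tree searches instead of one.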

Index maintenance

To keep the index ordered, a B+ tree needs to do the necessary maintenance when new values are inserted. Taking the figure above as an example, if you insert a new row with ID 700, you only need to insert a new record after the record for R5. If the newly inserted ID is 400, it is more troublesome: the data after it has to be logically shifted to make room.

Worse still, if the data page containing R5 is already full, then according to the B+ tree algorithm a new data page has to be requested and part of the data moved over to it. This process is called a page split. In this case, performance naturally suffers.

Besides performance, page splits also affect data page utilization. Data that originally sat in one page is now spread over two, so overall space utilization drops by about 50%.
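
A toy model of a single page split is sketched below. The page capacity of 4 records and the split-in-half rule are simplifying assumptions for illustration; real InnoDB pages hold far more records and the engine's split logic is more involved.

```python
import bisect

PAGE_CAPACITY = 4   # hypothetical; chosen small so the split is easy to see

def insert_with_split(page, key):
    """Insert key into a sorted page, splitting the page in half if it is full.

    Returns the resulting list of pages (one page, or two after a split).
    """
    if len(page) < PAGE_CAPACITY:
        bisect.insort(page, key)
        return [page]
    mid = len(page) // 2
    left, right = page[:mid], page[mid:]   # the upper half moves to a new page
    bisect.insort(right if key >= right[0] else left, key)
    return [left, right]

full_page = [100, 200, 300, 500]
pages = insert_with_split(full_page, 400)
utilization = sum(len(p) for p in pages) / (len(pages) * PAGE_CAPACITY)
```

Before the insert the page was 100% full; afterwards the same data plus one record occupies two pages at 62.5% utilization in this toy setup.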

Of course, where there are splits there are merges. When two adjacent pages have very low utilization because data has been deleted, the pages are merged. Merging can be thought of as the reverse of splitting.

Based on the description of index maintenance above, let's discuss a case:

You may have seen, in some table creation guidelines, a requirement that every table creation statement must have an auto-increment primary key. Nothing is absolute, of course, so let's analyze in which scenarios an auto-increment primary key should be used and in which it should not.

An auto-increment primary key is a primary key defined on an auto-increment column, typically declared in the table creation statement as: NOT NULL PRIMARY KEY AUTO_INCREMENT.

When inserting a new record you can leave the ID unspecified, and the system takes the current maximum ID plus 1 as the ID of the new record.

In other words, inserts with an auto-increment primary key match exactly the increasing-insert scenario described earlier. Every new record is an append, which does not involve moving other records and does not trigger leaf node splits.

With a field carrying business logic as the primary key, it is often hard to guarantee ordered inserts, so the cost of writing data is relatively high.
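
To compare the two insert patterns, the following sketch loads the same 1000 keys into small sorted pages, once in increasing order and once in a fixed scattered order, and counts how many records page splits had to move. The page capacity, the append rule for the last page, and the split-in-half rule are all assumptions of this toy model, not a description of InnoDB internals.

```python
import bisect

PAGE_CAPACITY = 4  # hypothetical page size, in records

def load(keys):
    """Insert keys one by one into sorted pages; return (pages, records moved by splits)."""
    pages = [[]]
    moved = 0
    for key in keys:
        # Locate the page whose key range should contain this key.
        firsts = [p[0] for p in pages if p]
        i = max(bisect.bisect_right(firsts, key) - 1, 0)
        page = pages[i]
        if len(page) < PAGE_CAPACITY:
            bisect.insort(page, key)
        elif i == len(pages) - 1 and key > page[-1]:
            # Append case: start a fresh page, no existing record moves.
            pages.append([key])
        else:
            # Page split: the upper half of the page moves to a new page.
            mid = len(page) // 2
            left, right = page[:mid], page[mid:]
            moved += len(right)
            bisect.insort(right if key >= right[0] else left, key)
            pages[i:i + 1] = [left, right]
    return pages, moved

# Auto-increment style: keys arrive in increasing order.
sequential_pages, sequential_moved = load(range(1000))
# Business-key style: a fixed, non-sorted permutation of the same keys.
scattered_keys = [(i * 7) % 1000 for i in range(1000)]
scattered_pages, scattered_moved = load(scattered_keys)
```

In this model the increasing load never moves a record and fills every page, while the scattered load triggers splits and ends up using at least as many pages for the same data.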

Besides performance, we can also look at this from the angle of storage space. Suppose your table does have a unique field, such as a string-type ID card number. Should you use the ID card number as the primary key, or an auto-increment field?

Every leaf node of a non-primary-key index holds a primary key value. If the ID card number is the primary key, each secondary index leaf entry takes about 20 bytes for it; with an integer primary key it takes only 4 bytes, or 8 bytes for a long integer (bigint).

Clearly, the shorter the primary key, the smaller the leaf nodes of ordinary indexes, and the less space ordinary indexes take up.
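
A quick estimate shows how this adds up. The row count and the number of secondary indexes are hypothetical; the per-key sizes are the ones quoted above, and the calculation counts only the primary key values stored in secondary index leaves, ignoring all other overhead.

```python
rows = 100_000_000        # hypothetical table size
secondary_indexes = 2     # hypothetical number of secondary indexes

# Primary key sizes in bytes, as quoted in the text.
pk_bytes = {"id_card_string": 20, "int": 4, "bigint": 8}

# Bytes spent only on primary key values inside secondary index leaves.
overhead = {name: size * rows * secondary_indexes for name, size in pk_bytes.items()}
saved_vs_int = overhead["id_card_string"] - overhead["int"]
```

Under these assumptions the ID card primary key costs 4 GB in the secondary indexes against 0.8 GB for an int key, a difference of 3.2 GB.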

So in terms of both performance and storage space, an auto-increment primary key is often the more reasonable choice.

Are there scenarios where it makes sense to use a business field directly as the primary key? Yes. For example, some business scenarios have these requirements:

There is only one index;
That index must be a unique index.

You have surely recognized this as the typical KV scenario.

Since there are no other indexes, there is no need to worry about the leaf node size of other indexes.

In this case we should give priority to the "use primary key queries wherever possible" principle from the previous section and make this index the primary key directly, which avoids searching two trees on every query.

Summary

Today I walked through the data structures available to database engines with you, introduced the B+ tree structure that InnoDB uses, and explained why InnoDB chose it. The B+ tree matches the read and write characteristics of disks well and reduces the number of disk accesses per query.

Because InnoDB uses index-organized tables, I generally recommend creating an auto-increment primary key, so that non-primary-key indexes take up the least space. But nothing is absolute, and I also discussed with you the scenarios where a business-logic field works as the primary key.

Original site

Copyright notice
This article was written by [cjh-Java]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/162/202206111507484409.html