04 _ Indexes in Plain Terms (Part 1)
2022-06-11 15:12:00 【cjh-Java】
When it comes to database indexes, I doubt they are new to you; you run into them often in daily work. For example, when a SQL query is slow, after analyzing the cause you might propose a fix such as "add an index to a certain field". But what exactly is an index, and how does it work? That is today's topic.
There is a lot to cover on database indexes, so I have split it into two articles. The index is one of the most important concepts in a database system, so I hope you will be patient. In later hands-on articles I will often refer back to the points made in these two articles to deepen your understanding of indexes.
Put simply, an index exists to make data queries more efficient, just like the table of contents of a book. In a 500-page book, if you want to quickly find a particular topic without the help of a table of contents, I suspect it will take you a while. Likewise, for a table in a database, the index is its "table of contents".
Common index models
Indexes exist to improve query efficiency, but there are many ways to implement one, so here I introduce the concept of an index model. Many data structures can be used to improve read and write efficiency. I will introduce three that are common and relatively simple: hash tables, ordered arrays, and search trees.
Below, mainly from a usage perspective, I briefly analyze the differences between these three models.
A hash table stores data as key-value pairs: give it the key you are looking for and it finds the corresponding value. The idea is simple: values are kept in an array, a hash function converts the key into a definite position, and the value is placed at that position in the array.
Inevitably, several different keys will produce the same value after hashing. One way to handle this is to hang a linked list off that position.
Suppose you maintain a table of ID card numbers and names and need to look up the name for a given ID card number. The corresponding hash index is shown in the figure below:

In the figure, the hash values computed from the ID card numbers of User2 and User4 are both N, but that is fine: a linked list follows that slot. Suppose you now need to find the name for ID_card_n2. The steps are: first, run ID_card_n2 through the hash function to get N; then, traverse the list in order until you find User2.
Note that the four ID_card_n values in the figure are not stored in increasing order. The advantage is that adding a new User is fast: you just append it. The disadvantage is that, because the entries are not ordered, range queries on a hash index are very slow.
As you can imagine, if you now want to find all users whose ID card numbers fall in the range [ID_card_X, ID_card_Y], you have to scan everything.
Therefore, the hash table structure suits scenarios with only equality queries, such as Memcached and some other NoSQL engines.
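The behaviour described above can be sketched in a few lines of Python. This is only a toy illustration, not how any real engine implements its hash index: the class name `HashIndex`, the bucket count, and the sample data are all assumptions made for the example.

```python
class HashIndex:
    """Toy hash index: an array of buckets, each bucket a list (the "chain")."""

    def __init__(self, num_buckets=8):
        # num_buckets is an arbitrary, hypothetical size.
        self.buckets = [[] for _ in range(num_buckets)]

    def _slot(self, key):
        # The hash function maps a key to a position in the bucket array.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        # New entries are appended to the chain, so inserts are cheap.
        # (Duplicate keys are not handled in this sketch.)
        self.buckets[self._slot(key)].append((key, value))

    def get(self, key):
        # Equality lookup: hash to one bucket, then walk its chain in order.
        for k, v in self.buckets[self._slot(key)]:
            if k == key:
                return v
        return None

    def range(self, low, high):
        # Keys are not stored in order, so a range query scans every bucket.
        hits = [(k, v) for bucket in self.buckets for k, v in bucket
                if low <= k <= high]
        return sorted(hits)


index = HashIndex()
for id_card, name in [("ID_card_n1", "User1"), ("ID_card_n2", "User2"),
                      ("ID_card_n3", "User3"), ("ID_card_n4", "User4")]:
    index.put(id_card, name)

print(index.get("ID_card_n2"))
print(index.range("ID_card_n2", "ID_card_n3"))
```

Note how `get` touches a single bucket while `range` has to visit all of them, which is the asymmetry the text points out.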
An ordered array, by contrast, performs very well for both equality queries and range queries. Taking the same example of looking up names by ID card number, an ordered-array implementation looks like the figure below:

Here we assume ID card numbers are unique and the array is kept in increasing order of ID card number. To find the name for ID_card_n2, binary search gets you there quickly, with time complexity O(log(N)).
This index structure also clearly supports range queries. To find the Users whose ID card numbers are in [ID_card_X, ID_card_Y], first use binary search to find ID_card_X (if ID_card_X does not exist, find the first User greater than ID_card_X), then walk to the right until you reach the first ID card number greater than ID_card_Y, and exit the loop.
If you look only at query efficiency, the ordered array is the best data structure. Updating it is the trouble: inserting a record in the middle means moving every record after it, which costs too much.
Therefore, ordered-array indexes suit only static storage engines. For example, if you need to store all the population information of a city for 2017, that data will never be modified again.
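As a rough sketch of the two lookups just described, the following Python uses the standard `bisect` module for the binary search. The parallel lists `ids` and `names` and the function names are assumptions made for illustration.

```python
from bisect import bisect_left

# Kept sorted by ID card number; keys and names are made up for illustration.
ids = ["ID_card_n1", "ID_card_n2", "ID_card_n3", "ID_card_n4"]
names = ["User1", "User2", "User3", "User4"]


def lookup(id_card):
    # Equality query: binary search for the key, O(log N).
    pos = bisect_left(ids, id_card)
    if pos < len(ids) and ids[pos] == id_card:
        return names[pos]
    return None


def range_lookup(low, high):
    # Binary search for the first key >= low, then walk right
    # until the first key greater than high.
    pos = bisect_left(ids, low)
    result = []
    while pos < len(ids) and ids[pos] <= high:
        result.append(names[pos])
        pos += 1
    return result


print(lookup("ID_card_n2"))
print(range_lookup("ID_card_n2", "ID_card_n3"))
```

The update cost mentioned in the text shows up as soon as you insert into `ids`: a Python list, like any array, has to shift every element after the insertion point.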
The binary search tree is another classic textbook data structure. Taking the same example of looking up names by ID card number, a binary-search-tree implementation looks like the figure below:

The defining property of a binary search tree is that every node in a parent's left subtree has a smaller value than the parent, and every node in its right subtree has a larger value. So to look up ID_card_n2, the search order in the figure follows the path UserA -> UserC -> UserF -> User2. The time complexity is O(log(N)).
Of course, to keep query complexity at O(log(N)) you need to keep the tree balanced, and to make that guarantee, the time complexity of an update is also O(log(N)).
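To make the O(log(N)) claim concrete, here is a small Python sketch that builds a balanced binary search tree from sorted keys and counts how many nodes a lookup visits. It sidesteps rebalancing on update (the tree is built once, already balanced), and the `Node`, `build_balanced`, and `search` names are assumptions for the example.

```python
class Node:
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value = key, value
        self.left, self.right = left, right


def build_balanced(items):
    # Build a balanced BST from items already sorted by key:
    # the middle element becomes the root of each subtree.
    if not items:
        return None
    mid = len(items) // 2
    key, value = items[mid]
    return Node(key, value,
                build_balanced(items[:mid]),
                build_balanced(items[mid + 1:]))


def search(root, key):
    # Walk down from the root; each step discards one whole subtree.
    visited = 0
    node = root
    while node is not None:
        visited += 1
        if key == node.key:
            return node.value, visited
        node = node.left if key < node.key else node.right
    return None, visited


root = build_balanced([(i, f"User{i}") for i in range(1, 1001)])
value, visited = search(root, 777)
print(value, visited)
```

With 1000 keys the tree built this way has height 10, so no lookup visits more than 10 nodes.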
A tree can be binary or it can have more branches. In a multiway tree each node has multiple children, and the children are ordered so their values increase from left to right. The binary tree has the highest search efficiency, yet most database storage does not use binary trees. The reason is that indexes do not live only in memory; they also have to be written to disk.
Picture a balanced binary tree with 1 million nodes: its height is 20, so one query may have to access 20 data blocks. In the era of mechanical hard disks, randomly reading one data block from disk took about 10 ms of seek time. In other words, for a table of 1 million rows stored in a binary tree, accessing a single row could take 20 reads of 10 ms each. That query is really slow.
To make a query read the disk as little as possible, the query must access as few data blocks as possible. So instead of a binary tree we should use an "N-ary" tree, where N depends on the size of the data block.
Take an index on an integer field in InnoDB as an example: N is roughly 1200. When the tree has height 4, it can hold 1200 to the power of 3 values, which is already about 1.7 billion. Since the data block of the root is always in memory, looking up a value in an index on an integer field of a table with 1 billion rows takes at most 3 disk accesses. In fact, the second level of the tree is also very likely to be in memory, so the average number of disk accesses is lower still.
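The arithmetic behind these numbers is easy to check. The sketch below simply recomputes the figures quoted in the text (1 million rows, 10 ms per random read, fan-out of about 1200); the variable names are mine.

```python
import math

SEEK_MS = 10  # approximate random-read cost on a mechanical disk, per the text

# Balanced binary tree over 1 million rows: height is about log2(1,000,000).
binary_height = math.ceil(math.log2(1_000_000))
binary_ms = binary_height * SEEK_MS

# N-ary tree with fan-out 1200 and height 4: three levels of branching
# below the root, so 1200 ** 3 values can be reached.
fanout = 1200
capacity = fanout ** 3

# The root block stays in memory, so at most 3 levels are read from disk.
nary_ms = 3 * SEEK_MS

print(binary_height, binary_ms)   # 20 200
print(capacity)                   # 1728000000
print(nary_ms)                    # 30
```

So the binary tree needs about 200 ms of seeks for one row, while the N-ary tree covers more than a billion rows with roughly 30 ms at worst.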
Because of its read and write performance and the way it fits disk access patterns, the N-ary tree is widely used in database engines.
Whether it is a hash table, an ordered array, or an N-ary tree, each is the product of continuous iteration and optimization. As database technology has developed, data structures such as skip lists and LSM trees have also found their way into engine design; I will not go through them one by one here.
Keep one idea in mind: the core of a database's underlying storage rests on these data models. Whenever you meet a new database, look at its data model first; only then can you reason about which scenarios it suits.
Up to this point I have spent half an article introducing different data structures and where they apply, and you may find it a little dry. Still, I suggest you spend more time understanding this part. It is one of the core concepts in how databases handle data, and it comes up constantly when analyzing problems. Once you understand index models, you will analyze problems from a clearer perspective and appreciate the subtlety of engine design.
Now let's move on to more practical material.
In MySQL, indexes are implemented at the storage engine layer, so there is no unified index standard: different storage engines implement indexes differently. Even when several storage engines support the same type of index, the underlying implementations may differ. Since the InnoDB storage engine is the most widely used in MySQL, I will use InnoDB as the example and analyze its index model with you.
InnoDB's index model
In InnoDB, tables are stored as indexes ordered by primary key; tables stored this way are called index-organized tables. As mentioned earlier, InnoDB uses the B+ tree index model, so the data is stored in B+ trees.
Each index corresponds to one B+ tree in InnoDB.
Suppose we have a table whose primary key column is ID, with a field k that has an index on it.
The table creation statement is:
mysql> create table T(
    id int primary key,
    k int not null,
    name varchar(16),
    index (k)) engine=InnoDB;
The (ID,k) values of rows R1~R5 in the table are (100,1), (200,2), (300,3), (500,5), and (600,6). The two trees are sketched in the figure below.

As the figure shows, indexes fall into two types according to what their leaf nodes contain: primary key indexes and non-primary-key indexes.
The leaf nodes of the primary key index store entire rows. In InnoDB, the primary key index is also called the clustered index.
The leaf nodes of a non-primary-key index store the primary key value. In InnoDB, non-primary-key indexes are also called secondary indexes.
Given this index structure, let's discuss a question: how does a query on the primary key index differ from a query on an ordinary index?
- If the statement is select * from T where ID=500, a primary key query, only the ID B+ tree needs to be searched;
- If the statement is select * from T where k=5, an ordinary index query, the k index tree is searched first to get the ID value 500, and then the ID index tree is searched once more. This process is called going back to the table.
In other words, a query on a non-primary-key index has to scan one more index tree. So we should use primary key queries in our applications whenever we can.
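The difference between the two query paths can be mimicked in Python. In this sketch plain dictionaries stand in for the two B+ trees, which is a deliberate simplification; the `name` values and the function names are invented for illustration, and the second return value just counts how many "trees" were searched.

```python
# Stand-in for the clustered (primary key) index: ID -> whole row.
# Rows follow R1~R5 from the text; the name values are made up.
primary_index = {
    100: {"id": 100, "k": 1, "name": "a"},
    200: {"id": 200, "k": 2, "name": "b"},
    300: {"id": 300, "k": 3, "name": "c"},
    500: {"id": 500, "k": 5, "name": "d"},
    600: {"id": 600, "k": 6, "name": "e"},
}

# Stand-in for the secondary index on k: k -> primary key value only.
k_index = {row["k"]: row["id"] for row in primary_index.values()}


def select_by_id(id_value):
    # Primary key query: one search, and the leaf already holds the row.
    return primary_index.get(id_value), 1


def select_by_k(k_value):
    # Ordinary index query: search the k index for the primary key value...
    id_value = k_index.get(k_value)
    if id_value is None:
        return None, 1
    # ...then go back to the table by searching the primary key index.
    return primary_index[id_value], 2


print(select_by_id(500))
print(select_by_k(5))
```

Both calls return the same row, but the lookup by `k` needs two searches where the lookup by `ID` needs one.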
Index maintenance
To keep the index ordered, a B+ tree has to do some maintenance whenever a new value is inserted. Using the figure above as an example: if you insert a new row with ID 700, you only need to add a new record after R5. If the newly inserted ID is 400, things are more troublesome: the data after it has to be logically shifted to make room.
Worse still, if the data page holding R5 is already full, the B+ tree algorithm requires allocating a new data page and moving part of the data over to it. This process is called a page split, and performance naturally suffers.
Beyond performance, page splits also affect data page utilization. Data that used to fit on one page is now spread across two, so overall space utilization drops by about 50%.
Where there are splits there are also merges. When two adjacent pages end up with low utilization because data has been deleted, the pages are merged. Merging can be thought of as the reverse of splitting.
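A small simulation can illustrate how insertion order affects splits and page utilization. This is a deliberately simplified model and not InnoDB's actual algorithm: leaf pages are plain sorted lists, `PAGE_CAPACITY` is a tiny hypothetical value, the target page is found by a linear scan rather than by descending a tree, and an append past the end of a full last page is assumed to start a fresh page without moving any data.

```python
import random
from bisect import insort

PAGE_CAPACITY = 4  # hypothetical and tiny on purpose


def insert(pages, key, stats):
    # Find the first page whose largest key is >= key; default to the last page.
    idx = len(pages) - 1
    for i, page in enumerate(pages):
        if page and key <= page[-1]:
            idx = i
            break
    page = pages[idx]
    if len(page) < PAGE_CAPACITY:
        insort(page, key)
        return
    if idx == len(pages) - 1 and key > page[-1]:
        # Append past the end of a full last page: new page, nothing moved.
        pages.append([key])
        return
    # Page split: move the upper half of the full page to a new page.
    mid = len(page) // 2
    new_page = page[mid:]
    del page[mid:]
    pages.insert(idx + 1, new_page)
    stats["splits"] += 1
    insort(new_page if key > page[-1] else page, key)


def load(keys):
    pages, stats = [[]], {"splits": 0}
    for key in keys:
        insert(pages, key, stats)
    utilization = len(keys) / (len(pages) * PAGE_CAPACITY)
    return stats["splits"], utilization


keys = list(range(1, 1001))
seq_splits, seq_util = load(keys)          # increasing inserts

shuffled = keys[:]
random.Random(0).shuffle(shuffled)
rnd_splits, rnd_util = load(shuffled)      # same keys, random order

print("sequential:", seq_splits, seq_util)
print("random:    ", rnd_splits, rnd_util)
```

Under these assumptions, increasing inserts fill every page completely and never split, while the same keys inserted in random order trigger splits and leave pages partly empty.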
Based on this description of index maintenance, let's discuss a case:
You may have seen table design guidelines that require every table creation statement to include an auto-increment primary key. Nothing is absolute, of course, so let's analyze which scenarios call for an auto-increment primary key and which do not.
An auto-increment primary key is a primary key defined on an auto-increment column, typically declared in the table creation statement as: NOT NULL PRIMARY KEY AUTO_INCREMENT.
When inserting a new record you can leave the ID unspecified; the system takes the current maximum ID plus 1 as the ID of the next record.
In other words, inserts with an auto-increment primary key match the increasing-insert scenario described above. Every new record is an append: no other records are moved and no leaf node split is triggered.
With a field that carries business logic as the primary key, ordered insertion is often hard to guarantee, so writing data costs relatively more.
Besides performance, we can also look at storage space. Suppose your table does have a unique field, such as a string-typed ID card number. Should you use the ID card number as the primary key, or an auto-increment field?
Every leaf node of a non-primary-key index holds a primary key value. If the ID card number is the primary key, each secondary index leaf entry takes about 20 bytes for it, whereas an integer primary key needs only 4 bytes, or 8 bytes for a long integer (bigint).
Clearly, the shorter the primary key, the smaller the leaf nodes of ordinary indexes, and the less space those indexes take.
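A back-of-the-envelope calculation shows how this adds up. The row count below is a hypothetical figure chosen for illustration, and the function only counts the bytes spent on primary key copies in secondary index leaves, ignoring every other per-record overhead.

```python
def pk_copy_bytes(rows, pk_bytes, num_secondary_indexes=1):
    # Every secondary index leaf entry stores one copy of the primary key value.
    return rows * pk_bytes * num_secondary_indexes


rows = 100_000_000  # hypothetical table size

for label, pk_bytes in [("ID card number string (about 20 bytes)", 20),
                        ("int (4 bytes)", 4),
                        ("bigint (8 bytes)", 8)]:
    total = pk_copy_bytes(rows, pk_bytes)
    print(f"{label}: {total / 1024 ** 2:.0f} MiB per secondary index")
```

With these assumed numbers, the string primary key costs five times as much space per secondary index as an `int` key, and the gap is multiplied by the number of secondary indexes.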
So, in terms of both performance and storage space, an auto-increment primary key is often the more reasonable choice.
Are there scenarios where using a business field directly as the primary key is appropriate? Yes. For example, some business scenarios have these requirements:
- There is only one index;
- That index must be a unique index.
You have surely spotted it: this is the typical KV scenario.
Since there are no other indexes, there is no need to worry about the leaf node size of other indexes.
In this case we should give priority to the "use primary key queries whenever possible" principle from the previous section and make this index the primary key directly, which avoids searching two trees on every query.
Summary
Today I analyzed with you the data structures available to database engines, introduced the B+ tree structure that InnoDB uses, and explained why InnoDB made that choice. The B+ tree fits the read and write characteristics of disks well and reduces the number of disk accesses per query.
Because InnoDB uses index-organized tables, I generally recommend creating an auto-increment primary key, so that non-primary-key indexes take up the least space. But nothing is absolute, and I also discussed with you the scenarios where a business logic field makes sense as the primary key.