Some personal notes on MySQL indexes (partly based on the MySQL 45 Lectures course)

2022-07-29 23:53:00 【lonelyMangoo】

Overview

This post covers the data structures behind MySQL indexes, covering indexes, the leftmost prefix rule, index pushdown, and an interview question I ran into.

What is an index

An index exists to make queries faster, just like the table of contents of a book.

Data structure of the index

Hash table

A hash table is a key-value structure. Keys that hash to the same slot are chained together in a linked list (the "zipper" method, i.e. separate chaining).
Equality lookups are very efficient, but because the keys are not stored in order, range queries are slow.

Ordered array

An ordered array handles both equality and range queries well: binary search finds a value in O(log(N)). However, maintaining an ordered array is costly, because inserting a record in the middle means shifting every record after it, so ordered arrays are only suitable for static storage engines.
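To make the trade-off concrete, here is a minimal sketch in Python. The data and the helper name `range_query` are made up for illustration; a `dict` stands in for a hash index and a sorted list for an ordered array.

```python
import bisect

# Hypothetical data: (id, name) pairs, not taken from any real table.
rows = [(3, "c"), (1, "a"), (7, "g"), (5, "e"), (9, "i")]

# Hash-style index: fast equality lookups, but the keys have no order.
hash_index = {key: name for key, name in rows}

# Ordered-array index: keys are kept sorted so binary search works.
sorted_keys = sorted(key for key, _ in rows)


def range_query(keys, low, high):
    """Return all keys with low <= key <= high using two binary searches."""
    start = bisect.bisect_left(keys, low)
    end = bisect.bisect_right(keys, high)
    return keys[start:end]


equal_hit = hash_index[7]                  # equality lookup on the hash index
in_range = range_query(sorted_keys, 3, 7)  # range lookup on the ordered array

# Inserting in the middle shifts every later element: the maintenance cost.
bisect.insort(sorted_keys, 4)
```

The range query only works because the array is sorted, and keeping it sorted is exactly what makes inserts expensive.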

Binary (N-ary) tree

A binary tree can grow very tall, and an index lives not only in memory but also on disk. Each query then has to pass through many nodes, and every node visited may cost a disk read, which is slow.
An N-ary tree cuts down the number of disk reads. In InnoDB, if N is 1200 and the tree height is 4, the tree can hold 1200 × 1200 × 1200 values, which is about 1.7 billion.
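The arithmetic can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: `max_entries` and `min_binary_tree_height` are hypothetical helper names, and the fan-out of 1200 and height of 4 are the figures quoted above.

```python
def max_entries(fanout: int, height: int) -> int:
    """Upper bound on stored values, counting the root as level 1,
    so there are (height - 1) levels of branching below it."""
    return fanout ** (height - 1)


def min_binary_tree_height(entries: int) -> int:
    """Smallest number of levels a binary tree needs to hold `entries` nodes.
    A full binary tree with h levels holds 2**h - 1 nodes."""
    return entries.bit_length()


n_ary_capacity = max_entries(1200, 4)
binary_height = min_binary_tree_height(n_ary_capacity)
```

Even a perfectly balanced binary tree would need 31 levels for the same amount of data, against 4 levels for the N-ary tree.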

InnoDB: B+Tree

Each index corresponds to a B+ tree in InnoDB.
The leaf node of the primary key index stores the entire row of data, and the primary key index is also called the clustered index.
The leaf nodes of a non-primary key index store the primary key value. Non-primary key indexes are also known as secondary indexes.

What is the difference between a query based on a primary key index and a normal index?

A query on the primary key only needs to search the ID (primary key) B+ tree.
A query on a non-primary key index first searches that index's tree to find the ID, and then searches the ID index tree again. This second step is called returning to the table.
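The two lookup paths can be sketched in Python. Everything here is a toy model: plain dicts stand in for the B+ trees, and the table contents and function names are made up for illustration.

```python
# Clustered index: primary key -> full row (hypothetical data).
clustered = {
    1: {"id": 1, "name": "Zhang San", "age": 20},
    2: {"id": 2, "name": "Li Si", "age": 30},
    3: {"id": 3, "name": "Wang Wu", "age": 25},
}

# Secondary index on name: its "leaf" stores only the primary key.
name_index = {row["name"]: pk for pk, row in clustered.items()}


def find_by_id(pk):
    """Primary key query: one search, in the clustered index."""
    return clustered[pk], 1


def find_by_name(name):
    """Secondary index query: find the id first, then return to the table."""
    pk = name_index[name]   # search 1: the secondary index
    row = clustered[pk]     # search 2: the clustered index
    return row, 2


row_a, searches_a = find_by_id(2)
row_b, searches_b = find_by_name("Li Si")
```

Both calls return the same row, but the query through the secondary index pays for two tree searches instead of one.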

Why use an auto-increment bigint as the primary key?

With an auto-increment key, every insert is an append: no existing records have to be moved, and no leaf node splits are triggered.
Now suppose we use a varchar as the primary key instead. A snowflake ID stored as a string, for example, needs 19 bytes (depending on the encoding), while a bigint needs only 8 bytes. A shorter primary key means fewer leaf nodes are needed, and because every ordinary index stores the primary key in its leaf nodes, ordinary indexes take up less space as well.

Why use a B+ tree?

The tree stays short while still storing a large amount of data.
A single non-leaf page of a B+ tree can hold more keys. The more keys are read into memory at once, the fewer random disk I/Os a single query needs. (Memory holds more keys, the data is packed more densely and has better spatial locality, so accessing the associated data in the leaf nodes also gets a better cache hit rate.)
A B+ tree supports single-point queries, range queries and ordered queries well (the leaf nodes form a linked list).
B+ tree query performance is more stable: every lookup goes from the root to a leaf, so it is always O(logN).

Why not use a B-tree?

The non-leaf nodes of a B-tree also have to store data, so each page holds fewer keys, the tree gets deeper, the number of I/Os increases, and query performance drops.

Covering Index

A query through an ordinary index has to return to the table, so two index trees are searched. How can the index be optimized to avoid returning to the table?
By using a covering index.
A covering index already contains every column the query needs, so there is no return to the table. This reduces the number of tree searches and significantly improves query performance.
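Here is a small Python sketch of the idea. It is a toy model under my own assumptions: the table, the joint index on (name, age) and the `select` helper are all hypothetical, and the index is scanned linearly instead of being searched as a tree.

```python
# Clustered index: primary key -> full row (hypothetical data).
clustered = {
    1: {"id": 1, "name": "Zhang San", "age": 20, "city": "Shanghai"},
    2: {"id": 2, "name": "Li Si", "age": 30, "city": "Beijing"},
}

# Joint index on (name, age); each entry also carries the primary key.
name_age_index = sorted(
    (row["name"], row["age"], pk) for pk, row in clustered.items()
)
INDEXED_COLUMNS = {"name", "age", "id"}


def select(columns, name):
    """Return (rows, table_lookups) for: SELECT columns WHERE name = ..."""
    covered = set(columns) <= INDEXED_COLUMNS
    results, table_lookups = [], 0
    for idx_name, idx_age, pk in name_age_index:
        if idx_name != name:
            continue
        if covered:
            # Every requested column is already in the index entry.
            source = {"name": idx_name, "age": idx_age, "id": pk}
        else:
            # At least one column is missing: return to the table.
            source = clustered[pk]
            table_lookups += 1
        results.append({col: source[col] for col in columns})
    return results, table_lookups


covered_result = select(["age"], "Li Si")
uncovered_result = select(["city"], "Li Si")
```

Asking for `age` is answered from the index alone, while asking for `city` forces a return to the table for each matching entry.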

How do you decide whether a covering index is worth adding?

For a high-frequency request, a joint index makes sense because it reduces the execution time of the statement. But every index has a maintenance cost, so it is a trade-off.

Leftmost prefix rule

The leftmost prefix can be the leftmost N fields of a joint index, or the leftmost M characters of a string index. (For example, with like 'Zhang%' the index can jump straight to the first entry starting with Zhang.)
But with an index on (a, b), if a is matched with 'Zhang%', the index can no longer be used to narrow the search on b. MySQL keeps matching columns to the right until it meets a range condition (>, <, between, like) and then stops. The range column itself can use the index, but the columns after it cannot. In other words, an index can serve at most one range column, so if the query has two range conditions, the index cannot be used in full. Why? Because the index is sorted by the first column and then by the second, b is only ordered among entries that share the same a. Across a range of a values, the b values are not in order.
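The matching rule can be written down as a few lines of Python. This is a simplified sketch of the rule as described above, not of what the MySQL optimizer really does; `usable_columns` is a hypothetical helper, and "like" here means a prefix pattern such as 'Zhang%'.

```python
# Operators treated as range conditions, as listed in the text above.
RANGE_OPS = {">", "<", "between", "like"}


def usable_columns(index_columns, conditions):
    """Return the index columns that can narrow the search.

    index_columns: column names in index order, e.g. ["a", "b"].
    conditions: dict mapping a column name to the operator used on it.
    """
    used = []
    for column in index_columns:
        operator = conditions.get(column)
        if operator is None:
            break            # gap in the prefix: stop matching
        used.append(column)
        if operator in RANGE_OPS:
            break            # a range column is the last usable one
    return used


both_equal = usable_columns(["a", "b"], {"a": "=", "b": "="})
range_first = usable_columns(["a", "b"], {"a": "like", "b": "="})
no_prefix = usable_columns(["a", "b"], {"b": "="})
```

With equality on both columns the whole index is used; with a range on a, only a is used; with a condition on b alone, nothing is.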

How do I choose an index?

The first principle: if adjusting the column order lets you maintain one fewer index, that order is usually the one to prefer.

For example, a joint index on (a,b) serves queries on (a,b) and queries on (a) alone, but not queries on (b) alone.
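Why (a) works and (b) does not becomes visible once the index entries are laid out in sorted order. In this Python sketch a sorted list of tuples stands in for the leaf level of an (a,b) index; the data and the `seek_a` helper are made up for illustration.

```python
import bisect

# Leaf entries of a hypothetical (a, b) index, sorted by a and then by b.
index_ab = sorted([(2, "x"), (1, "y"), (1, "x"), (3, "y"), (2, "y")])


def seek_a(a):
    """Binary-search to the first entry with this a, then read the run."""
    # The 1-tuple (a,) sorts just before every (a, b) with the same a.
    start = bisect.bisect_left(index_ab, (a,))
    end = start
    while end < len(index_ab) and index_ab[end][0] == a:
        end += 1
    return index_ab[start:end]


entries_for_a2 = seek_a(2)

# Entries with b == "y" are scattered, so only a full scan finds them all.
positions_of_y = [i for i, (_, b) in enumerate(index_ab) if b == "y"]
```

All entries with the same a sit next to each other, so one binary search finds them. Entries with the same b are spread across the whole list, so the sort order is no help.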

Index Pushdown

Here is an example.
Suppose there is a joint index on (a,b), and the query filters on a and b and reads columns a, b and c.
The condition on a is like 'Zhang%', so b obviously cannot be used to narrow the index search.
Before MySQL 5.6: every index entry that matches 'Zhang%' goes straight back to the table.
From MySQL 5.6 on: after an entry matches 'Zhang%', the b value stored in the joint index is first compared with the b value being queried. Entries that do not match are discarded immediately, which reduces the number of returns to the table.
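The difference between the two behaviours can be simulated in Python by counting returns to the table. This is a toy model: the data, the `query` helper and its `pushdown` flag are hypothetical, and the index is scanned linearly for simplicity.

```python
# Table rows keyed by primary key (hypothetical data).
table = {
    1: {"a": "Zhang San", "b": 10, "c": "row-1"},
    2: {"a": "Zhang Si", "b": 20, "c": "row-2"},
    3: {"a": "Zhang Wu", "b": 10, "c": "row-3"},
    4: {"a": "Li Si", "b": 10, "c": "row-4"},
}

# Joint index on (a, b); each entry also carries the primary key.
index_ab = sorted((row["a"], row["b"], pk) for pk, row in table.items())


def query(prefix, b_value, pushdown):
    """WHERE a LIKE 'prefix%' AND b = b_value; returns (rows, table_lookups)."""
    rows, table_lookups = [], 0
    for a, b, pk in index_ab:
        if not a.startswith(prefix):
            continue
        if pushdown and b != b_value:
            continue              # filtered inside the index, no table access
        table_lookups += 1        # return to the table for the full row
        row = table[pk]
        if row["b"] == b_value:   # without pushdown, b is only checked here
            rows.append(row)
    return rows, table_lookups


rows_old, lookups_old = query("Zhang", 10, pushdown=False)
rows_new, lookups_new = query("Zhang", 10, pushdown=True)
```

Both runs return the same rows, but with pushdown the entry whose b does not match never triggers a return to the table.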

An interview question

I encountered the following question in my interview with Baidu before, I probably should...
A table with fields (a,b,c) has three indexes: index(a), index(b) and index(a,b,c).
When the query conditions are on a and b, is it faster to use the two separate indexes or the joint index?
I don't know what the optimizer does internally here, but the joint index definitely does not need to return to the table, so the latter is obviously faster.

Tested on 6 million (600w) rows: without a doubt, the joint index is used.

Delete the joint index and keep only the single-column indexes, and the query immediately becomes a full table scan.
After adding the query condition, an index is used, but it is still effectively a full table scan and extremely inefficient.
To sum up, covering indexes are the GOAT (yyds)!