Basic principle of LSM tree
2022-07-03 19:13:00 【xmh-sxh-1314】
LSM-tree
LSM-tree comes from a 1996 paper, 《The Log-Structured Merge-Tree (LSM-Tree)》. The content and figures here mainly come from the FAST '16 paper 《WiscKey: Separating Keys from Values in SSD-conscious Storage》.
Look at the name first: log-structured. A log is something a software system writes out, like a diary, page after page; as long as the system writes the log correctly, there is no need to modify it, only to append at the end. The write-ahead logs of various databases are also append-only, so "log-structured" essentially means append-only. Note that it is also a "merge-tree": a tree that merges, where merging means combining several pieces into one.
Enough preamble; on to the main content.
LSM-tree is designed for key-value storage systems. A key-value store mainly provides two operations: put(k, v), which writes a pair (k, v), and get(k), which, given a key k, looks up its value v.
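As a baseline, here is a minimal sketch of that interface in Python, backed by a plain dict. Everything that follows is about how an LSM-tree provides these same two operations efficiently on disk.

```python
class KVStore:
    """Toy in-memory key-value store exposing the put/get interface above."""

    def __init__(self):
        self._data = {}

    def put(self, k, v):
        # A later put for the same key shadows the earlier value.
        self._data[k] = v

    def get(self, k):
        # Returns None if the key was never written.
        return self._data.get(k)
```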
The biggest feature of LSM-tree is fast writes. It relies mainly on sequential disk writes, beating B-trees, which need random writes. For sequential versus random disk writes, see 《Various concepts of hard disks》.
The figure below shows the components of an LSM-tree. It is a multi-level structure, like a tree that is small at the top and large at the bottom. First comes the in-memory C0 level, which holds all recently written (k, v) pairs; this in-memory structure is ordered, can be updated in place at any time, and can be queried at any time. The remaining levels, C1 through Ck, all live on disk, and each level is a structure ordered by key.
[Figure: LSM-tree multi-level structure]
Write path: when a put(k, v) operation arrives, it is first appended to the write-ahead log (WAL, the log written before the actual write), then added to the C0 level. When C0 reaches a certain size, C0 is merged with C1, sorted the way a merge sort merges; this process is called compaction. The new-C1 produced by the merge is written to disk sequentially, replacing the old-C1. When C1 in turn reaches a certain size, it keeps merging with the level below. All old files can be deleted after a merge, leaving only the new one.
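The write path can be sketched in a few lines. This is a toy model (the threshold and names are illustrative, not any real library's API): append to the WAL, update an in-memory C0, and merge C0 into a sorted "on-disk" C1 when C0 fills up.

```python
import io

C0_LIMIT = 4  # tiny threshold so the example compacts quickly

class LSMTree:
    def __init__(self, wal):
        self.wal = wal    # append-only write-ahead log (any file-like object)
        self.c0 = {}      # in-memory level, updatable in place
        self.c1 = []      # "on-disk" level: list of (k, v) sorted by key

    def put(self, k, v):
        self.wal.write(f"{k}={v}\n")   # 1. append to the WAL first
        self.c0[k] = v                 # 2. then update C0 in memory
        if len(self.c0) >= C0_LIMIT:   # 3. C0 full: compact into C1
            self._compact()

    def _compact(self):
        # Merge C0 and old-C1 the way a merge sort would; C0 is newer,
        # so its values win. The result is new-C1; old-C1 is discarded.
        merged = dict(self.c1)
        merged.update(self.c0)
        self.c1 = sorted(merged.items())
        self.c0 = {}

tree = LSMTree(wal=io.StringIO())
for i in range(5):
    tree.put(f"k{i}", i)    # the 4th put triggers a compaction into C1
```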
Note that a key may be written repeatedly, and the new version must shadow the old one. Which is the new version? If I first write (a=1) and later write (a=233), then 233 is the new version. If the old version of a has already reached level Ck while a new version arrives at C0, the write does not go and check whether an older version exists below; old versions are cleaned up during compaction.
The write path basically touches only the in-memory structure; compaction can run asynchronously in the background and does not block writes.
Query path: as the write path shows, the newest data sits in C0 and the oldest in Ck. A query therefore checks C0 first; if the key k is not found there, it checks C1, and so on, level by level.
A single query may thus require several point lookups, which is somewhat slow. LSM-trees therefore mainly target write-intensive workloads with relatively few queries.
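A sketch of that lookup order, continuing the toy example (disk_levels stands in for C1 through Ck, each modeled here as a dict-like mapping):

```python
def get(k, c0, disk_levels):
    """Check the newest level first; the first hit is the latest version."""
    if k in c0:
        return c0[k]
    for level in disk_levels:   # C1, C2, ..., Ck: oldest data last
        if k in level:
            return level[k]
    return None                 # k was never written
```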
LSM-trees are used in all sorts of key-value databases, such as LevelDB and RocksDB; the distributed wide-row database Cassandra also uses an LSM-tree storage architecture.
LevelDB
Looking only at the model above, some questions remain. For example, while C0 is being merged with C1, where do new writes go? Also, merging C0 with C1 every time makes the background sorting quite a hassle. Let's take LevelDB as an example and see how a real system applies the LSM-tree idea.
The figure below shows LevelDB's architecture. First, the LSM-tree is split into three kinds of files. The first kind is the two memtables in memory: one normal memtable that receives write requests, and one immutable memtable that can no longer be modified.
[Figure: LevelDB architecture]
The other kind is the SSTables (Sorted String Tables) on disk: string tables ordered by the data's key. The SSTables form seven levels (L0 through L6), and each level's total size limit is 10 times that of the level above.
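With a 10x growth factor, the level capacities fan out quickly. A quick illustration (assuming L1 is capped around 10 MB, in line with LevelDB's defaults; the exact numbers are implementation details):

```python
# Each level's size limit is 10x the previous one's.
for i in range(1, 7):
    print(f"L{i} limit = {10 ** i:>9,} MB")
# L1 = 10 MB, L2 = 100 MB, ..., L6 = 1,000,000 MB (about 1 TB)
```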
Write path: a write operation is first appended to the write-ahead log, then written into the memtable. When the memtable is full, it is switched to an immutable memtable, and a new memtable is opened to receive new write requests. The immutable memtable can then be flushed to disk, directly as a new SSTable file at level L0; it is not merged with the existing L0 files.
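A sketch of that rotation (a toy class; the names and the entry-count threshold are illustrative, not LevelDB's actual API):

```python
class MemtableManager:
    """Toy sketch of LevelDB's memtable rotation."""

    def __init__(self, limit, flush_to_l0):
        self.limit = limit              # max entries before rotation
        self.flush_to_l0 = flush_to_l0  # callback that writes one L0 SSTable
        self.memtable = {}              # active: receives all new writes
        self.immutable = None           # frozen: read-only while flushing

    def put(self, k, v):
        self.memtable[k] = v
        if len(self.memtable) >= self.limit:
            self.immutable = self.memtable    # freeze the full memtable
            self.memtable = {}                # fresh memtable for new writes
            self.flush_to_l0(self.immutable)  # written as one new L0 file,
                                              # not merged with existing L0

mgr = MemtableManager(limit=2, flush_to_l0=lambda t: print("flush:", t))
mgr.put("a", 1); mgr.put("b", 2)   # the second put triggers the rotation
```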
Each level limits the total size of its files, and each lower level is ten times larger. Once a level's total size exceeds its threshold, one of its files is selected and merged with files of the next level. It's like playing 2048: you merge whenever a merge is possible. In 2048 that's the optimal strategy; in a storage system it is more troublesome, because a lot of data gets shuffled around, but it is not a bad thing, since it speeds up queries.
Note here that all affected files at the next level take part in the compaction. After the merge, each of levels L1 through L6 is guaranteed to be globally ordered by key; only L0's files may have overlapping key ranges.
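Picking the "affected files" means finding every next-level file whose key range overlaps the file being pushed down. A minimal sketch, modeling each file as a (min_key, max_key) pair:

```python
def overlapping(file_range, next_level):
    """Return the next-level files whose key ranges intersect file_range."""
    lo, hi = file_range
    return [(a, b) for (a, b) in next_level if not (b < lo or a > hi)]

# e.g. pushing a file covering keys "c".."f" down into L1:
print(overlapping(("c", "f"), [("a", "b"), ("d", "e"), ("g", "h")]))
# -> [('d', 'e')]  only this L1 file joins the compaction
```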
[Figure: compaction of L0 into L1]
The figure above is an example: after an immutable memtable is flushed to L0, it triggers a merge of L0 and L1; the yellow files are the ones involved. After the merge, the L0 file is deleted and L1 is updated; L1 is still globally ordered, with the three files holding the data in the order a b c d e f.
Although L0's files all sit on the same level, there is still an order among them: for the same key, data in a later file overrides data in an earlier one. How is this told apart? Each key-value pair carries a version number, so during compaction only the latest version needs to be kept.
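A sketch of that rule: tag each write with an increasing sequence number, and let compaction keep only the highest-numbered entry per key (a toy model, not LevelDB's actual record format):

```python
def compact(entries):
    """entries: (key, seq, value) triples; keep the newest version per key."""
    newest = {}
    for key, seq, value in entries:
        if key not in newest or seq > newest[key][0]:
            newest[key] = (seq, value)
    return sorted((k, s, v) for k, (s, v) in newest.items())

print(compact([("a", 1, "1"), ("b", 2, "x"), ("a", 3, "233")]))
# -> [('a', 3, '233'), ('b', 2, 'x')]  the old a=1 is dropped
```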
Query path: first check the memtable, then the immutable memtable, then all the files at L0, then continue checking level by level from L1 down to the last level.
LSM-tree read and write amplification
Read and write amplification are the main problems of LSM-trees. By definition: amplification = amount of data actually read from or written to disk / amount of data the user asked for. Note that only traffic to and from the disk counts; how many times the data is churned in memory doesn't matter. For example, if the user meant to write 1 KB of data, and you computed on it in memory for an hour and finally wrote 10 KB to disk, the write amplification is 10. Read amplification is defined analogously.
Write amplification: take RocksDB's leveled compaction as an example. This merge mechanism takes all of a level's files and merges them with the next level in one go, where the next level is r times the size of the previous one. A single merge thus has a write amplification of r (whether it is r or r+1 depends on the implementation). Here is an example.
Suppose there are three levels whose file sizes are 9, 90, and 900, with r = 10. One more unit of data is written, and the merges cascade: 1+9=10, 10+90=100, 100+900=1000, so 10+100+1000 is written in total. By rights the write amplification should be 1110/1, but that is not how the papers count it: they sum, for each merge, the output divided by the input arriving from above, i.e. 10/1 + 100/10 + 1000/100 = 30 = r × levels. I feel that write amplification is really a process, and measuring it with a single number is not very precise; besides, this is only the worst case.
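The cascade is easy to reproduce; a few lines that follow the example's numbers (assuming the worst case, where every level overflows):

```python
levels = [9, 90, 900]          # current sizes of the three levels
carry, disk_written = 1, 0     # the user writes 1 unit of data
for size in levels:
    carry += size              # merge what arrived from above with this level
    disk_written += carry      # the merged result is rewritten to disk
print(disk_written)            # 1110 -> the pessimistic 1110/1 figure
# The papers' per-merge count: 10/1 + 100/10 + 1000/100 = 30 = r * levels
```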
Read amplification: say we query 1 KB of data. In the worst case we need to read 8 files at L0 plus one file at each of L1 through L6, 14 files in total. Inside each file we must read a 16 KB index, a 4 KB bloom filter, and a 4 KB data block (don't worry if these are unfamiliar; just know that looking up one key in one SSTable costs about this much reading). In total that is 24 × 14 / 1 = 336. The smaller the key-value pairs, the larger the read amplification.
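The same arithmetic as a sanity check:

```python
files = 8 + 6              # worst case: 8 L0 files + one each in L1..L6
per_file_kb = 16 + 4 + 4   # index + bloom filter + data block, per file
lookup_kb = 1              # the user only wanted 1 KB
print(files * per_file_kb / lookup_kb)   # -> 336.0
```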