Basic principle of LSM tree
2022-07-03 19:13:00 【xmh-sxh-1314】
LSM-tree
LSM-tree comes from a 1996 paper, 《The Log-Structured Merge-Tree (LSM-Tree)》. The content and figures here mainly come from the FAST '16 paper 《WiscKey: Separating Keys from Values in SSD-conscious Storage》.
Look at the name first: log-structured. A log is something a software system writes out, like a diary, page after page; as long as the system writes the log correctly, there is no need to modify it, only to append at the end. The write-ahead logs of various databases are also append-only, so "log-structured" essentially means append-only. Note that it is also a "merge-tree": a tree that merges, where merging means combining several pieces into one.
Enough preamble; on to the main content.
LSM-tree is designed for key-value storage systems. A key-value store mainly provides two operations: put(k, v), which writes a pair (k, v), and get(k), which, given a key k, looks up its value v.
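As a baseline, here is a minimal sketch of that interface in Python, backed by a plain dict. Everything that follows is about how an LSM-tree provides these same two operations efficiently on disk.

```python
class KVStore:
    """Toy in-memory key-value store exposing the put/get interface above."""

    def __init__(self):
        self._data = {}

    def put(self, k, v):
        # A later put for the same key shadows the earlier value.
        self._data[k] = v

    def get(self, k):
        # Returns None if the key was never written.
        return self._data.get(k)
```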
The biggest feature of LSM-tree is fast writes. It relies mainly on sequential disk writes, beating B-trees, which need random writes. For sequential versus random disk writes, see 《Various concepts of hard disks》.
The figure below shows the components of an LSM-tree. It is a multi-level structure, like a tree that is small at the top and large at the bottom. First comes the in-memory C0 level, which holds all recently written (k, v) pairs; this in-memory structure is ordered, can be updated in place at any time, and can be queried at any time. The remaining levels, C1 through Ck, all live on disk, and each level is a structure ordered by key.
[Figure: LSM-tree multi-level structure]
Write path: when a put(k, v) operation arrives, it is first appended to the write-ahead log (WAL, the log written before the actual write), then added to the C0 level. When C0 reaches a certain size, C0 is merged with C1, sorted the way a merge sort merges; this process is called compaction. The new-C1 produced by the merge is written to disk sequentially, replacing the old-C1. When C1 in turn reaches a certain size, it keeps merging with the level below. All old files can be deleted after a merge, leaving only the new one.
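The write path can be sketched in a few lines. This is a toy model (the threshold and names are illustrative, not any real library's API): append to the WAL, update an in-memory C0, and merge C0 into a sorted "on-disk" C1 when C0 fills up.

```python
import io

C0_LIMIT = 4  # tiny threshold so the example compacts quickly

class LSMTree:
    def __init__(self, wal):
        self.wal = wal    # append-only write-ahead log (any file-like object)
        self.c0 = {}      # in-memory level, updatable in place
        self.c1 = []      # "on-disk" level: list of (k, v) sorted by key

    def put(self, k, v):
        self.wal.write(f"{k}={v}\n")   # 1. append to the WAL first
        self.c0[k] = v                 # 2. then update C0 in memory
        if len(self.c0) >= C0_LIMIT:   # 3. C0 full: compact into C1
            self._compact()

    def _compact(self):
        # Merge C0 and old-C1 the way a merge sort would; C0 is newer,
        # so its values win. The result is new-C1; old-C1 is discarded.
        merged = dict(self.c1)
        merged.update(self.c0)
        self.c1 = sorted(merged.items())
        self.c0 = {}

tree = LSMTree(wal=io.StringIO())
for i in range(5):
    tree.put(f"k{i}", i)    # the 4th put triggers a compaction into C1
```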
Note that a key may be written repeatedly, and the new version must shadow the old one. Which is the new version? If I first write (a=1) and later write (a=233), then 233 is the new version. If the old version of a has already reached level Ck while a new version arrives at C0, the write does not go and check whether an older version exists below; old versions are cleaned up during compaction.
The write path basically touches only the in-memory structure; compaction can run asynchronously in the background and does not block writes.
Query path: as the write path shows, the newest data sits in C0 and the oldest in Ck. A query therefore checks C0 first; if the key k is not found there, it checks C1, and so on, level by level.
A single query may thus require several point lookups, which is somewhat slow. LSM-trees therefore mainly target write-intensive workloads with relatively few queries.
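A sketch of that lookup order, continuing the toy example (disk_levels stands in for C1 through Ck, each modeled here as a dict-like mapping):

```python
def get(k, c0, disk_levels):
    """Check the newest level first; the first hit is the latest version."""
    if k in c0:
        return c0[k]
    for level in disk_levels:   # C1, C2, ..., Ck: oldest data last
        if k in level:
            return level[k]
    return None                 # k was never written
```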
LSM-trees are used in all sorts of key-value databases, such as LevelDB and RocksDB; the distributed wide-row database Cassandra also uses an LSM-tree storage architecture.
LevelDB
Looking only at the model above, some questions remain. For example, while C0 is being merged with C1, where do new writes go? Also, merging C0 with C1 every time makes the background sorting quite a hassle. Let's take LevelDB as an example and see how a real system applies the LSM-tree idea.
The figure below shows LevelDB's architecture. First, the LSM-tree is split into three kinds of files. The first kind is the two memtables in memory: one normal memtable that receives write requests, and one immutable memtable that can no longer be modified.
[Figure: LevelDB architecture]
The other kind is the SSTables (Sorted String Tables) on disk: string tables ordered by the data's key. The SSTables form seven levels (L0 through L6), and each level's total size limit is 10 times that of the level above.
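With a 10x growth factor, the level capacities fan out quickly. A quick illustration (assuming L1 is capped around 10 MB, in line with LevelDB's defaults; the exact numbers are implementation details):

```python
# Each level's size limit is 10x the previous one's.
for i in range(1, 7):
    print(f"L{i} limit = {10 ** i:>9,} MB")
# L1 = 10 MB, L2 = 100 MB, ..., L6 = 1,000,000 MB (about 1 TB)
```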
Write path: a write operation is first appended to the write-ahead log, then written into the memtable. When the memtable is full, it is switched to an immutable memtable, and a new memtable is opened to receive new write requests. The immutable memtable can then be flushed to disk, directly as a new SSTable file at level L0; it is not merged with the existing L0 files.
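A sketch of that rotation (a toy class; the names and the entry-count threshold are illustrative, not LevelDB's actual API):

```python
class MemtableManager:
    """Toy sketch of LevelDB's memtable rotation."""

    def __init__(self, limit, flush_to_l0):
        self.limit = limit              # max entries before rotation
        self.flush_to_l0 = flush_to_l0  # callback that writes one L0 SSTable
        self.memtable = {}              # active: receives all new writes
        self.immutable = None           # frozen: read-only while flushing

    def put(self, k, v):
        self.memtable[k] = v
        if len(self.memtable) >= self.limit:
            self.immutable = self.memtable    # freeze the full memtable
            self.memtable = {}                # fresh memtable for new writes
            self.flush_to_l0(self.immutable)  # written as one new L0 file,
                                              # not merged with existing L0

mgr = MemtableManager(limit=2, flush_to_l0=lambda t: print("flush:", t))
mgr.put("a", 1); mgr.put("b", 2)   # the second put triggers the rotation
```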
Each level limits the total size of its files, and each lower level is ten times larger. Once a level's total size exceeds its threshold, one of its files is selected and merged with files of the next level. It's like playing 2048: you merge whenever a merge is possible. In 2048 that's the optimal strategy; in a storage system it is more troublesome, because a lot of data gets shuffled around, but it is not a bad thing, since it speeds up queries.
Note here that all affected files at the next level take part in the compaction. After the merge, each of levels L1 through L6 is guaranteed to be globally ordered by key; only L0's files may have overlapping key ranges.
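Picking the "affected files" means finding every next-level file whose key range overlaps the file being pushed down. A minimal sketch, modeling each file as a (min_key, max_key) pair:

```python
def overlapping(file_range, next_level):
    """Return the next-level files whose key ranges intersect file_range."""
    lo, hi = file_range
    return [(a, b) for (a, b) in next_level if not (b < lo or a > hi)]

# e.g. pushing a file covering keys "c".."f" down into L1:
print(overlapping(("c", "f"), [("a", "b"), ("d", "e"), ("g", "h")]))
# -> [('d', 'e')]  only this L1 file joins the compaction
```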
[Figure: compaction of L0 into L1]
The figure above is an example: after an immutable memtable is flushed to L0, it triggers a merge of L0 and L1; the yellow files are the ones involved. After the merge, the L0 file is deleted and L1 is updated; L1 is still globally ordered, with the three files holding the data in the order a b c d e f.
Although L0's files all sit on the same level, there is still an order among them: for the same key, data in a later file overrides data in an earlier one. How is this told apart? Each key-value pair carries a version number, so during compaction only the latest version needs to be kept.
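A sketch of that rule: tag each write with an increasing sequence number, and let compaction keep only the highest-numbered entry per key (a toy model, not LevelDB's actual record format):

```python
def compact(entries):
    """entries: (key, seq, value) triples; keep the newest version per key."""
    newest = {}
    for key, seq, value in entries:
        if key not in newest or seq > newest[key][0]:
            newest[key] = (seq, value)
    return sorted((k, s, v) for k, (s, v) in newest.items())

print(compact([("a", 1, "1"), ("b", 2, "x"), ("a", 3, "233")]))
# -> [('a', 3, '233'), ('b', 2, 'x')]  the old a=1 is dropped
```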
Query path: first check the memtable, then the immutable memtable, then all the files at L0, then continue checking level by level from L1 down to the last level.
LSM-tree read and write amplification
Read and write amplification are the main problems of LSM-trees. By definition: amplification = amount of data actually read from or written to disk / amount of data the user asked for. Note that only traffic to and from the disk counts; how many times the data is churned in memory doesn't matter. For example, if the user meant to write 1 KB of data, and you computed on it in memory for an hour and finally wrote 10 KB to disk, the write amplification is 10. Read amplification is defined analogously.
Write amplification: take RocksDB's leveled compaction as an example. This merge mechanism takes all of a level's files and merges them with the next level in one go, where the next level is r times the size of the previous one. A single merge thus has a write amplification of r (whether it is r or r+1 depends on the implementation). Here is an example.
Suppose there are three levels whose file sizes are 9, 90, and 900, with r = 10. One more unit of data is written, and the merges cascade: 1+9=10, 10+90=100, 100+900=1000, so 10+100+1000 is written in total. By rights the write amplification should be 1110/1, but that is not how the papers count it: they sum, for each merge, the output divided by the input arriving from above, i.e. 10/1 + 100/10 + 1000/100 = 30 = r × levels. I feel that write amplification is really a process, and measuring it with a single number is not very precise; besides, this is only the worst case.
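The cascade is easy to reproduce; a few lines that follow the example's numbers (assuming the worst case, where every level overflows):

```python
levels = [9, 90, 900]          # current sizes of the three levels
carry, disk_written = 1, 0     # the user writes 1 unit of data
for size in levels:
    carry += size              # merge what arrived from above with this level
    disk_written += carry      # the merged result is rewritten to disk
print(disk_written)            # 1110 -> the pessimistic 1110/1 figure
# The papers' per-merge count: 10/1 + 100/10 + 1000/100 = 30 = r * levels
```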
Read amplification: say we query 1 KB of data. In the worst case we need to read 8 files at L0 plus one file at each of L1 through L6, 14 files in total. Inside each file we must read a 16 KB index, a 4 KB bloom filter, and a 4 KB data block (don't worry if these are unfamiliar; just know that looking up one key in one SSTable costs about this much reading). In total that is 24 × 14 / 1 = 336. The smaller the key-value pairs, the larger the read amplification.
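The same arithmetic as a sanity check:

```python
files = 8 + 6              # worst case: 8 L0 files + one each in L1..L6
per_file_kb = 16 + 4 + 4   # index + bloom filter + data block, per file
lookup_kb = 1              # the user only wanted 1 KB
print(files * per_file_kb / lookup_kb)   # -> 336.0
```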