MongoDB internal storage principles

2022-07-07 13:11:00 cui_ yonghua

Basics (enough to solve about 80% of problems):

  1. MongoDB overview, application scenarios, download options, connection methods, and development history

  2. MongoDB data types, key concepts, and commonly used shell commands

  3. Summary of insert, update, and delete operations on MongoDB documents

  4. Summary of MongoDB query operations

  5. Summary of MongoDB operations on columns (fields)

  6. Summary of index operations in MongoDB

Advanced:

  1. Summary of MongoDB aggregation operations

  2. Summary of MongoDB import, export, backup, and recovery

  3. Summary of MongoDB user management

  4. Summary of MongoDB replication (replica sets)

  5. Summary of MongoDB sharding

  6. MongoDB meets Spark (integration)

  7. MongoDB internal storage principles

Other:

  1. Examples of working with MongoDB from Python 3

  2. MongoDB command summary

Storage engine

This article covers WiredTiger, MongoDB's default storage engine.

WiredTiger architecture

A WiredTiger write goes to the cache first and is persisted to the WAL (write-ahead log). Every 60 s WiredTiger performs a checkpoint, which persists the current data and produces a new snapshot. When a WiredTiger connection is initialized, it first restores the data to the state of the latest snapshot (the last checkpoint) and then replays the WAL to recover the changes made after it, which ensures storage reliability.
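The flow described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not WiredTiger's implementation: the class `TinyStore` and its method names are hypothetical, and plain dicts and lists stand in for the cache, the on-disk snapshot, and the log.

```python
class TinyStore:
    """Toy model of cache + WAL + checkpoint (all names are hypothetical)."""

    def __init__(self):
        self.cache = {}       # in-memory state, lost on a crash
        self.wal = []         # durable log of writes since the last checkpoint
        self.snapshot = {}    # durable state written by the last checkpoint

    def put(self, key, value):
        # A write lands in the cache and is also recorded in the log.
        self.cache[key] = value
        self.wal.append((key, value))

    def checkpoint(self):
        # Persist the current data as a new snapshot; the older log
        # entries are no longer needed for recovery.
        self.snapshot = dict(self.cache)
        self.wal.clear()

    def crash(self):
        # Simulate losing everything that lived only in memory.
        self.cache = {}

    def recover(self):
        # Start from the latest snapshot, then replay the log on top of it.
        self.cache = dict(self.snapshot)
        for key, value in self.wal:
            self.cache[key] = value


store = TinyStore()
store.put("a", 1)
store.checkpoint()
store.put("b", 2)      # only in the cache and the log
store.crash()
store.recover()
```

After `recover()`, the cache again holds both `a` (from the snapshot) and `b` (from the replayed log), which is the reliability property the paragraph describes.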

B-tree vs. B+ tree

Queries that traverse data are fairly common, but the design argument usually given for MongoDB is that looking up a single record is far more common than traversing data. Because the non-leaf nodes of a B-tree can also store data, a lookup can stop before it reaches a leaf, so the average number of random I/Os needed to find one record is lower than with a B+ tree. By this argument, a MongoDB that uses a B-tree answers point queries faster than MySQL in comparable scenarios.

This does not mean MongoDB cannot traverse data. A range query can still return a batch of records that match a condition; it just takes longer than in MySQL. MySQL assumes that queries traversing data are common, so it chose the B+ tree as its underlying data structure.
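The point-query argument can be illustrated with two tiny hand-built trees. This is a sketch under stated assumptions: `Node`, `btree_get`, and `bplus_get` are hypothetical names, the trees are not balanced by any insertion algorithm, and the number of nodes visited is used as a stand-in for random I/O.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list
    values: list = field(default_factory=list)    # empty in B+ tree internal nodes
    children: list = field(default_factory=list)  # empty in leaves


def btree_get(node, key, visits=1):
    """B-tree lookup: any node, including the root, may hold the record."""
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i], visits
    if not node.children:
        return None, visits
    return btree_get(node.children[i], key, visits + 1)


def bplus_get(node, key, visits=1):
    """B+ tree lookup: internal keys only route; records live in leaves."""
    if node.children:
        i = bisect_right(node.keys, key)
        return bplus_get(node.children[i], key, visits + 1)
    i = bisect_left(node.keys, key)
    if i < len(node.keys) and node.keys[i] == key:
        return node.values[i], visits
    return None, visits


# The same five records stored in both layouts.
btree_root = Node(
    keys=[30], values=[130],
    children=[Node([10, 20], [110, 120]), Node([50, 80], [150, 180])],
)
bplus_root = Node(
    keys=[30],
    children=[Node([10, 20], [110, 120]), Node([30, 50, 80], [130, 150, 180])],
)

print(btree_get(btree_root, 30))  # found in the root: one node visited
print(bplus_get(bplus_root, 30))  # must descend to a leaf: two nodes visited
```

For keys that live in a leaf, such as 50, both layouts visit the same number of nodes; the B-tree only saves work for keys stored higher up. A B+ tree usually also links its leaves together, which is what makes range scans cheap there.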

Cache

WiredTiger uses both its own internal cache and the file system cache. By default, the internal cache takes the larger of 50% of (RAM - 1 GB) or 256 MB, and the file system cache uses all RAM that is currently available.
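As a quick worked example of the sizing rule above, the default can be computed as follows. The function name `default_cache_bytes` is hypothetical; only the formula comes from the text.

```python
GB = 1024 ** 3
MB = 1024 ** 2


def default_cache_bytes(total_ram_bytes):
    """Larger of 50% of (RAM - 1 GB) and 256 MB, per the rule above."""
    return max((total_ram_bytes - GB) // 2, 256 * MB)


for ram_gb in (1, 4, 16):
    size = default_cache_bytes(ram_gb * GB)
    print(f"{ram_gb} GB RAM: {size / GB:.2f} GB internal cache")
```

On a machine with 4 GB of RAM this gives 1.5 GB, while on a machine with 1 GB the 256 MB floor applies.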

WiredTiger's cache is organized as B-trees. Each B-tree node is a page: the root page is the root node of the B-tree, internal pages are its intermediate index nodes, and leaf pages are the leaf nodes that actually store the data. B-tree data is loaded from disk, or written to disk, on demand and one page at a time. Each page of the B-tree is stored in the file as an extent, identified by a file offset and a size.
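The idea of addressing a page by an extent (file offset plus size) can be sketched as below. This is a minimal illustration, assuming an in-memory `io.BytesIO` as a stand-in for the data file; `Extent`, `append_page`, and `read_page` are hypothetical names, not WiredTiger APIs.

```python
import io
from collections import namedtuple

# An extent identifies a page on disk by where it starts and how long it is.
Extent = namedtuple("Extent", ["offset", "size"])


def append_page(f, payload):
    """Write a page at the end of the file and return its extent."""
    f.seek(0, io.SEEK_END)
    offset = f.tell()
    f.write(payload)
    return Extent(offset, len(payload))


def read_page(f, extent):
    """Load exactly one page from the file, on demand."""
    f.seek(extent.offset)
    return f.read(extent.size)


data_file = io.BytesIO()
leaf_a = append_page(data_file, b"leaf-1")
leaf_b = append_page(data_file, b"leaf-22")
print(leaf_b, read_page(data_file, leaf_b))
```

A parent page only needs to keep the child's extent to be able to load that child later.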

Page

ROW_ARRAY: each array element (wt_row) stores the location and encoding of that K/V row's cell within the buffer holding the on-disk page image (WiredTiger defines this location and encoding as a wt_cell object). With this offset information, the K/V content can be read from the buffer.
ROW_UPDATE_ARRAY: an array of MVCC list objects, one per wt_row. Each MVCC list stores the modifications made to its wt_row, which include both value updates and value deletions. It is a lock-free singly linked list.

Write operations

  1. Traverse the B-tree to find the page that needs to be updated.
  2. If the page is not in the cache, load it from disk; its key-value pairs are stored in WT_ROW.
  3. For an insert, update WT_INSERT; for an update or delete, update WT_UPDATE.
  4. If necessary, write the operation record to the journal.
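The four steps above can be sketched as follows. This is a simplified model under stated assumptions, not WiredTiger code: `Page`, `Engine`, and `TOMBSTONE` are hypothetical names, a sorted list of page boundaries stands in for the B-tree, and plain dicts and lists stand in for WT_ROW, WT_UPDATE, and WT_INSERT.

```python
from bisect import bisect_right

TOMBSTONE = object()  # marks a deleted value in an update chain


class Page:
    """Simplified in-memory image of one leaf page."""

    def __init__(self, rows):
        self.rows = dict(rows)                  # key -> on-disk value (WT_ROW role)
        self.updates = {k: [] for k in rows}    # key -> changes, newest first (WT_UPDATE role)
        self.inserts = {}                       # new key -> changes (WT_INSERT role)


class Engine:
    def __init__(self, disk_pages):
        self.disk_pages = disk_pages            # {lowest key of page: {key: value}}
        self.bounds = sorted(disk_pages)        # stand-in for the B-tree index
        self.cache = {}                         # pages currently in memory
        self.journal = []

    def find_page(self, key):
        # Step 1: find the page whose key range covers `key`.
        low = self.bounds[max(bisect_right(self.bounds, key) - 1, 0)]
        # Step 2: on a cache miss, load the page from disk.
        if low not in self.cache:
            self.cache[low] = Page(self.disk_pages[low])
        return self.cache[low]

    def write(self, op, key, value=None, journaled=True):
        page = self.find_page(key)
        if op == "insert":
            # Step 3a: inserts go to the insert structure.
            page.inserts.setdefault(key, []).insert(0, value)
        else:
            # Step 3b: updates and deletes go to the key's update chain.
            chain = page.updates[key] if key in page.updates else page.inserts[key]
            chain.insert(0, TOMBSTONE if op == "delete" else value)
        # Step 4: record the operation in the journal if required.
        if journaled:
            self.journal.append((op, key, value))


engine = Engine({0: {2: 102, 10: 110}, 100: {150: 1}})
engine.write("update", 2, 402)
engine.write("insert", 3, 203)
engine.write("delete", 10)
```

Only the page covering keys below 100 is loaded into the cache here, because none of the writes touches the second page.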

Here is an example.
Suppose a page covers the key range [0, 100], and the rows originally stored on disk have keys 2, 10, 20, 30, 50, 80, 90 with values 102, 110, 120, 130, 150, 180, 190.
After the page is read from disk into memory, the value of key = 2 is modified twice, to 402 and then to 502. The values of key = 20 and key = 50 are each modified once, to 122 and 155. New keys 3, 5, 41, and 99 are then inserted with values 203, 205, 241, and 299.
[Figure: how the page organizes this data in memory]

Two adjacent wt_row entries are not necessarily contiguous, so new entries can be inserted between them; for example, keys 3 and 5 can be inserted between row1 (key = 2) and row2 (key = 10). The gap between two rows therefore needs a sorted data structure to hold the inserted K/V pairs (WiredTiger uses a skip list), and all that is required is an array of skip list objects, page_insert_array, that corresponds to the row array. Note skiplist8 in the red box in Figure 6: it stores the data inserted into the range before row1 (key = 2). If a record with key = 1 is inserted, it is added to skiplist8.

The correspondence between rows and insert skip lists in the figure is therefore:

  • The range before row1 corresponds to insert skiplist8
  • The range between row1 and row2 corresponds to insert skiplist1
  • The range between row2 and row3 corresponds to insert skiplist3
  • The range after row7 corresponds to insert skiplist7
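The example above can be reproduced with a short sketch that assigns each inserted key to the gap it falls into. This is an illustration under stated assumptions: the gap numbering (0 for the range before row1, i for the range between row i and row i+1, 7 for the range after row7) is chosen here for convenience and is not the figure's skip list numbering, and plain dicts stand in for the skip lists.

```python
from bisect import bisect_left

row_keys = [2, 10, 20, 30, 50, 80, 90]             # row1 .. row7 as stored on disk
row_values = [102, 110, 120, 130, 150, 180, 190]

# One update chain per modified row, newest value first.
update_chains = {2: [502, 402], 20: [122], 50: [155]}

# One insert list per gap between rows (a dict stands in for a skip list).
insert_lists = [dict() for _ in range(len(row_keys) + 1)]
for key, value in [(3, 203), (5, 205), (41, 241), (99, 299)]:
    insert_lists[bisect_left(row_keys, key)][key] = value


def read(key):
    """Return the newest visible value for `key`, or None if it is absent."""
    i = bisect_left(row_keys, key)
    if i < len(row_keys) and row_keys[i] == key:
        chain = update_chains.get(key)
        return chain[0] if chain else row_values[i]
    return insert_lists[i].get(key)


for gap, entries in enumerate(insert_lists):
    if entries:
        print(gap, sorted(entries.items()))
```

Keys 3 and 5 land in the gap between row1 and row2, key 41 in the gap between row4 (key = 30) and row5 (key = 50), and key 99 in the range after row7. A key of 1 would fall into gap 0, the range before row1.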

Checkpoint

A checkpoint contains the following metadata:

  • root page address: made up of the file offset, the size, and a checksum of the content
  • alloc extent list address: stores the list of extents newly allocated since the last checkpoint
  • discard extent list address: stores the list of extents discarded since the last checkpoint
  • available extent list address: stores the list of extents available for allocation; only the latest checkpoint includes this list
  • file size: to restore the state of this checkpoint, truncate the file to this size
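The metadata listed above can be modelled roughly as follows. This is a sketch under stated assumptions: `Address`, `Checkpoint`, and `restore` are hypothetical names, `zlib.crc32` is used only as a stand-in checksum, and an in-memory `io.BytesIO` stands in for the data file.

```python
import io
import zlib
from dataclasses import dataclass, field


@dataclass
class Address:
    offset: int     # where the page starts in the file
    size: int       # length of the page in bytes
    checksum: int   # checksum of the page content


@dataclass
class Checkpoint:
    root: Address
    file_size: int
    alloc: list = field(default_factory=list)      # extents allocated since the last checkpoint
    discard: list = field(default_factory=list)    # extents discarded since the last checkpoint
    available: list = field(default_factory=list)  # extents free for allocation


def restore(f, ckpt):
    """Roll the file back to the checkpoint and return its verified root page."""
    f.truncate(ckpt.file_size)          # drop everything written after the checkpoint
    f.seek(ckpt.root.offset)
    root = f.read(ckpt.root.size)
    if zlib.crc32(root) != ckpt.root.checksum:
        raise ValueError("root page checksum mismatch")
    return root


data_file = io.BytesIO()
root_page = b"root-page-v1"
data_file.write(root_page)
ckpt = Checkpoint(
    root=Address(0, len(root_page), zlib.crc32(root_page)),
    file_size=len(root_page),
)
data_file.write(b"writes made after the checkpoint")
restored_root = restore(data_file, ckpt)
```

Truncating to the recorded file size is enough here because everything the checkpoint refers to lies below that size.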

WAL(journal)

The journal records the operations actually performed since the previous checkpoint. It is synced from the cache to disk every 100 ms, or when the file size reaches 100 MB.
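The sync rule above (a time threshold or a size threshold, whichever comes first) can be sketched like this. The class `JournalBuffer` is hypothetical, timestamps are passed in explicitly to keep the example deterministic, and the count of buffered bytes is used as a simplified stand-in for the journal file size.

```python
class JournalBuffer:
    """Buffers journal records and decides when to sync them to disk."""

    def __init__(self, interval_ms=100, max_bytes=100 * 1024 * 1024):
        self.interval_ms = interval_ms
        self.max_bytes = max_bytes
        self.pending = []         # records not yet synced
        self.pending_bytes = 0
        self.synced = []          # stand-in for the on-disk journal
        self.last_sync_ms = 0

    def append(self, record, now_ms):
        """Buffer a record; return True if this call triggered a sync."""
        self.pending.append(record)
        self.pending_bytes += len(record)
        due = now_ms - self.last_sync_ms >= self.interval_ms
        full = self.pending_bytes >= self.max_bytes
        if due or full:
            self.sync(now_ms)
            return True
        return False

    def sync(self, now_ms):
        self.synced.extend(self.pending)
        self.pending.clear()
        self.pending_bytes = 0
        self.last_sync_ms = now_ms


journal = JournalBuffer(interval_ms=100, max_bytes=10)  # tiny size limit for the demo
first = journal.append(b"abc", now_ms=10)     # neither threshold reached
second = journal.append(b"def", now_ms=120)   # 100 ms have passed since the last sync
```

Records that are still pending when a crash happens would be lost, which is why the interval bounds how much recent work can be lost.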

Overall relationship

[Figure: overall relationship]

Distributed storage

Architecture

[Figure: architecture diagram]

[Figure: write data flow]

[Figure: read data flow]

Original site

Copyright notice
This article was written by [cui_ yonghua]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/188/202207071117313954.html