当前位置:网站首页>Inverted index of ES underlying principle
Inverted index of ES underlying principle
2022-07-07 12:13:00 【Talk about duoxiansen】
Catalog
One 、ElasticSearch Framework principle
1、ElasticSearch Node type of cluster
5、 Use disk cache to retrieve in real time
6、translog Provide disk control
3、 Simple example of inverted index
One 、ElasticSearch Framework principle
1、ElasticSearch Node type of cluster
Elasticsearch An example of is a node , A group of nodes form a cluster .Elasticsearch Nodes in the cluster can be configured in three different ways :
(1)Master node
Master Node control Elasticsearch colony , And be responsible for creating / Delete index , Track which nodes are part of the cluster , And assign partitions to these nodes .
(2)Data node
Data nodes are used to store data and inverted indexes .
(3)Client node
If you will node.master and node.data Set to false, Then configure the node as a client node , And act as a load balancer , Route incoming requests to different nodes in the cluster .
If you are connected to a node as a client , This node is called the coordination node (coordinating node). The coordination node routes the client request to the node of the corresponding partition in the cluster . For read requests , The coordination node selects different partitions each time to provide requests to balance the load .
(4) The storage model
Use a data structure called inverted index , Used to provide low latency search results
Be careful : The operation of indexing data will only occur in the primary partition (primary shard) On , It will not happen in the fragmented copy (Replica) On . If the node to which the request for index data is sent does not have an appropriate partition or the partition is a replica , Then the request will be forwarded to the node containing the primary partition .
2、 Immutability
The inverted index written to disk is immutable , It has the following advantages :
① No need to lock , Improve concurrency , Avoid lock problems
② The data remains the same , Keep it all the time os cache in , as long as cache Enough memory
③filter cache It's always in memory , Because the data doesn't change
④ Can be compressed , save cpu and io expenses
Disadvantage :
① Rebuild the entire index every time
3、 Write and create
1) Modulo operation is performed by the number of main partitions in the index , To determine which fragment the document should be indexed to . shard = hash(document_id) % (num_of_primary_shards)
2) When a node receives a request from the coordinating node , The request is written to translog, And add the document to the memory buffer .
If the request is successful on the main slice , The request will be sent in parallel to the replica shard . Only on all master and replica tiles translog By fsync’ed after , The client will receive the confirmation that the request is successful .
- The memory buffer is refreshed at fixed intervals ( The default is 1 second ), And write the content to a new segment in the file system cache . The content of this paragraph has not been fsync’ed( Not written to the file system ), Segments are open , Content can be used to search .
- translog Be emptied , And the file system cache every 30 Every minute fsync, Or when translog Do it once when it gets too big fsync. This process takes place in Elasticsearch called flush. During refresh , The memory buffer is cleared , The content is written to a new file segment (segment). When file segments are fsync’ed And refresh to disk , A new submission point will be created ( In fact, it will update the file offset , The file system does this automatically ). old translog Be deleted , A new start .

4、 Delete and update
Delete : Each submission point includes a .del file , Contains documents that have been deleted on the segment when a document is deleted , It is actually just .del The file is marked for deletion , It can also match queries , But it will be deleted from the result before it finally returns .
to update : The old version of the document is marked for deletion , The new version of the document is indexed in the new segment
5、 Use disk cache to retrieve in real time
1) The newly received data is written into the new index file , Generating inverted indexes is called an end --segment
2) Use one commit Document all in the index segment
3) New data enters memory buffer in
4) Memory buffer Make a new one segment, Brush into the file system cache ,ES You can detect new segment
5) The file system cache is really synchronized to disk ,commit File update
6、translog Provide disk control
To prevent loss ( Host error 、 Disk failure ). When ES Write data to buffer in , It also recorded translog journal
Two 、 Inverted index
1、 word ---- Document matrix
file 1 | file 2 | file 3 | file 4 | file 5 | file 6 | |
word 1 |
|
| ||||
word 2 |
|
| ||||
word 3 |
|
| ||||
word 4 |
|
| ||||
word 5 |
| |||||
word 6 |
|
2、 Inverted index
1)、 Word dictionary (Lexicon): The usual index unit of a search engine is the word , A word dictionary is a string collection of all the words that have appeared in a document collection , Each index entry in a word dictionary records some information about the word itself and points to “ Inverted list ” The pointer to .
2)、 Inverted list (PostingList): The inverted list records the document list of all documents where a word has appeared and the location information of the word in the document , Each record is called an inverted entry (Posting). According to the inverted list , You know which documents contain a particular word .

3、 Simple example of inverted index
Suppose the document collection contains four documents , Create an inverted index of this document collection .
Document number | Document content |
1 | The father of Google Maps job hopping Facebook |
2 | The father of Google Maps joined Facebook |
3 | Google Maps founder Lars left Google to join Facebook |
4 | Lars, the father of Google maps, joins social networking sites Facebook |
1)、 Get keywords — Word segmentation
Meaningless :“ Of ”,“ yes ”,“in”,“too”;
Punctuation 、 Space ;
Uniform case :“a”,“A”. Restore words lives-->live
2)、 Build an indexed list
Article location 、 Frequency of occurrence 、 Position of appearance
3)、 Application reason
Time complexity --1 second
word ID | word | Document frequency | Inverted list (DocID:TF:<Position>) |
1 | 4 | (1:1:<1>),(2:1:<1>),(3:2:<1><6>),(4:1:<1>) | |
2 | Map | 4 | (1:1:<2>),(2:1:<2>),(3:1:<2>),(4:1:<2>) |
3 | The father of | 3 | (1:1:<3>),(2:1:<3>),(4:1:<3>) |
4 | job-hopping | 1 | (1:1:<4>) |
5 | 4 | (1:1:<5>),(2:1:<5>),(3:1:<8>),(4:1:<8>) | |
6 | To join in | 2 | (2:1:<4>),(4:1:<5>) |
7 | founder | 1 | (3:1:<3>) |
8 | Lars | 2 | (3:1:<4>),(4:1:<4>) |
9 | Leave | 1 | (3;1:<5>) |
10 | social contact | 1 | (4:1:<6>) |
11 | Website | 1 | (4:1:<7>) |
With this index system , Search engine can easily respond to user's query , For example, users input query words “Facebook”, Search system search inverted index , You can read the document that contains the word , These documents are the search results provided to users , Using word frequency information 、 Document frequency information can sort these candidate search results , Calculate document and query similarity , Output according to the similarity score from high to low , This is part of the internal process of the search system .
边栏推荐
- Superscalar processor design yaoyongbin Chapter 10 instruction submission excerpt
- Let digital manage inventory
- 111. Network security penetration test - [privilege escalation 9] - [windows 2008 R2 kernel overflow privilege escalation]
- 千人規模互聯網公司研發效能成功之路
- C#中在路径前加@的作用
- Matlab implementation of Huffman coding and decoding with GUI interface
- Hi3516全系统类型烧录教程
- How to connect 5V serial port to 3.3V MCU serial port?
- CMU15445 (Fall 2019) 之 Project#2 - Hash Table 详解
- SwiftUI Swift 内功之 Swift 中使用不透明类型的 5 个技巧
猜你喜欢

Summed up 200 Classic machine learning interview questions (with reference answers)

Sonar:Cognitive Complexity认知复杂度

Solve the problem that vscode can only open two tabs

@Bean与@Component用在同一个类上,会怎么样?
![108.网络安全渗透测试—[权限提升篇6]—[Windows内核溢出提权]](/img/c0/8a7b52c46eadd27cf4784ab2f32002.png)
108.网络安全渗透测试—[权限提升篇6]—[Windows内核溢出提权]

College entrance examination composition, high-frequency mention of science and Technology

数据库系统原理与应用教程(010)—— 概念模型与数据模型练习题

总结了200道经典的机器学习面试题(附参考答案)

清华姚班程序员,网上征婚被骂?

Camera calibration (2): summary of monocular camera calibration
随机推荐
【滤波跟踪】基于matlab扩展卡尔曼滤波EKF和无迹卡尔曼滤波UKF比较【含Matlab源码 1933期】
即刻报名|飞桨黑客马拉松第三期盛夏登场,等你挑战
【最短路】ACwing 1127. 香甜的黄油(堆优化的dijsktra或spfa)
清华姚班程序员,网上征婚被骂?
The Oracle message permission under the local Navicat connection liunx is insufficient
[filter tracking] strapdown inertial navigation simulation based on MATLAB [including Matlab source code 1935]
Solve the problem that vscode can only open two tabs
Mise en œuvre du codage Huffman et du décodage avec interface graphique par MATLAB
@What happens if bean and @component are used on the same class?
DOM parsing XML error: content is not allowed in Prolog
小红书微服务框架及治理等云原生业务架构演进案例
数据库系统原理与应用教程(009)—— 概念模型与数据模型
College entrance examination composition, high-frequency mention of science and Technology
关于 Web Content-Security-Policy Directive 通过 meta 元素指定的一些测试用例
Fleet tutorial 14 basic introduction to listtile (tutorial includes source code)
跨域问题解决方案
千人规模互联网公司研发效能成功之路
Mastering the new functions of swiftui 4 weatherkit and swift charts
SwiftUI Swift 内功之 Swift 中使用不透明类型的 5 个技巧
[texture feature extraction] LBP image texture feature extraction based on MATLAB local binary mode [including Matlab source code 1931]