当前位置:网站首页>Inverted index of ES underlying principle

Inverted index of ES underlying principle

2022-07-07 12:13:00 Talk about duoxiansen

Catalog

One 、ElasticSearch Framework principle

1、ElasticSearch Node type of cluster

2、 Immutability

3、 Write and create

4、 Delete and update

5、 Use disk cache to retrieve in real time

6、translog Provide disk control

Two 、 Inverted index

1、 word ---- Document matrix

2、 Inverted index

3、 Simple example of inverted index


One 、ElasticSearch Framework principle

1、ElasticSearch Node type of cluster

Elasticsearch An example of is a node , A group of nodes form a cluster .Elasticsearch Nodes in the cluster can be configured in three different ways :

(1)Master node

Master Node control Elasticsearch colony , And be responsible for creating / Delete index , Track which nodes are part of the cluster , And assign partitions to these nodes .

(2)Data node

Data nodes are used to store data and inverted indexes .

(3)Client node

If you will node.master and node.data Set to false, Then configure the node as a client node , And act as a load balancer , Route incoming requests to different nodes in the cluster . 
If you are connected to a node as a client , This node is called the coordination node (coordinating node). The coordination node routes the client request to the node of the corresponding partition in the cluster . For read requests , The coordination node selects different partitions each time to provide requests to balance the load .

(4) The storage model

Use a data structure called inverted index , Used to provide low latency search results

Be careful : The operation of indexing data will only occur in the primary partition (primary shard) On , It will not happen in the fragmented copy (Replica) On . If the node to which the request for index data is sent does not have an appropriate partition or the partition is a replica , Then the request will be forwarded to the node containing the primary partition .

2、 Immutability

The inverted index written to disk is immutable , It has the following advantages :

① No need to lock , Improve concurrency , Avoid lock problems
② The data remains the same , Keep it all the time os cache in , as long as cache Enough memory
③filter cache It's always in memory , Because the data doesn't change
④ Can be compressed , save cpu and io expenses

Disadvantage :

① Rebuild the entire index every time

3、 Write and create

1) Modulo operation is performed by the number of main partitions in the index , To determine which fragment the document should be indexed to . 
shard = hash(document_id) % (num_of_primary_shards)

2) When a node receives a request from the coordinating node , The request is written to translog, And add the document to the memory buffer .

If the request is successful on the main slice , The request will be sent in parallel to the replica shard . Only on all master and replica tiles translog By fsync’ed after , The client will receive the confirmation that the request is successful .

  • The memory buffer is refreshed at fixed intervals ( The default is 1 second ), And write the content to a new segment in the file system cache . The content of this paragraph has not been fsync’ed( Not written to the file system ), Segments are open , Content can be used to search .
  • translog Be emptied , And the file system cache every 30 Every minute fsync, Or when translog Do it once when it gets too big fsync. This process takes place in Elasticsearch called flush. During refresh , The memory buffer is cleared , The content is written to a new file segment (segment). When file segments are fsync’ed And refresh to disk , A new submission point will be created ( In fact, it will update the file offset , The file system does this automatically ). old translog Be deleted , A new start .

4、 Delete and update

Delete : Each submission point includes a .del file , Contains documents that have been deleted on the segment when a document is deleted , It is actually just .del The file is marked for deletion , It can also match queries , But it will be deleted from the result before it finally returns .

to update : The old version of the document is marked for deletion , The new version of the document is indexed in the new segment

5、 Use disk cache to retrieve in real time

1) The newly received data is written into the new index file , Generating inverted indexes is called an end --segment

2) Use one commit Document all in the index segment

3) New data enters memory buffer in

4) Memory buffer Make a new one segment, Brush into the file system cache ,ES You can detect new segment

5) The file system cache is really synchronized to disk ,commit File update

6、translog Provide disk control

To prevent loss ( Host error 、 Disk failure ). When ES Write data to buffer in , It also recorded translog journal

Two 、 Inverted index

1、 word ---- Document matrix

file 1

file 2

file 3

file 4

file 5

file 6

word 1

  

  

word 2

  

  

word 3

  

  

word 4

  

  

word 5

  

word 6

  

2、 Inverted index

1)、 Word dictionary (Lexicon): The usual index unit of a search engine is the word , A word dictionary is a string collection of all the words that have appeared in a document collection , Each index entry in a word dictionary records some information about the word itself and points to “ Inverted list ” The pointer to .

2)、 Inverted list (PostingList): The inverted list records the document list of all documents where a word has appeared and the location information of the word in the document , Each record is called an inverted entry (Posting). According to the inverted list , You know which documents contain a particular word .

3、 Simple example of inverted index

Suppose the document collection contains four documents , Create an inverted index of this document collection .

Document number

Document content

1

The father of Google Maps job hopping Facebook

2

The father of Google Maps joined Facebook

3

Google Maps founder Lars left Google to join Facebook

4

Lars, the father of Google maps, joins social networking sites Facebook

1)、 Get keywords — Word segmentation

Meaningless :“ Of ”,“ yes ”,“in”,“too”;

Punctuation 、 Space ;

Uniform case :“a”,“A”. Restore words lives-->live

2)、 Build an indexed list

Article location 、 Frequency of occurrence 、 Position of appearance

3)、 Application reason

Time complexity --1 second

word ID

word

Document frequency

Inverted list (DocID:TF:<Position>)

1

Google

4

(1:1:<1>),(2:1:<1>),(3:2:<1><6>),(4:1:<1>)

2

Map

4

(1:1:<2>),(2:1:<2>),(3:1:<2>),(4:1:<2>)

3

The father of

3

(1:1:<3>),(2:1:<3>),(4:1:<3>)

4

job-hopping

1

(1:1:<4>)

5

Facebook

4

(1:1:<5>),(2:1:<5>),(3:1:<8>),(4:1:<8>)

6

To join in

2

(2:1:<4>),(4:1:<5>)

7

founder

1

(3:1:<3>)

8

Lars

2

(3:1:<4>),(4:1:<4>)

9

Leave

1

(3;1:<5>)

10

social contact

1

(4:1:<6>)

11

Website

1

(4:1:<7>)

With this index system , Search engine can easily respond to user's query , For example, users input query words “Facebook”, Search system search inverted index , You can read the document that contains the word , These documents are the search results provided to users , Using word frequency information 、 Document frequency information can sort these candidate search results , Calculate document and query similarity , Output according to the similarity score from high to low , This is part of the internal process of the search system .

原网站

版权声明
本文为[Talk about duoxiansen]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/188/202207071022233274.html