
Why can't Google search results be paged infinitely?

2022-06-11 01:06:00 Gegwu MMQ!!

This is an interesting question that few people notice.

When I searched Google for the keyword "MySQL", Google only provided 13 pages of results. I changed the paging parameter in the URL to request page 14, and the following error appeared:

[Figure: Google's error message when requesting page 14 of the results]

Baidu search does not provide unlimited paging either: for the keyword "MySQL", Baidu returns 76 pages of results.

[Figure: Baidu returns 76 pages of results for the keyword "MySQL"]

Why not support infinite paging
A search engine as powerful as Google doesn't support infinite paging. Why? There are only two possibilities:

It can't be done
It doesn't need to be done

"It can't be done" is out of the question, so the only remaining reason is "it doesn't need to be done".

First, when page 1 of the search results doesn't contain what we need, we usually change the keywords right away instead of turning to page 2, let alone page 10 or beyond. That is the first reason it's unnecessary: user demand is weak.

Second, infinite paging is very expensive for a search engine. That may sound strange: turning to page 2 and turning to page 1000 are both searches, so what's the difference?

In fact, a side effect of designing a search engine for high availability and high scalability is that unlimited paging cannot be implemented efficiently. "Not efficient" means it can still be done, but the cost is high. Every search engine faces this problem, known in the trade as "deep paging". This is the second reason it's unnecessary: the implementation cost is high.

Of course, I don't know how Google implements its search, so I'll use ES (Elasticsearch) as an example to explain why deep paging gives search engines a headache.

Why use ES as the example
Elasticsearch (hereinafter "ES") offers the same kind of functionality as Google and Baidu search, and it takes a similar approach to high availability and high scalability; the deep paging problem is precisely a consequence of these shared optimizations.

What is ES
ES is a full-text search engine.

So what exactly is a full-text search engine?

Imagine this scene: you happen to hear a song with a particularly beautiful melody, and it's still ringing in your ears after you get home, but all you can remember of the lyrics is a few words: "the edge of the umbrella". This is where a search engine comes in.

With a search engine you can get every result for the keywords "the edge of the umbrella". These results have a name: documents. The results come back sorted by how relevant each document is to the keywords. From this we can derive the definition of a full-text search engine:

A full-text search engine is a tool that finds relevant documents based on their content and returns the results in order of relevance.


After spending too long online, we gradually mistake the computer's abilities for our own; for example, we may assume our brains are good at this kind of search. Quite the opposite: we are terrible at full-text search.

For example, if I say "Quiet Night Thoughts", you may blurt out the whole poem: "Before my bed, the bright moonlight; I took it for frost on the ground. I raise my head to gaze at the moon, then lower it, thinking of home." But if I ask you to recite classical poems that contain the word "moon", you'll have to rack your brains.

The same goes for the books we read: the table of contents is a lookup structure that matches how our brains retrieve things. It lets us find a document through a coarse identifier such as the document ID or the document title. This structure is called a forward index.

A table of contents is a forward index
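To make this concrete, here is a tiny sketch in Python; the document IDs and contents are invented purely for illustration. A forward index answers "given this ID (or title), what is the document?".

```python
# A forward index: document ID -> document. You can only use it if you
# already know the identifier, just like using a table of contents.
# IDs and contents below are invented for illustration.
forward_index = {
    101: "Quiet Night Thoughts - Before my bed, the bright moonlight ...",
    102: "Spring Dawn - I slept in spring, unaware of dawn ...",
}

print(forward_index[101])   # fast lookup by ID
# But "which documents mention 'moon'?" would force us to scan every document.
```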

A full-text search engine works the other way around: it finds documents through their content. The feihualing game in the Chinese Poetry Conference, where players take turns reciting lines that contain a given word, is the human-brain version of a full-text search engine.

Feihualing is full-text search

The data structure that full-text search engines rely on is the famous inverted index ("inverted" because the structure is the reverse of our usual way of thinking). It is a concrete representation of which documents contain which terms.

The term-document matrix
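Here is a minimal sketch of an inverted index in Python, assuming a toy corpus and naive whitespace tokenization (a real engine uses analyzers, stemming, relevance scoring, and so on): each term maps to the set of documents that contain it, which is exactly the reverse of the forward index above.

```python
from collections import defaultdict

# Toy corpus: document ID -> content (invented for illustration).
docs = {
    1: "the moon over the quiet night",
    2: "frost on the ground under the moon",
    3: "a pot of wine among the flowers",
}

# Build the inverted index: term -> set of IDs of documents containing it.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():          # real engines use analyzers, not split()
        inverted_index[term].add(doc_id)

# "Which documents mention 'moon'?" is now a single lookup instead of a scan.
print(sorted(inverted_index["moon"]))  # -> [1, 2]
```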

Stop! We can't keep going down this path; let's wrap up the introduction to ES in one sentence!

ES is a full-text search engine that uses an inverted index to find relevant documents by their content and returns the results in order of relevance.

The secret of high availability: replicas (Replication)
High availability is a requirement that any enterprise-grade service must meet, and it inevitably involves clustering and distribution. Fortunately, ES supports cluster mode natively, so building a distributed system with it is very simple.

For the ES service to be highly available, the loss of any one node must not disrupt normal search. That means the data stored on the downed node must have a complete copy on the other nodes. This is the concept of a replica.

[Figure: Node1 as the primary node, with Node2 and Node3 as replica nodes holding the same data]

As shown in the figure above, Node1 is the primary node, while Node2 and Node3 act as replica nodes holding the same data as the primary. That way, the failure of any single node won't affect search for the business, which satisfies the high availability requirement.

But there is a fatal problem: the system cannot scale out! Adding another node does nothing for overall capacity, because every node stores a complete copy of all the document data.

So ES introduced the concept of shards.

The cornerstone of PB-scale data: shards (Shard)
ES splits each index (a collection of documents in ES, roughly equivalent to a table in MySQL) into several pieces, and these shards are distributed across the nodes as evenly as possible. For example, if a cluster has 3 nodes and the index is split into 5 shards, the distribution looks roughly like the figure below (exactly how they are balanced is up to ES).
[Figure: five shards distributed across three nodes]
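As a rough illustration, this is how such an index could be created over ES's REST API. It is a sketch that assumes an ES node is reachable at localhost:9200, and the index name "articles" is hypothetical.

```python
import requests

# Create a hypothetical "articles" index with 5 primary shards.
# The number of primary shards is fixed when the index is created.
resp = requests.put(
    "http://localhost:9200/articles",
    json={"settings": {"number_of_shards": 5, "number_of_replicas": 1}},
)
print(resp.status_code, resp.json())
```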

With shards, scaling the cluster horizontally becomes very simple. If we now add 2 more nodes to the cluster, ES will automatically rebalance the shards across all the nodes:

[Figure: shards automatically rebalanced after two nodes are added]

High availability + elastic scaling
Replicas and shards working together are what give ES its two great strengths today: high availability and support for PB-scale data volumes.

Again taking 3 nodes as an example, here is how ES arranges the shards across the nodes when the number of shards is 5 and the number of replicas is 1:
[Figure: distribution of primary and replica shards across three nodes]

One detail worth noting: in the example above, a primary shard and its corresponding replica shard never appear on the same node. As for why, you can think that one through yourself.
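If you want to see this layout for yourself, ES exposes it through its _cat API. A quick sketch, again assuming a local node and the hypothetical "articles" index from above:

```python
import requests

# List every shard copy of the "articles" index: one line per copy, with a
# "prirep" column marking p (primary) or r (replica) and the node it sits on.
# A replica is never allocated to the same node as its own primary.
resp = requests.get("http://localhost:9200/_cat/shards/articles?v")
print(resp.text)
```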

Distributed storage of documents
How does ES decide which shard a document should be stored on?

shard = hash(routing) % number_of_primary_shards

Through this mapping, ES spreads documents evenly across all the shards; routing defaults to the document id.
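A rough Python sketch of that routing rule follows. ES actually hashes with murmur3 and has a few more wrinkles, so this only shows the shape of the computation, with md5 as a stand-in hash.

```python
import hashlib

NUMBER_OF_PRIMARY_SHARDS = 5   # fixed when the index is created

def route(routing_value: str) -> int:
    """Pick the primary shard: hash(routing) % number_of_primary_shards.

    md5 is used here only as a stand-in hash; ES itself uses murmur3.
    """
    digest = hashlib.md5(routing_value.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUMBER_OF_PRIMARY_SHARDS

# routing defaults to the document id.
for doc_id in ["1", "2", "3", "4"]:
    print("document", doc_id, "-> shard", route(doc_id))
```

This is also why the number of primary shards cannot be changed after the index is created: with a different shard count, the same routing value would map to a different shard and existing documents could no longer be found.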

In addition, a replica shard keeps its contents in sync with its primary shard. The point of replica shards is to spread the query load and to stand by so that if the primary shard, which could go down at any moment, actually does, a replica can be promoted to become the new primary.

That covers the basics; at last we can get to searching.

ES's search mechanism
A picture is worth a thousand words:

[Figure: the distributed search flow in ES]

When a client submits a keyword search, ES uses a load-balancing policy to pick one node as the coordinating node (Coordinating Node) to accept the request; let's assume Node3 is chosen.
Node3 selects 5 shards out of the 10 primary and replica shards (together they must cover all of the data, with no shard counted twice) and sends each a search request.
The 5 selected shards execute the query, sort their own results, and return them to Node3.
Node3 merges the results from the 5 shards, sorts them again, extracts the result set for the requested page, and returns it to the client.

Note: strictly speaking, an ES search runs in two phases, Query and Fetch. In the Query phase each shard returns only document IDs and sort values; in the Fetch phase the coordinating node goes back to the corresponding shards to retrieve the full documents by ID. The description above simplifies this, so please bear that in mind.
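For reference, here is roughly what a paginated search looks like at the REST level: a sketch assuming a local node and a hypothetical "articles" index with a "content" field; from and size are ES's standard pagination parameters.

```python
import requests

# Ask for the second page (10 hits per page) of results for "MySQL".
# "articles" and the "content" field are hypothetical names for this sketch.
resp = requests.post(
    "http://localhost:9200/articles/_search",
    json={
        "query": {"match": {"content": "MySQL"}},
        "from": 10,   # skip the first 10 hits
        "size": 10,   # return the next 10
    },
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```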

Now consider what happens when the client asks for documents 990 to 1000: how does ES return correct results when the data is spread across shards?

To get documents 990 to 1000, ES has to fetch the top 1000 documents from every shard, aggregate all of them on the coordinating node, sort them again by relevance, and finally pick out the 10 documents ranked 990 to 1000.
[Figure: deep paging, with every shard returning its top 1000 hits to the coordinating node]
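To see where the cost goes, here is a small simulation of the coordinating node's work with made-up relevance scores: every shard has to hand over its own top from + size hits, and the coordinator sorts all of them just to keep ten.

```python
import heapq
import random

FROM, SIZE = 990, 10    # "page 100" with 10 results per page
NUM_SHARDS = 5

def shard_top_hits(shard_id: int, n: int):
    """Each shard finds and sorts its own top n hits.

    The (score, doc_id) pairs are generated randomly here as stand-ins for
    real relevance scores computed against the query.
    """
    hits = ((random.random(), f"shard{shard_id}-doc{i}") for i in range(100_000))
    return heapq.nlargest(n, hits)

# Every shard must return FROM + SIZE = 1000 candidates ...
per_shard = [shard_top_hits(s, FROM + SIZE) for s in range(NUM_SHARDS)]

# ... and the coordinating node merges 5 * 1000 = 5000 of them, sorts them
# again, and keeps only the ten ranked 990-999. The deeper the page, the more
# work and memory this step needs on every node involved.
merged = sorted((hit for hits in per_shard for hit in hits), reverse=True)
page = merged[FROM:FROM + SIZE]
print(f"merged {len(merged)} candidates to return {len(page)} documents")
```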

The deeper the page, the more documents every node has to process, the more memory is consumed, and the longer the query takes. That is why search engine vendors generally don't offer deep paging: there's no reason to burn performance on a feature users barely want.

Copyright notice: this article was written by Gegwu MMQ!!. Please include a link to the original when reposting: https://yzsam.com/2022/162/202206102355114521.html