
Why can't Google Search paginate infinitely?

2022-06-11 22:50:00 Cicada bathes in the wind


This is an interesting but rarely noticed question.

When I used Google to search for the keyword MySQL, Google returned only 13 pages of results. I changed the paging parameter in the URL to request page 14 and got the following error message:

[Figure: Google cannot paginate indefinitely]

Baidu does not provide unlimited paging either: for the keyword MySQL, Baidu returns 76 pages of results.

[Figure: Baidu cannot paginate indefinitely]

Why infinite paging isn't supported

Why doesn't a search engine as powerful as Google support infinite paging? There are only two possibilities:

  • It can't be done
  • It isn't necessary

"It can't be done" is out of the question; the only real reason is "it isn't necessary".

First, when the first page of results doesn't contain what we need, we usually change the keywords immediately rather than turn to page 2, let alone page 10 or beyond. This is the first reason it is unnecessary: user demand is weak.

Second, infinite paging is very expensive for a search engine. You may wonder: turning to page 2 and turning to page 1000 are both searches, so what's the difference?

In fact, a side effect of designing a search engine for high availability and high scalability is that unlimited paging cannot be implemented efficiently. "Not efficiently" means it can be done, but at a high cost. Every search engine faces this problem, known professionally as "deep paging". This is the second reason it is unnecessary: the implementation cost is high.

Of course, I don't know how Google implements its search, so next I'll use ES (Elasticsearch) as an example to explain why deep paging gives search engines such a headache.

Why use ES as the example

Elasticsearch (hereafter ES) provides the same kind of functionality as Google and Baidu search, and it achieves high availability and high scalability in similar ways. The deep paging problem is caused precisely by these shared optimizations.

What is ES

ES is a full-text search engine.

So what exactly is a full-text search engine?

Imagine this: you happen to hear a song with a particularly beautiful melody, and the tune is still playing in your head after you get home, but all you can remember is a snippet of the lyrics: "the edge of the umbrella". This is where a search engine comes in.

With a search engine you get every result matching the keyword "the edge of the umbrella". There is a term for these results: documents. The results come back sorted by the relevance between each document and the keyword. That gives us the definition of a full-text search engine:

A full-text search engine is a tool that finds relevant documents by their content and returns search results in order of relevance.


After spending too long on the internet, we gradually mistake the computer's abilities for our own. For example, we may think our brains are good at this kind of search. On the contrary, we are bad at full-text search.

For example, if I say "Quiet Night Thoughts", you may blurt out: "Before my bed, a pool of moonlight; I take it for frost on the ground. I raise my head to gaze at the bright moon, then lower it, thinking of my hometown." But if I ask you to recite lines of classical poetry that contain the word "moon", you will have to rack your brains.

The same goes for the books we read: a table of contents is a lookup structure that fits the way our brains retrieve information. We find a document by its ID or by a general label such as its title. This structure is called a forward index.

[Figure: a table of contents is a forward index]

A full-text search engine is the exact opposite: it finds documents by their content. The "Feihua Ling" (flying flower) game in the Chinese Poetry Conference is the human-brain version of a full-text search engine.


The data structure a full-text search engine relies on is the well-known inverted index ("inverted" because the structure is the reverse of our usual way of thinking). It is a concrete realization of the "which documents contain which words" relationship.

[Figure: word-document matrix]

Stop! No more digressions. Let's finish introducing ES in one sentence!

ES is a full-text search engine that uses an inverted index to find relevant documents by their content and returns search results in order of relevance.
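To make this concrete, here is a toy Python sketch of an inverted index with a naive relevance ranking. This is nothing like ES's real implementation; the documents, tokenization, and scoring below are all made up for illustration.

```python
from collections import defaultdict

# A toy corpus: document id -> content
docs = {
    1: "the edge of the umbrella",
    2: "raise my head to gaze at the moon",
    3: "the moon over the edge of the sea",
}

# Build the inverted index: word -> set of document ids that contain it
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        inverted_index[word].add(doc_id)

def search(query):
    """Return document ids ranked by a naive relevance score:
    the number of query words each document contains."""
    scores = defaultdict(int)
    for word in query.split():
        for doc_id in inverted_index.get(word, ()):
            scores[doc_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

print(search("edge of the umbrella"))  # doc 1 ranks above doc 3
```

The key point is the direction of the lookup: the index goes from content (words) to documents, the opposite of a table of contents.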

The secret of high availability: replicas (Replication)

High availability is a requirement every enterprise-grade service must consider, and it inevitably involves clustering and distribution. Fortunately, ES supports cluster mode natively, which makes building a distributed system very simple.

High availability for ES means that if one node goes down, the search service must not be affected. In other words, the data stored on the failed node must have a complete backup on other nodes. This is the concept of a replica.

[Figure: replicas]

As shown in the figure above, Node1 is the master node, while Node2 and Node3 are replica nodes holding the same data as the master. This way, the failure of any single node does not affect search, which satisfies the high availability requirement.

But there is a fatal problem: the system cannot scale! Even if you add another node, it does nothing for the system's overall capacity, because every node stores a complete copy of all the document data.

Therefore, ES introduces the concept of shards.

The cornerstone of PB-scale data: shards (Shard)

ES splits each index (a collection of documents in ES, roughly equivalent to a table in MySQL) into several shards and distributes the shards across nodes as evenly as possible. For example, if a cluster has 3 nodes and an index is split into 5 shards, the distribution looks roughly like the figure below (exactly how to balance them is up to ES).

[Figure: shards]

This makes horizontal scaling of the cluster very simple. If we add 2 more nodes to the cluster, ES automatically rebalances the shards across all nodes:

[Figure: horizontal scaling]

High availability + elastic scaling

Replicas and shards work together to give ES two of its biggest advantages: high availability and support for PB-scale data volumes.

Taking 3 nodes as an example, here is how ES arranges the shards across nodes when the number of shards is 5 and the number of replicas is 1:

[Figure: distribution of primary and replica shards]

One thing to note: in the example above, a primary shard and its corresponding replica shard never appear on the same node. As for why, you can think it through yourself.
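As a sketch of how these settings are expressed, the request below creates an index with the same layout as the example: 5 primary shards, each with 1 replica. The index name articles and the localhost:9200 address are assumptions for illustration; adjust them to your own cluster.

```python
import requests

# Create an index with 5 primary shards and 1 replica per primary,
# matching the example above (assumes ES is reachable at localhost:9200).
resp = requests.put(
    "http://localhost:9200/articles",
    json={
        "settings": {
            "number_of_shards": 5,    # primary shards, fixed at index creation
            "number_of_replicas": 1,  # replicas per primary, can be changed later
        }
    },
)
print(resp.json())
```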

Distributed storage of documents

How does ES decide which shard a document should be stored on?

It uses a simple mapping formula: shard = hash(routing) % number_of_primary_shards, where routing defaults to the document id. With this formula, ES spreads document data evenly across the shards.
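Here is a toy illustration of that routing rule in Python. ES actually uses a murmur3 hash rather than Python's built-in hash, so this only sketches the idea:

```python
# Toy version of: shard = hash(routing) % number_of_primary_shards
NUMBER_OF_PRIMARY_SHARDS = 5

def route(doc_id: str) -> int:
    """Return the primary shard a document with this id lands on."""
    return hash(doc_id) % NUMBER_OF_PRIMARY_SHARDS

for doc_id in ["1", "2", "42", "mysql-tutorial"]:
    print(doc_id, "-> shard", route(doc_id))
```

This formula is also why the number of primary shards is fixed when an index is created: changing it would route existing documents to different shards.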

In addition, replica shards are kept in sync from their primary shard. Replica shards exist for load balancing and for taking over as the new primary if the primary shard, which could go down at any time, fails.

That covers the basics; now we can finally search.

How ES searches

A picture is worth a thousand words :

[Figure: ES search flow]

  1. When a client issues a keyword search, ES uses a load-balancing policy to pick one node as the coordinating node (Coordinating Node) to accept the request; let's assume Node3 is chosen;
  2. Node3 randomly selects 5 shards out of the 10 primary and replica shards (together the selected shards must cover all the data, with no data counted twice) and sends each of them the search request;
  3. The 5 selected shards execute the query, sort their own results, and return them to Node3;
  4. Node3 merges the results returned by the 5 shards, sorts them again, extracts the result set for the requested page, and returns it to the client.

Note: ES search is actually split into two phases, the Query phase and the Fetch phase. In the Query phase each shard returns only document ids and sort values; in the Fetch phase the coordinating node retrieves the document details from the corresponding shards by id. The figure and text above simplify this, so keep that in mind.
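As a minimal sketch of the Query-phase merge on the coordinating node, assume each shard has already returned its locally sorted (score, doc id) pairs; the names and scores below are made up:

```python
import heapq

# Per-shard results from the Query phase: top hits as (score, doc_id),
# already sorted by descending score within each shard.
shard_results = {
    "shard-0": [(9.1, "d17"), (7.4, "d03"), (6.9, "d88")],
    "shard-1": [(8.5, "d42"), (8.2, "d10"), (5.0, "d55")],
    "shard-2": [(9.8, "d71"), (6.1, "d29"), (4.4, "d64")],
}

def merge_top_hits(results, size):
    """Merge the per-shard hit lists and keep the global top `size` by score."""
    all_hits = [hit for hits in results.values() for hit in hits]
    return heapq.nlargest(size, all_hits, key=lambda hit: hit[0])

top = merge_top_hits(shard_results, size=5)
print(top)  # In the Fetch phase, only these doc ids are fetched from their shards.
```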

Now consider a client requesting documents 990 to 1000. How does ES return correct results when the data is spread across shards?

To return documents 990 to 1000, every shard must produce its own top 1000 documents. The coordinating node then aggregates the results from all shards, sorts them by relevance, and finally picks the 10 documents ranked 990 to 1000.
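As a sketch, this is what such a request looks like using the standard from/size parameters (the articles index and localhost address are assumptions), along with the rough arithmetic of what the coordinating node must merge:

```python
import requests

# A deep-paging request: documents 990..1000, i.e. page 100 at 10 results per page.
query = {
    "from": 990,
    "size": 10,
    "query": {"match": {"content": "MySQL"}},
}
resp = requests.post("http://localhost:9200/articles/_search", json=query)

# Rough cost: with 5 shards queried, each shard must return its top (from + size)
# hits, so the coordinator sorts about 5 * (990 + 10) = 5000 candidates
# just to hand back 10 documents.
shards, page_start, page_size = 5, 990, 10
print("candidates merged:", shards * (page_start + page_size))  # 5000
```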

[Figure: deep paging]

The deeper the page, the more documents each node has to process, the more memory it uses, and the longer it takes. This is why search engine vendors generally do not offer deep paging: there is no reason to burn performance on a feature users barely demand.


Official account: Cicada Bathes in the Wind. Follow me for more.

End .
