100,000 tokenized searches per second, and the product manager has raised yet another requirement!!! (Worth bookmarking)
2022-07-07 05:57:00 【58 Shen Jian】
An unreasonable requirement: how do you pull it off gracefully?
This article is long; bookmarking it first is recommended.
Perhaps 99% of readers have never built a search engine, yet 99% have implemented some kind of retrieval feature. What technologies do search and retrieval actually involve? I hope this article offers some inspiration.
Requirement 1: I want a whole-web search engine. Nothing complicated, just like Baidu. Can it go live in two months?
What do the architecture and workflow of a whole-web search engine look like?
The macro architecture of a whole-web search engine is shown in the figure above. The core subsystems fall into three parts (the pink parts):
(1) spider: the crawler system;
(2) search&index: the index-building and index-querying system, itself in two parts:
- build_index, which generates the index data
- search_index, which queries the index data
(3) rank: the scoring and ranking system;
The core data falls into two parts (the purple parts):
(1) web: the web page repository;
(2) index: the index data;
The business characteristics of a whole-web search engine dictate a system in which "writing" and "retrieval" are separated.
How is writing implemented?
System composition: handled by the spider and search&index systems.
Input: web pages produced by site owners.
Output: forward and inverted index data.
Workflow: steps 1, 2, 3, 4 in the architecture diagram:
(1) the spider crawls pages from the internet;
(2) the spider stores the pages in the web repository (this requires enormous storage, close to a mirror of the entire "web");
(3) build_index reads data from the web repository and performs word segmentation;
(4) build_index generates the inverted index;
How is retrieval implemented?
System composition: handled by the search&index and rank systems.
Input: the user's search terms.
Output: the first page of ranked search results.
Workflow: steps a, b, c, d in the architecture diagram:
(a) search_index receives the user's search terms and performs word segmentation;
(b) search_index queries the inverted index to obtain the "literally matching" pages, which form the preliminary candidate set;
(c) rank scores and sorts the preliminary candidates;
(d) rank returns the first page of sorted results;
What do the architecture and workflow of a site search engine look like?
After all, only a handful of companies build whole-web search; what most companies need is merely site search. Taking the search over 58.com's 10 billion posts as an example, its overall architecture is as follows:
The macro architecture of the site search engine is shown in the figure above. Compared with the whole-web engine, the only difference lies on the write path:
(1) whole-web search must passively crawl its data with a spider;
(2) site search data is generated by internal systems; for example, the "publishing system" actively pushes newly created posts to the build_data system;
Aside: the difference looks "tiny", yet the architecture is far less difficult to build. It is extremely hard for whole-web search to discover the "full set" of pages "in real time", whereas site search naturally has all of its data, in real time.
Regarding the three systems spider, search&index and rank:
(1) spider and search&index are engineering systems;
(2) rank is the system tied most closely to the business, to strategy and to algorithms. It is where search experiences truly differ, and optimizing business and strategy takes time to accumulate. The takeaways:
- Google's experience beats Baidu's because the former's rank is just that good
- for a domestic internet company (360, for example), surpassing Baidu's search experience in a short time is very hard; it genuinely takes years of accumulation
The content so far has been very macro. For the majority of readers who have never built a search engine, the data structures and algorithms below start gently, from the forward index and the inverted index.
What is a forward index?
In short, querying an entity by key uses the forward index.
For example, a user table:
t_user(uid, name, passwd, age, sex)
Querying the whole row by uid is a forward index query.
Another example, the web page repository:
t_web_page(url, page_content)
Querying the whole page by url is also a forward index query.
After the page content is segmented, page_content corresponds to a list of tokens, list<item>.
Put simply, the forward index can be understood as:
Map<url, list<item>>
a data structure that quickly finds page content from the page's url.
Aside: the time complexity can be considered O(1).
What is an inverted index?
The opposite of the forward index: querying keys by item uses the inverted index.
For web search, the inverted index can be understood as:
Map<item, list<url>>
a data structure that quickly finds, from a query token, the pages containing that token.
Aside: the time complexity is likewise O(1).
For example, suppose there are 3 pages:
url1 -> "I love Beijing"
url2 -> "I love home"
url3 -> "Home is sweet"
This is the forward index:
Map<url, page_content>.
After segmentation:
url1 -> {I, love, Beijing}
url2 -> {I, love, home}
url3 -> {home, sweet}
This is the forward index after segmentation:
Map<url, list<item>>.
The inverted index after segmentation:
I -> {url1, url2}
love -> {url1, url2}
Beijing -> {url1}
home -> {url2, url3}
sweet -> {url3}
The Map<item, list<url>> that quickly finds, from a token item, the pages containing that token is the inverted index.
Aside: got it, the mapping from word to url is the inverted index.
The forward index and the inverted index are data structures that the spider and build_index systems build ahead of time. Why these two structures? Because they make the "user searches the web" requirement fast to serve.
Aside: business requirements determine the architecture implementation, and both lookups are fast.
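To make the two structures concrete, here is a minimal sketch of what build_index might do, assuming a toy whitespace tokenize() in place of a real Chinese word segmenter:

```python
from collections import defaultdict

def tokenize(text):
    return text.split()  # real systems use a word segmentation library

def build_indexes(pages):
    forward, inverted = {}, defaultdict(list)
    for url, content in pages.items():
        items = tokenize(content)
        forward[url] = items                # Map<url, list<item>>
        for item in set(items):
            inverted[item].append(url)      # Map<item, list<url>>
    return forward, inverted

pages = {
    "url1": "I love Beijing",
    "url2": "I love home",
    "url3": "home sweet",
}
forward, inverted = build_indexes(pages)
print(inverted["love"])  # ['url1', 'url2']
```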
What does the retrieval process look like?
Suppose the query is "I love":
(1) segmentation: "I love" segments into {I, love}, with time complexity O(1);
(2) for each token item, fetch from the inverted index the list<url> of pages containing that item, also O(1):
I -> {url1, url2}
love -> {url1, url2}
(3) intersect the list<url> sets; the result is the set of pages matching every query token. In this example, {url1, url2} is the final result;
Aside: the retrieval process really is that simple: segment, look up the inverted index, intersect the result sets.
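A minimal sketch of that read path, reusing tokenize() and the inverted index from the sketch above; Python set intersection stands in for the postings-list intersection schemes discussed next:

```python
def search(query, inverted):
    items = tokenize(query)                                     # (1) segment
    postings = [set(inverted.get(it, [])) for it in items]      # (2) look up
    return set.intersection(*postings) if postings else set()   # (3) intersect

print(search("I love", inverted))  # {'url1', 'url2'}
```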
Is that the end of it? Not quite. Segmentation and inverted lookups are both O(1), so the overall time complexity of a search hinges on "intersecting the list<url> sets", that is, on computing the intersection of two collections.
String urls are awkward to store and compute with. In practice each url is assigned a numeric url_id as its tag, so for convenience list<url> is written uniformly as list<url_id> from here on.
Given list1 and list2, how do we compute their intersection?
Scheme 1: for * for, the brute-force way, O(n*n) time
Every search token hits a large number of pages, so O(n*n) complexity is clearly unacceptable. However, the inverted index can be sorted as preprocessing the moment it is built, which turns the problem into intersecting two sorted lists, a far more tractable task.
Aside: the dumb way.
Scheme 2: the zipper method over sorted lists
Sorted set 1: {1,3,5,7,8,9}
Sorted set 2: {2,3,4,5,6,7}
Two pointers start at the first elements and compare values:
(1) if they are equal, add the value to the result set and advance either pointer;
(2) otherwise, advance the pointer at the smaller value, until either list is exhausted;
The advantages of this method:
(1) every element is compared at most once, so the time complexity is O(n);
(2) it works over several sorted sets at once, which suits intersecting the url_id lists of multiple tokens;
The two pointers sweeping past the elements on either side resemble the teeth of a zipper closing, hence the name "zipper method";
Aside: the inverted index is sorted ahead of time when it is initialized, so the "sorted" property comes for free.
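A minimal sketch of the zipper method over two sorted url_id lists:

```python
def zipper_intersection(a, b):
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:        # equal: keep the value, move both pointers
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:       # otherwise move the smaller-valued pointer
            i += 1
        else:
            j += 1
    return out                  # each element compared at most once: O(n)

print(zipper_intersection([1, 3, 5, 7, 8, 9], [2, 3, 4, 5, 6, 7]))  # [3, 5, 7]
```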
Scheme 3: bucketed parallelism
When the data volume is large, splitting horizontally by url_id range and computing in parallel is a common optimization: divide list1<url_id> and list2<url_id> into bucket ranges, intersect each bucket on its own thread, and take the union of the per-thread result sets as the final result. This can cut execution time dramatically.
For example:
Sorted set 1: {1,3,5,7,8,9, 10,30,50,70,80,90}
Sorted set 2: {2,3,4,5,6,7, 20,30,40,50,60,70}
To intersect them, first split into buckets:
bucket 1 covers the range [1, 9]
bucket 2 covers the range [10, 100]
bucket 3 covers the range [101, max_int]
Thus:
set 1 splits into
set a {1,3,5,7,8,9}
set b {10,30,50,70,80,90}
set c {}
set 2 splits into
set d {2,3,4,5,6,7}
set e {20,30,40,50,60,70}
set f {}
The data volume within each bucket drops sharply, and no element repeats across buckets, so the buckets can be processed in parallel with multiple threads:
within bucket 1, the intersection of set a and set d is x {3,5,7}
within bucket 2, the intersection of set b and set e is y {30,50,70}
within bucket 3, the intersection of set c and set f is z {}
Finally, the intersection of set 1 and set 2 is the union of x, y and z, i.e. the set {3,5,7,30,50,70}.
Aside: multithreading and horizontal splitting are both common optimizations.
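A minimal sketch of bucketed parallel intersection, reusing zipper_intersection() from the previous sketch; the bucket boundaries match the example above. (Note that in CPython the GIL limits true CPU parallelism; a production engine would run this across native threads or processes.)

```python
from concurrent.futures import ThreadPoolExecutor
import bisect

BUCKETS = [(1, 9), (10, 100), (101, 2**31 - 1)]

def slice_bucket(sorted_ids, lo, hi):
    # Binary-search the sub-range [lo, hi] out of one sorted list.
    return sorted_ids[bisect.bisect_left(sorted_ids, lo):
                      bisect.bisect_right(sorted_ids, hi)]

def bucketed_intersection(list1, list2):
    def one_bucket(bounds):
        lo, hi = bounds
        return zipper_intersection(slice_bucket(list1, lo, hi),
                                   slice_bucket(list2, lo, hi))
    with ThreadPoolExecutor(max_workers=len(BUCKETS)) as pool:
        parts = list(pool.map(one_bucket, BUCKETS))
    return [x for part in parts for x in part]  # union of per-bucket results

set1 = [1, 3, 5, 7, 8, 9, 10, 30, 50, 70, 80, 90]
set2 = [2, 3, 4, 5, 6, 7, 20, 30, 40, 50, 60, 70]
print(bucketed_intersection(set1, set2))  # [3, 5, 7, 30, 50, 70]
```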
Scheme 4: a further bitmap optimization
After the horizontal split, the data within each bucket is confined to a fixed range. If a set fits this property, it can be represented as a bitmap:
As in the figure above, suppose all elements of set1 {1,3,5,7,8,9} and set2 {2,3,4,5,6,7} fall within the bucket range [1, 16]. Then 16 bits can describe either set: for each element x of the original set, bit x of the 16-bit bitmap is set to 1. Intersecting the two bitmaps is now a single bitwise AND; in the result bitmap, bits 3, 5 and 7 are 1, meaning the intersection of the original sets is {3,5,7}.
With horizontal bucketing plus the bitmap optimization, intersection becomes far more efficient, though the time complexity is still O(n). The cost is that a bitmap needs a large block of contiguous space and consumes a lot of memory.
Aside: a bitmap can represent a set, and intersecting sets as bitmaps is extremely fast.
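A minimal sketch, using Python's arbitrary-width integers as bitmaps for the in-bucket sets:

```python
def to_bitmap(ids, base):
    bm = 0
    for x in ids:
        bm |= 1 << (x - base)      # element x -> bit (x - base)
    return bm

def from_bitmap(bm, base):
    out, bit = [], 0
    while bm:
        if bm & 1:
            out.append(base + bit)
        bm >>= 1
        bit += 1
    return out

set1, set2 = [1, 3, 5, 7, 8, 9], [2, 3, 4, 5, 6, 7]
inter = to_bitmap(set1, base=1) & to_bitmap(set2, base=1)  # one bitwise AND
print(from_bitmap(inter, base=1))  # [3, 5, 7]
```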
Scheme 5: skip lists (skiplist)
For intersecting sorted lists, the skip list is the most commonly used data structure; it brings the complexity of intersection down from O(n) toward O(log(n)).
set 1: {1,2,3,4,20,21,22,23,50,60,70}
set 2: {50,70}
With the plain zipper method, 1,2,3,4,20,21,22,23 would each be traversed once to no effect: every element costs a comparison, keeping the time at O(n). Could each comparison "skip over some elements" instead?
Enter the skip list:
For set 1 {1,2,3,4,20,21,22,23,50,60,70}, the skip list's upper level keeps only three elements, {1,20,50}; the lower level is the ordinary linked list.
Set 2 {50,70} has few elements, so only an ordinary single-level list is built.
Now, while "zipping" through the intersection, set1's pointer can jump from 1 to 20 to 50, leaping over the elements in between without comparing them one by one. The time complexity of the intersection approaches O(log(n)); this is a standard algorithm in search engines.
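A minimal sketch of the idea using skip pointers over a sorted list (a flat stand-in for a full two-level skip list): roughly every sqrt(n)-th position gets a pointer a few elements ahead, letting the zipper leap over runs that cannot possibly match.

```python
import math

def build_skips(postings):
    # Every sqrt(n)-th position points `step` elements ahead (a common heuristic).
    step = int(math.sqrt(len(postings))) or 1
    return {i: i + step for i in range(0, len(postings), step)
            if i + step < len(postings)}

def intersect_with_skips(a, b):
    sa, sb = build_skips(a), build_skips(b)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            if i in sa and a[sa[i]] <= b[j]:
                i = sa[i]          # leap: the skip target cannot overshoot b[j]
            else:
                i += 1             # no useful skip, plain zipper step
        else:
            if j in sb and b[sb[j]] <= a[i]:
                j = sb[j]
            else:
                j += 1
    return out

print(intersect_with_skips([1, 2, 3, 4, 20, 21, 22, 23, 50, 60, 70], [50, 70]))
# [50, 70], crossing the 1..23 prefix in a few leaps instead of 8 comparisons
```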
A brief summary:
(1) a whole-web search engine consists of the spider, search&index and rank subsystems;
(2) a site search engine differs from a whole-web one in that it drops the spider subsystem;
(3) spider and search&index are engineering systems, while the rank system takes long stretches of tuning and accumulation to optimize;
(4) the forward index (forward index) answers: given a page's url_id, quickly find the segmented page content list<item>;
(5) the inverted index (inverted index) answers: given a token item, quickly find the list<url_id> of pages containing it;
(6) a user search runs as: segment the query, fetch the list<url_id> for each item, then intersect the sets;
(7) methods for intersecting sorted sets:
- double for loop, O(n*n) time
- zipper method, O(n) time
- horizontal bucketing, multithreaded in parallel
- bitmap, which massively parallelizes the computation, O(n) time
- skip list, O(log(n)) time
Requirement 2: I want a content retrieval feature. Nothing complicated: 10 billion records, 100,000 queries per second. Can it go live in two weeks?
Most engineers may never have touched a "search kernel", but nearly every internet business involves a "retrieval" feature. Take 58.com's post business as an example: post titles and bodies carry strong user search demand. As the business, the traffic and the concurrency grow stage by stage, how should the retrieval requirement be implemented?
The primitive stage: LIKE
In the startup phase, this is how retrieval is often implemented.
The data might be stored in the database like this:
t_tiezi(tid, title, content)
The title and content retrieval requirement can then be met with LIKE:
select tid from t_tiezi where content like '%Tiantongyuan%'
This approach genuinely meets the business need quickly, but the problems are just as obvious:
(1) low efficiency: every query is a full table scan with a heavy compute cost, and under high concurrency the CPU easily hits 100%;
(2) word segmentation is not supported;
The beginner stage: full-text index
How can we raise efficiency quickly, support segmentation, and disturb the existing system architecture as little as possible? The first idea is to build a full-text index:
alter table t_tiezi add fulltext(title,content)
and to use MATCH ... AGAINST to query the indexed fields.
A full-text index quickly satisfies the business's segmentation requirement and quickly improves performance (inverting after segmentation at least avoids full table scans), but it comes with its own problems:
(1) it only applies to MyISAM;
(2) because the full-text index piggybacks on database features, search traffic is coupled with ordinary CRUD traffic inside the database: heavy search concurrency can hurt CRUD requests, and heavy CRUD concurrency makes retrieval very slow;
(3) at the scale of millions of rows, performance still degrades markedly, with query latencies the business can hardly accept;
(4) horizontal scaling is difficult;
The intermediate stage: open-source external index
To escape the limits of database full-text search, once the data grows to millions or tens of millions of rows, an external index should be considered. The core idea of an external index is to separate the index data from the original data: the former serves search, the latter serves CRUD, and a chosen mechanism (double write, notification, or periodic rebuild) keeps the two consistent.
The original data can stay in MySQL. How is the external index implemented? Solr, Lucene and ES are the common open-source options, and of these, ES (ElasticSearch) is the most popular.
Lucene is good; what are its potential drawbacks?
(1) Lucene is only a library. You must build the service yourself and implement high availability, scalability, load balancing and other complex features on your own;
(2) Lucene only supports Java. To support other languages, again you must build the service yourself;
(3) Lucene is unfriendly, and that can be fatal: it is very complex, and users often need deep search knowledge just to understand how it works. To shield that complexity, once more you must build the service yourself.
To remedy Lucene's shortcomings, the answer each time is "wrap it in a friendly service that masks the underlying complexity", and hence ES:
(1) ES is a service exposing a RESTful interface over a Lucene search kernel;
(2) ES supports massive data storage and highly concurrent search requests;
(3) ES supports clustering, shielding users from high availability, scalability, load balancing and the other complex features;
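As an illustration of the external-index write and read paths, here is a minimal sketch against ES, assuming the 8.x elasticsearch-py client; the index name "tiezi" and its fields are invented for this example:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Write path: the publishing system pushes a new post into the index
# (double write / notification keeps it consistent with MySQL).
es.index(index="tiezi", id="123", document={
    "title": "Two-bedroom in Tiantongyuan",
    "content": "bright rooms, close to the subway",
})

# Read path: a match query; tokenization is handled inside ES.
resp = es.search(index="tiezi", query={"match": {"content": "Tiantongyuan"}})
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```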
At present, Fast Dog Taxi uses ES as its core search service to serve the business's various search requirements, among which:
(1) the requirement with the largest data volume, "interface latency data collection", holds roughly 1 billion records;
(2) the requirement with the highest concurrency, "longitude/latitude location search", averages about 2,000 concurrent queries online and about 8,000 in stress tests.
So ES comfortably covers the common search workload of 1 billion records at 5k throughput.
The advanced stage: a self-developed search engine
When the data volume grows further, reaching billions or tens of billions of records, and the concurrency climbs to 100,000 queries per second, with business-specific needs steadily accumulating, it becomes time to develop a search engine in-house and customize the search kernel.
At the stage of a customized, self-developed engine, massive data and high concurrency are the focus of the design. To approach "unlimited capacity, unlimited concurrency", the architecture must be designed for "scalability", striving for: add machines, gain capacity (data volume plus concurrency).
The early architecture of 58.com's self-developed search engine, E-search, looks like this:
(1) the upper layer, proxy (pink), is the access cluster, the external entry point that accepts search requests; being stateless, adding machines scales the proxy cluster's performance;
(2) the middle layer, merger (light blue), is the logic cluster, mainly responsible for merging search results and ranking them; the business-specific rank logic is implemented at this level, and being stateless, adding machines likewise scales the merger cluster's performance;
(3) the bottom layer, searcher (the large dark-red box), is the search cluster, where the service and the index data are deployed on the same machine; the index data is loaded into memory at service startup and requests are served from memory, so access is extremely fast:
- for scalability of data capacity, the index data is split horizontally; adding shards scales the data volume without bound (in the figure above, searcher is split into 4 groups)
- for scalability of a single shard's performance, the same data is held redundantly; in theory, adding machines scales performance without bound (in the figure above, each shard carries 2 extra replicas)
Designed this way, adding machines carries more data and serves higher concurrency.
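To make the proxy/merger/searcher division concrete, here is a minimal scatter-gather sketch of the merger layer; search_one_shard() and rank() are hypothetical stand-ins for the searcher RPC and the business scoring logic:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4  # the index is split into 4 groups, as in the figure

def search_one_shard(shard_id, query):
    """Hypothetical: pick a healthy replica of this shard and RPC it."""
    ...

def rank(doc):
    """Hypothetical: business-specific relevance score for one doc."""
    ...

def merger_search(query, top_n=10):
    # Scatter: query all shards in parallel (each holds a slice of the data).
    with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
        parts = list(pool.map(lambda s: search_one_shard(s, query),
                              range(NUM_SHARDS)))
    # Gather: merge the partial results, score them, return the first page.
    candidates = [doc for part in parts if part for doc in part]
    return sorted(candidates, key=rank, reverse=True)[:top_n]
```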
A brief summary:
To meet search needs as data volume and concurrency grow, search architectures generally pass through these stages:
(1) the primitive stage: LIKE;
(2) the beginner stage: full-text index;
(3) the intermediate stage: open-source external index;
(4) the advanced stage: self-developed search engine;
Requirement 3: retrieval freshness matters enormously for user experience. We must be able to retrieve news from 5 minutes ago and a post published 1 second ago. That's not complicated, is it?
One final, advanced topic: real-time search.
Why can Baidu retrieve, in real time, news that appeared 5 minutes ago? Why can 58.com retrieve, in real time, a post published 1 second ago?
What are the key points in the architecture of a real-time search engine?
To keep a search engine real-time under massive data and high concurrency, the architecture hinges on two key points:
(1) index tiering;
(2) dump & merge;
First, when the data volume is enormous, to keep inverted-index retrieval efficient, the index is never modified in real time when the data is updated.
Aside: because once fragmentation creeps in, retrieval efficiency drops sharply.
If the index data cannot be modified in real time, how are the newest pages guaranteed to be searchable?
Index tiering: split the index into a full index, a daily incremental index and an hourly incremental index.
As shown above:
(1) the 30 billion records live in the full index;
(2) the 10 million records modified within the day live in the day index;
(3) the 500 thousand records modified within the hour live in the hour index.
When a modification request arrives, only the lowest tier is touched, e.g. the hour index.
When a query request arrives, all tiers are queried at once and the results are merged, yielding the freshest data:
(1) the full index is compactly stored, fragment-free and fast;
(2) the day index is compactly stored and fast;
(3) the hour index holds little data, so it is extremely fast.
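A minimal sketch of that read path, where query_index() is a hypothetical helper searching a single tier; newer tiers override older ones for the same document:

```python
def query_index(tier, query):
    """Hypothetical: search one tier's inverted index, yielding (doc_id, doc)."""
    ...

def tiered_search(query):
    results = {}
    for tier in ("full", "day", "hour"):    # oldest tier first
        for doc_id, doc in query_index(tier, query):
            results[doc_id] = doc           # hour beats day beats full
    return list(results.values())
```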
Tiered indexes guarantee freshness, but then a new question arrives: when does the hour index's data get reflected into the day index, and the day index's into the full index?
dump & merge, the exporting and merging of indexes, is carried out by two asynchronous tools:
dumper: exports the online data.
merger: merges the offline data into the next higher tier of the index.
The hour index merges into the day index once an hour;
the day index merges into the full index once a day;
this keeps the data volumes of the hour index and the day index from ever growing particularly large;
if the data volume and concurrency are greater still, a week index and a month index can be added as further buffers.
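A minimal sketch of the dumper/merger cycle, with all helpers left as hypothetical stubs, since the real work (compact offline rebuilds, atomic swap-in) is engine-specific; note that nothing modifies an index in place:

```python
def dump_index(tier): ...        # hypothetical: export a tier's postings
def merge_postings(a, b): ...    # hypothetical: merge two sorted postings sets
def rebuild(tier, data): ...     # hypothetical: atomically swap in a new index
def truncate(tier): ...          # hypothetical: reset a tier after merging up

def merge_tier(src, dst):
    delta = dump_index(src)                               # dumper exports
    rebuild(dst, merge_postings(dump_index(dst), delta))  # merger rebuilds, fragment-free
    truncate(src)                                         # src starts over, stays small

# Schedule: merge_tier("hour", "day") once an hour,
#           merge_tier("day", "full") once a day.
```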
A brief summary:
Under massive data and high concurrency, the architecture of a real-time search engine has two key points:
(1) index tiering;
(2) dump & merge;
As for the requirements around "search" and "retrieval": satisfied?
The Road to the Architect: sharing technical thinking.
Related reading:
《RPC Kernel Details You Must Know (Worth Bookmarking)》
《Whose Encryption Key Is Hard-Coded in the Source?》
Poll: Have you ever received an outrageous requirement?