
Analysis of high-frequency interview questions on massive data processing

2022-06-11 09:24:00 Menon research monk

Preface

Hardware expansion alone can hardly keep up with the needs of massive data processing, so the real question is how to process massive amounts of information with the resources at hand.
Massive-data processing has increasingly become a highlight of written tests and interviews for programmers.
The main reference book is 《Java Programmer Interview Book》.

1. Basic methods

This summary draws on knowledge points collected from various online sources and from related books.

Common methods include hashing, bitmaps, Bloom filters, database optimization, inverted indexes, external sorting, Trie trees, heaps, the double-bucket method, and MapReduce.

1.1 Hashing

Hashing establishes a mapping: for a data element whose keyword is key, the hash function hash(key) computes its storage address, and operations on the data are then performed at that address.

A collision (conflict) occurs when two different keywords produce the same hash value, that is, they are mapped to the same address.

To reduce collisions, the hash function must be chosen carefully: it should be simple to compute, its values should fall within the address range of the table, and it should spread keywords as evenly as possible.

This material is also covered in data structure textbooks.

  • Commonly used methods for constructing hash functions: direct addressing (a linear function of the keyword), the modulo (division) method, digit analysis, the mid-square method, etc.
  • Common collision resolution methods: open addressing (on a collision, probe for another slot in some systematic way), separate chaining (the chain address method), etc.
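
As a concrete illustration, here is a minimal Java sketch (class and method names are my own, not from the book) of the modulo method combined with the chain address method: colliding keys simply share one bucket's linked list.

```java
import java.util.LinkedList;

/** Minimal sketch: modulo hash function + chained (linked-list) conflict resolution. */
public class ChainedHashSet {
    private final LinkedList<Integer>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedHashSet(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    // Modulo method: hash(key) = key mod table size, kept non-negative so it stays in range.
    private int hash(int key) {
        return (key % buckets.length + buckets.length) % buckets.length;
    }

    // Chain address method: keys that collide are appended to the same bucket's list.
    public void add(int key) {
        int slot = hash(key);
        if (!buckets[slot].contains(key)) {
            buckets[slot].add(key);
        }
    }

    public boolean contains(int key) {
        return buckets[hash(key)].contains(key);
    }

    public static void main(String[] args) {
        ChainedHashSet set = new ChainedHashSet(16);
        set.add(42);
        set.add(58);                          // 58 % 16 == 42 % 16 == 10: collides and chains
        System.out.println(set.contains(58)); // true
    }
}
```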

1.2 Bitmap method

A bitmap uses an array of bits to indicate whether certain elements exist (it can be used for duplicate checking or to test whether a given value is present).
It supports fast lookup, de-duplication, and deletion over massive data sets.

Concretely, allocate N bits; the bit for a value that appears is set to 1, and the bits for values that do not appear stay 0. (This trades space for time: sorting by a linear scan of the bits takes O(n) time.)
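
A minimal Java sketch of this idea, assuming the values fall in a known range [0, maxValue): each value owns one bit in a long[] array, and scanning the bits from low to high yields the distinct values in sorted order. (Class and method names are illustrative only.)

```java
/** Minimal bitmap sketch: one bit per possible value, marking presence. */
public class Bitmap {
    private final long[] words;

    public Bitmap(int maxValue) {
        words = new long[(maxValue >>> 6) + 1]; // 64 values per long
    }

    public void set(int value) {
        words[value >>> 6] |= 1L << (value & 63);
    }

    public boolean get(int value) {
        return (words[value >>> 6] & (1L << (value & 63))) != 0;
    }

    public static void main(String[] args) {
        Bitmap bits = new Bitmap(100);
        for (int x : new int[]{42, 7, 90, 7}) bits.set(x);
        // Scanning the bits from low to high yields the distinct values in sorted order.
        for (int v = 0; v < 100; v++) {
            if (bits.get(v)) System.out.print(v + " "); // prints: 7 42 90
        }
    }
}
```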


1.3 Bloom Filter

A Bloom filter tests whether an element belongs to a set, sacrificing some accuracy in exchange for space and time efficiency.

The basic principle combines a bit array with hash functions: an array of m bits (initialized to 0) and k different hash functions, each of which maps an element to one position in the bit array. To insert an element, compute its k hash values and set the corresponding k bits to 1. To query whether an element is in the set, check those k bits: if all k are 1, the element is probably present; if any of them is 0, the element is definitely absent. Because insertions of other elements may have set some of the same bits to 1, false positives are possible.
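
The following is a minimal Java sketch of that principle. The double-hashing trick used to derive the k positions is a common implementation choice, not something prescribed by the book.

```java
import java.util.BitSet;

/** Minimal Bloom filter sketch: an m-bit array plus k hash functions (illustrative only). */
public class BloomFilter {
    private final BitSet bits;
    private final int m;   // number of bits
    private final int k;   // number of hash functions

    public BloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive the i-th position from two base hashes (double hashing).
    private int position(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | 1;           // force an odd stride
        return Math.floorMod(h1 + i * h2, m);
    }

    public void add(String item) {
        for (int i = 0; i < k; i++) bits.set(position(item, i));
    }

    // false -> definitely not in the set; true -> probably in the set (false positives possible).
    public boolean mightContain(String item) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(item, i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilter filter = new BloomFilter(1 << 20, 5);
        filter.add("user@example.com");
        System.out.println(filter.mightContain("user@example.com"));  // true
        System.out.println(filter.mightContain("other@example.com")); // usually false
    }
}
```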


1.4 Database optimization

Options include database tooling, data partitioning, indexes, caching mechanisms, query optimization, sorting, and so on.

1.5 Inverted index

An inverted index is built by indexing documents according to their keywords.

Under full-text search, it stores a mapping from each word to its locations in a document or a set of documents; it is the most commonly used data structure in document retrieval systems.
Inverted indexes come in two forms:
The first, a record-level inverted index (or inverted file index), lists for each word the documents that contain it;
The second, a word-level inverted index (or full inverted index), additionally records the positions of each word within each document.
The second form supports more functionality (for example, phrase search), but it costs more time and space to build.

In principle a term-document matrix could store the same information, but it wastes a great deal of space, since most entries would be empty.
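
As a small illustration, here is a hedged Java sketch of a record-level inverted index (the first form above); the class name and sample documents are invented for the example.

```java
import java.util.*;

/** Minimal sketch of a record-level inverted index: word -> set of document ids. */
public class InvertedIndex {
    private final Map<String, Set<Integer>> index = new HashMap<>();

    public void addDocument(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            index.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
        }
    }

    public Set<Integer> lookup(String word) {
        return index.getOrDefault(word.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addDocument(0, "it is what it is");
        idx.addDocument(1, "what is it");
        idx.addDocument(2, "it is a banana");
        System.out.println(idx.lookup("what"));   // [0, 1]
        System.out.println(idx.lookup("banana")); // [2]
    }
}
```

A word-level index would store positions alongside each document id, which is what enables phrase queries.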

1.6 External sorting

Definition: when there are too many objects to process in memory at once, the data must live in external files, and sorting proceeds in stages through memory (the file is too large to load in one go, so data is exchanged between disk and memory many times).

Merge sort is typically used, for example two-way (or multi-way) merging of sorted runs.

It is suitable for sorting and de-duplicating big data, but it is relatively slow because it relies heavily on disk I/O.
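
The sketch below illustrates only the merge step, assuming the sorted runs are already available; in a real external sort each run would be a sorted chunk on disk read through a buffered stream rather than an in-memory list.

```java
import java.util.*;

/** Sketch of the merge step of external sorting: k sorted runs merged with a min-heap. */
public class KWayMerge {
    public static List<Integer> merge(List<List<Integer>> runs) {
        // Heap entries: {value, runIndex, positionInRun}, ordered by value.
        PriorityQueue<int[]> heap = new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[0]));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) heap.add(new int[]{runs.get(r).get(0), r, 0});
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            out.add(top[0]);
            int r = top[1], next = top[2] + 1;
            if (next < runs.get(r).size()) {        // advance within the run the value came from
                heap.add(new int[]{runs.get(r).get(next), r, next});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> runs = Arrays.asList(
                Arrays.asList(1, 4, 9),
                Arrays.asList(2, 3, 8),
                Arrays.asList(5, 6, 7));
        System.out.println(merge(runs)); // [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```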

1.7 Trie (dictionary tree)

A trie exploits the common prefixes of strings to reduce query time (trading space for time); it is a multi-way tree structure for fast string retrieval.

It is used to count and sort large numbers of strings (though not only strings), so search engine systems often use it for word-frequency statistics over text.

Its advantage is that it minimizes unnecessary string comparisons, so query efficiency is often higher than with a hash table.

A Trie tree generally has three basic properties:

  • The root node contains no character; every other node contains exactly one character.
  • Concatenating the characters along the path from the root to a node gives the string that node represents.
  • The children of any node all contain different characters.
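
A minimal Java sketch of a trie restricted to lowercase words, with a per-node counter so it doubles as a word-frequency table (an illustrative toy, not the book's code):

```java
/** Minimal trie sketch for lowercase words: insert and word-frequency lookup. */
public class Trie {
    private final Trie[] children = new Trie[26];
    private int count; // how many times a word ends at this node

    public void insert(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) node.children[i] = new Trie();
            node = node.children[i];
        }
        node.count++;
    }

    public int frequency(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) return 0;
            node = node.children[i];
        }
        return node.count;
    }

    public static void main(String[] args) {
        Trie trie = new Trie();
        for (String w : new String[]{"the", "they", "the", "trie"}) trie.insert(w);
        System.out.println(trie.frequency("the"));  // 2
        System.out.println(trie.frequency("them")); // 0
    }
}
```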

2. Analysis of classical problems

2.1 Top-K problem

Find the k most frequent items, the k largest numbers, and so on. A good general approach is divide and conquer + Trie tree/hash map + min-heap.

  1. Hash-partition the data into multiple small data sets (if there are many duplicate numbers, de-duplication can shrink the data considerably).
  2. Use a Trie tree or a hash map to count the frequency of each query in every small data set.
  3. Use a min-heap to find the K most frequent items within each data set.
  4. Finally, merge all the per-set top K results to obtain the overall top K.

In practice there may be enough memory to load all the data and process it in one pass, and a multi-core machine allows the data set to be processed with multiple threads.
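
Step 3 above, keeping only the current top K with a min-heap, can be sketched in Java as follows; the frequency map is assumed to have been produced by step 2.

```java
import java.util.*;

/** Sketch of the min-heap step for top-K: keep a heap of size K while scanning frequencies. */
public class TopK {
    public static List<Map.Entry<String, Integer>> topK(Map<String, Integer> freq, int k) {
        // Min-heap ordered by count: the root is the weakest of the current top K.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> a.getValue() - b.getValue());
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) heap.poll(); // evict the smallest, keeping only K entries
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort((a, b) -> b.getValue() - a.getValue());
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = Map.of("apple", 9, "banana", 3, "cherry", 7, "date", 1);
        System.out.println(topK(freq, 2)); // [apple=9, cherry=7]
    }
}
```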

2.2 Duplicate detection problems

This kind of problem can be solved with the bitmap method, which is generally the best choice.
The book shows concrete code for this (reproduced there as screenshots, not included here).
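Since that code is not reproduced here, below is a minimal Java sketch of bitmap-based duplicate checking (not the book's listing), assuming non-negative integers within a known range.

```java
import java.util.BitSet;

/** Minimal sketch of bitmap-based duplicate checking over a stream of integers. */
public class DuplicateCheck {
    public static void main(String[] args) {
        int maxValue = 10_000_000;           // assumed value range [0, maxValue)
        BitSet seen = new BitSet(maxValue);  // one bit per possible value

        int[] stream = {15, 42, 99, 42, 7, 15};
        for (int x : stream) {
            if (seen.get(x)) {
                System.out.println(x + " has appeared before");
            } else {
                seen.set(x);
            }
        }
        // prints: "42 has appeared before" and "15 has appeared before"
    }
}
```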

2.3 Sorting problems

  • Database sorting (requires a well-designed database schema)
  • Divide and conquer (reduces memory use, but the code is more complex and the process is slower)
  • Bitmap method

Copyright notice
This article was written by [Menon research monk]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/03/202203012301379606.html