
Analysis of high-frequency interview questions on massive data processing

2022-06-11 09:24:00 Menon research monk

Preface

Hardware expansion alone can hardly keep up with the needs of massive data processing, so the real question is how to process massive amounts of information with the resources at hand.
Massive-data processing has increasingly become a highlight of written tests and interviews for programmers.
The main reference book is 《Java Programmer Interview Book》.

1. Basic methods

This summary draws on knowledge points collected from various online sources and from related books.

Common methods include hashing, bitmaps, Bloom filters, database optimization, inverted indexes, external sorting, Trie trees, heaps, the double-bucket method, and MapReduce.

1.1 Hashing

Hashing establishes a mapping: for a data element whose keyword is key, the hash function hash(key) computes its storage address, and operations on the data are then performed at that address.

A collision (conflict) occurs when two different keywords produce the same hash value, that is, they are mapped to the same address.

To reduce collisions, the hash function must be chosen carefully: it should be simple to compute, its values should fall within the address range of the table, and it should spread keywords as evenly as possible.

This material is also covered in data structure textbooks.

  • Commonly used methods for constructing hash functions: direct addressing (a linear function of the keyword), the modulo (division) method, digit analysis, the mid-square method, etc.
  • Common collision resolution methods: open addressing (on a collision, probe for another slot in some systematic way), separate chaining (the chain address method), etc.
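
As a concrete illustration, here is a minimal Java sketch (class and method names are my own, not from the book) of the modulo method combined with the chain address method: colliding keys simply share one bucket's linked list.

```java
import java.util.LinkedList;

/** Minimal sketch: modulo hash function + chained (linked-list) conflict resolution. */
public class ChainedHashSet {
    private final LinkedList<Integer>[] buckets;

    @SuppressWarnings("unchecked")
    public ChainedHashSet(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    // Modulo method: hash(key) = key mod table size, kept non-negative so it stays in range.
    private int hash(int key) {
        return (key % buckets.length + buckets.length) % buckets.length;
    }

    // Chain address method: keys that collide are appended to the same bucket's list.
    public void add(int key) {
        int slot = hash(key);
        if (!buckets[slot].contains(key)) {
            buckets[slot].add(key);
        }
    }

    public boolean contains(int key) {
        return buckets[hash(key)].contains(key);
    }

    public static void main(String[] args) {
        ChainedHashSet set = new ChainedHashSet(16);
        set.add(42);
        set.add(58);                          // 58 % 16 == 42 % 16 == 10: collides and chains
        System.out.println(set.contains(58)); // true
    }
}
```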

1.2 Bitmap method

A bitmap uses an array of bits to indicate whether certain elements exist (it can be used for duplicate checking or to test whether a given value is present).
It supports fast lookup, de-duplication, and deletion over massive data sets.

Concretely, allocate N bits; the bit for a value that appears is set to 1, and the bits for values that do not appear stay 0. (This trades space for time: sorting by a linear scan of the bits takes O(n) time.)
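
A minimal Java sketch of this idea, assuming the values fall in a known range [0, maxValue): each value owns one bit in a long[] array, and scanning the bits from low to high yields the distinct values in sorted order. (Class and method names are illustrative only.)

```java
/** Minimal bitmap sketch: one bit per possible value, marking presence. */
public class Bitmap {
    private final long[] words;

    public Bitmap(int maxValue) {
        words = new long[(maxValue >>> 6) + 1]; // 64 values per long
    }

    public void set(int value) {
        words[value >>> 6] |= 1L << (value & 63);
    }

    public boolean get(int value) {
        return (words[value >>> 6] & (1L << (value & 63))) != 0;
    }

    public static void main(String[] args) {
        Bitmap bits = new Bitmap(100);
        for (int x : new int[]{42, 7, 90, 7}) bits.set(x);
        // Scanning the bits from low to high yields the distinct values in sorted order.
        for (int v = 0; v < 100; v++) {
            if (bits.get(v)) System.out.print(v + " "); // prints: 7 42 90
        }
    }
}
```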


1.3 Bloom Filter

A Bloom filter tests whether an element belongs to a set, sacrificing some accuracy in exchange for space and time efficiency.

The basic principle combines a bit array with hash functions: an array of m bits (initialized to 0) and k different hash functions, each of which maps an element to one position in the bit array. To insert an element, compute its k hash values and set the corresponding k bits to 1. To query whether an element is in the set, check those k bits: if all k are 1, the element is probably present; if any of them is 0, the element is definitely absent. Because insertions of other elements may have set some of the same bits to 1, false positives are possible.
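
The following is a minimal Java sketch of that principle. The double-hashing trick used to derive the k positions is a common implementation choice, not something prescribed by the book.

```java
import java.util.BitSet;

/** Minimal Bloom filter sketch: an m-bit array plus k hash functions (illustrative only). */
public class BloomFilter {
    private final BitSet bits;
    private final int m;   // number of bits
    private final int k;   // number of hash functions

    public BloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive the i-th position from two base hashes (double hashing).
    private int position(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | 1;           // force an odd stride
        return Math.floorMod(h1 + i * h2, m);
    }

    public void add(String item) {
        for (int i = 0; i < k; i++) bits.set(position(item, i));
    }

    // false -> definitely not in the set; true -> probably in the set (false positives possible).
    public boolean mightContain(String item) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(item, i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilter filter = new BloomFilter(1 << 20, 5);
        filter.add("user@example.com");
        System.out.println(filter.mightContain("user@example.com"));  // true
        System.out.println(filter.mightContain("other@example.com")); // usually false
    }
}
```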


1.4 Database optimization

Options include database tooling, data partitioning, indexes, caching mechanisms, query optimization, sorting, and so on.

1.5 Inverted index

An inverted index is built by indexing documents according to their keywords.

Under full-text search, it stores a mapping from each word to its locations in a document or a set of documents; it is the most commonly used data structure in document retrieval systems.
Inverted indexes come in two forms:
The first, a record-level inverted index (or inverted file index), lists for each word the documents that contain it;
The second, a word-level inverted index (or full inverted index), additionally records the positions of each word within each document.
The second form supports more functionality (for example, phrase search), but it costs more time and space to build.

In principle a term-document matrix could store the same information, but it wastes a great deal of space, since most entries would be empty.
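
As a small illustration, here is a hedged Java sketch of a record-level inverted index (the first form above); the class name and sample documents are invented for the example.

```java
import java.util.*;

/** Minimal sketch of a record-level inverted index: word -> set of document ids. */
public class InvertedIndex {
    private final Map<String, Set<Integer>> index = new HashMap<>();

    public void addDocument(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            index.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
        }
    }

    public Set<Integer> lookup(String word) {
        return index.getOrDefault(word.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addDocument(0, "it is what it is");
        idx.addDocument(1, "what is it");
        idx.addDocument(2, "it is a banana");
        System.out.println(idx.lookup("what"));   // [0, 1]
        System.out.println(idx.lookup("banana")); // [2]
    }
}
```

A word-level index would store positions alongside each document id, which is what enables phrase queries.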

1.6 External sorting

Definition: when there are too many objects to process in memory at once, the data must live in external files, and sorting proceeds in stages through memory (the file is too large to load in one go, so data is exchanged between disk and memory many times).

Merge sort is typically used, for example two-way (or multi-way) merging of sorted runs.

It is suitable for sorting and de-duplicating big data, but it is relatively slow because it relies heavily on disk I/O.
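
The sketch below illustrates only the merge step, assuming the sorted runs are already available; in a real external sort each run would be a sorted chunk on disk read through a buffered stream rather than an in-memory list.

```java
import java.util.*;

/** Sketch of the merge step of external sorting: k sorted runs merged with a min-heap. */
public class KWayMerge {
    public static List<Integer> merge(List<List<Integer>> runs) {
        // Heap entries: {value, runIndex, positionInRun}, ordered by value.
        PriorityQueue<int[]> heap = new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[0]));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) heap.add(new int[]{runs.get(r).get(0), r, 0});
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            out.add(top[0]);
            int r = top[1], next = top[2] + 1;
            if (next < runs.get(r).size()) {        // advance within the run the value came from
                heap.add(new int[]{runs.get(r).get(next), r, next});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> runs = Arrays.asList(
                Arrays.asList(1, 4, 9),
                Arrays.asList(2, 3, 8),
                Arrays.asList(5, 6, 7));
        System.out.println(merge(runs)); // [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```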

1.7 Trie (dictionary tree)

A trie exploits the common prefixes of strings to reduce query time (trading space for time); it is a multi-way tree structure for fast string retrieval.

It is used to count and sort large numbers of strings (though not only strings), so search engine systems often use it for word-frequency statistics over text.

Its advantage is that it minimizes unnecessary string comparisons, so query efficiency is often higher than with a hash table.

A Trie tree generally has three basic properties:

  • The root node contains no character; every other node contains exactly one character.
  • Concatenating the characters along the path from the root to a node gives the string that node represents.
  • The children of any node all contain different characters.
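
A minimal Java sketch of a trie restricted to lowercase words, with a per-node counter so it doubles as a word-frequency table (an illustrative toy, not the book's code):

```java
/** Minimal trie sketch for lowercase words: insert and word-frequency lookup. */
public class Trie {
    private final Trie[] children = new Trie[26];
    private int count; // how many times a word ends at this node

    public void insert(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) node.children[i] = new Trie();
            node = node.children[i];
        }
        node.count++;
    }

    public int frequency(String word) {
        Trie node = this;
        for (char c : word.toCharArray()) {
            int i = c - 'a';
            if (node.children[i] == null) return 0;
            node = node.children[i];
        }
        return node.count;
    }

    public static void main(String[] args) {
        Trie trie = new Trie();
        for (String w : new String[]{"the", "they", "the", "trie"}) trie.insert(w);
        System.out.println(trie.frequency("the"));  // 2
        System.out.println(trie.frequency("them")); // 0
    }
}
```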

2. Analysis of classical problems

2.1 Top-K problem

Find the k most frequent items, the k largest numbers, and so on. A good general approach is divide and conquer + Trie tree/hash map + min-heap.

  1. Hash-partition the data into multiple small data sets (if there are many duplicate numbers, de-duplication can shrink the data considerably).
  2. Use a Trie tree or a hash map to count the frequency of each query in every small data set.
  3. Use a min-heap to find the K most frequent items within each data set.
  4. Finally, merge all the per-set top K results to obtain the overall top K.

In practice there may be enough memory to load all the data and process it in one pass, and a multi-core machine allows the data set to be processed with multiple threads.
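
Step 3 above, keeping only the current top K with a min-heap, can be sketched in Java as follows; the frequency map is assumed to have been produced by step 2.

```java
import java.util.*;

/** Sketch of the min-heap step for top-K: keep a heap of size K while scanning frequencies. */
public class TopK {
    public static List<Map.Entry<String, Integer>> topK(Map<String, Integer> freq, int k) {
        // Min-heap ordered by count: the root is the weakest of the current top K.
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>((a, b) -> a.getValue() - b.getValue());
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) heap.poll(); // evict the smallest, keeping only K entries
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort((a, b) -> b.getValue() - a.getValue());
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = Map.of("apple", 9, "banana", 3, "cherry", 7, "date", 1);
        System.out.println(topK(freq, 2)); // [apple=9, cherry=7]
    }
}
```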

2.2 Duplicate detection problems

This kind of problem can be solved with the bitmap method, which is generally the best choice.
The book shows concrete code for this (reproduced there as screenshots, not included here).
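Since that code is not reproduced here, below is a minimal Java sketch of bitmap-based duplicate checking (not the book's listing), assuming non-negative integers within a known range.

```java
import java.util.BitSet;

/** Minimal sketch of bitmap-based duplicate checking over a stream of integers. */
public class DuplicateCheck {
    public static void main(String[] args) {
        int maxValue = 10_000_000;           // assumed value range [0, maxValue)
        BitSet seen = new BitSet(maxValue);  // one bit per possible value

        int[] stream = {15, 42, 99, 42, 7, 15};
        for (int x : stream) {
            if (seen.get(x)) {
                System.out.println(x + " has appeared before");
            } else {
                seen.set(x);
            }
        }
        // prints: "42 has appeared before" and "15 has appeared before"
    }
}
```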

2.3 Sorting problems

  • Database sorting (requires a well-designed database schema)
  • Divide and conquer (reduces memory use, but the code is more complex and the process is slower)
  • Bitmap method

Copyright notice
This article was written by [Menon research monk]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/03/202203012301379606.html