当前位置:网站首页>Massive data TOPK problem
Massive data TOPK problem
2022-07-27 22:11:00 【benym】
# Huge amounts of data TopK problem
In large-scale data processing , We often encounter such problems : Find the frequency of occurrence in massive data / Top with the largest value K Number
This article mainly provides the basic solutions to this kind of problem
Suppose such a scenario , The more you read a question , It shows that the more valuable this problem is , The more it should be pushed to users
Suppose the amount of data is 1 Billion , take Top100
- The easiest way to think of is to do all the data Sort , But if the amount of data is too large , This is obviously unacceptable .
- The second method is Partial elimination method , This method is similar to the sorting method , Before you save it in a container K Number , Then all the remaining numbers —— Compared to the smallest number in the container , If all subsequent elements are larger than the ones in the container K The number is still small , So this in the container K The number is the largest K Number . If a subsequent element is larger than the smallest number in the container , Delete the smallest element in the container , And insert the element into the container , Finally, I'll go through this 1 Million number , The number saved in the result container is the final result . The time complexity is O(n+m^2), among m Is the size of the container , namely K.
- The third way is Divide and conquer method , take 1 100 million data are divided into 100 Share , each 100 Ten thousand data , Find the largest of every piece of data 100 individual ( That is TopK), Finally, in the rest of 100*100 Find the biggest one in the data 100 individual . If 100 Ten thousand data selection is ideal enough , Then you can filter out 1 In 100 million data 99% The data of .100 Look for the largest among 10000 data 100 The way to get data is as follows : Using quick sort , Divide the data into 2 Pile up , If there's a big pile N Greater than 100 individual , Continue to quick sort the pile into 2 Pile up , If there's a big pile N Greater than 100 individual , Continue to quick sort the pile into 2 Pile up , If there's a lot of numbers N Less than 100 individual , It's a quick sort in the small pile , Find the first 100-n Big numbers ; Recursion of the above process , You can find number one 100 Large number . Refer to the above to find out 100 Big numbers , You can find it in a similar way 100 Big numbers .
- The fourth way is Hash Law . If this 1 There are many repeated numbers in 100 million numbers , Through the first Hash Law , Put this 1 Billion numbers to repeat , So if the repetition rate is very high , It reduces a lot of memory , So as to reduce the computing space , Then find the largest by divide and conquer or minimum heap method 100 Number .
- The fifth method adopts The smallest pile . First read before 100 Number to create a size of 100 The smallest pile , And then iterate through the following numbers , And on the top of the heap ( Minimum ) Compare the numbers . If it's smaller than the smallest number , Then continue to read the following numbers ; If it's bigger than the top number , Replace the top elements and readjust the heap to the minimum heap . The whole process until 1 All of them are traversed . Then output all of the current heap in the middle order traversal 100 A digital . The time complexity of reactor construction is O(m), The time complexity of heap adjustment is O(logm), The final time complexity =1 Time of secondary reactor building +n Time of secondary heap adjustment , So the time complexity of this algorithm is O(nlogm), The space complexity is 100( constant ).
Code example :
We can first use HashMap To store the mapping between questions and reading volume
// Pseudo code
String read;
HashMap<String, Integer> map = new HashMap<String, Integer>();
while (fileStream.hasNext()){
url = read();
if (map.containsKey(read)){
map.put(read,map.get(read)+1);
}else{
map.put(read, 1);
}
}after , We use a length of 100 Of The smallest pile To find the hottest 100 Data .
PriorityQueue<Map.Entry<String, Integer>> queue =
new PriorityQueue<Map.Entry<String, Integer>>(100,
new Comparator<Map.Entry<String, Integer>>(){
@Override public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2) {
return o1.getValue() - o2.getValue();
}
});
int k = 100;
for (Map.Entry<String, Integer> entry : map.entrySet()){
if (queue.size() < k){
queue.add(entry);
} else {
if (entry.getValue() > queue.peek().getValue()){
queue.poll();
queue.add(entry);
}
}
}边栏推荐
- Small change project (two versions) with detailed ideas
- An article takes you into the world of pycharm - stop asking me about pycharm installation and environment configuration!!!
- Why do server programs need to listen first
- How to use Fiddler for weak network testing
- Simple use of enum
- Interview question: talk about your understanding of AQS
- Basic usage of two-dimensional array
- vs2019 release模式调试:此表达式有副作用,将不予计算。
- 排序(冒泡排序)后面学习持续更新其它排序方法
- [question 22] dungeons and Warriors (Beijing Institute of Technology / Beijing Institute of Technology / programming methods and practice / primary school)
猜你喜欢

7行代码让B站崩溃3小时

Log4j 漏洞仍普遍存在?

8000字讲透OBSA原理与应用实践

Deepfake 换脸真假难辨,马斯克分克已伪装成功

Inventory Poka ecological potential project | cross chain characteristics to promote the prosperity of multi track

Drawing three coordinate (axis) diagram with MATLAB

异常-Exception

IDEA常用快捷键及设置方法

零钱通项目(两个版本)含思路详解

The design idea of relational database is obvious to you in 20 pictures
随机推荐
【OBS】P B 丢帧阈值 buffer_duration_usec
JVM memory model interview summary
2021-11-05理解main方法语法、代码块及final关键字
Matlab 绘制风速、风向统计玫瑰花图
二维数组的基本用法
Shengyang technology officially launched the remote voiceprint health return visit service system!
Project analysis (from technology to project and product)
How to learn object Defineproperty | an article takes you to quickly learn
Enumeration and annotation
Cloud native microservices Chapter 3: haproxy+kept
[question 22] dungeons and Warriors (Beijing Institute of Technology / Beijing Institute of Technology / programming methods and practice / primary school)
Pythia: Facebook's latest open source visual and language multitasking learning framework
@Can component be used on the same class as @bean?
Software testing interview question: what is the focus of unit testing, integration testing, and system testing?
一种比读写锁更快的锁,还不赶紧认识一下
What is eplato cast by Plato farm on elephant swap? Why is there a high premium?
Behind every piece of information you collect, you can't live without TA
对象在内存中存在形式&内存分配机制
B站崩了,那晚负责修复的开发人员做了什么?
Nano semiconductor 65W gallium nitride (GAN) scheme was adopted by Xiaomi 10 Pro charger