Top k questions of interview
2022-07-03 16:23:00 【Zuo Haifeng blog】
Top k questions ask you to find the k largest or smallest items in a large amount of data.
First example: you have 4 GB of memory and 10 million IPv4 addresses. Find the 100 addresses that appear most often. Take the IPv4 address 192.168.111.222. As a string it is 15 chars, and assuming two bytes per char, each address takes 30 bytes. 10 million addresses therefore take 300 million bytes, which is about 0.3 GB, so there is enough memory. If each IPv4 address is stored as an int (4 bytes, about 40 MB in total), it takes even less memory.
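The int storage mentioned above can be sketched in Python with the standard `ipaddress` module. The helper names `ip_to_int` and `int_to_ip` are made up for this illustration and are not from the original post.

```python
import ipaddress


def ip_to_int(ip: str) -> int:
    """Pack a dotted IPv4 string into a single 32-bit integer."""
    return int(ipaddress.IPv4Address(ip))


def int_to_ip(value: int) -> str:
    """Turn a 32-bit integer back into a dotted IPv4 string."""
    return str(ipaddress.IPv4Address(value))


# Rough memory estimate from the text: 10 million addresses.
ADDRESS_COUNT = 10_000_000
BYTES_AS_STRING = ADDRESS_COUNT * 30  # 15 chars at 2 bytes each
BYTES_AS_INT = ADDRESS_COUNT * 4      # one 32-bit int per address
```

These figures only cover the raw address data. A real hashmap adds overhead per entry, so the actual footprint would be larger.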
1. When memory is sufficient:
Count first, then select.
Build a key-value structure that records how many times each ip appears. A hashmap or an array mapping both work. After counting, find the largest or smallest entries. (The selection here is different from heap sort.) For example, to find the top 100, assume the first 100 entries in the hashmap or array are the largest and record the smallest of them. Then compare each new count with that smallest one: if the new count is larger, keep it and throw the smallest away, otherwise skip it because it is too small.
A heap reduces the cost of finding that smallest entry. For the 100 largest use a min-heap, and for the 100 smallest use a max-heap. For example, to find the top 100, create a min-heap of 100 nodes. Maintaining this heap is different from heap sort. Heap sort builds all nodes into one heap, then repeatedly swaps the top of the heap with the last node and removes the last node after the swap. Here we maintain a heap of 100 nodes whose size never changes.
Then walk through the hashmap or array we counted. Treat the first 100 entries as the provisional most frequent ones and build a min-heap from them. Starting with entry 101, compare each entry with the top of the heap. If it is larger than the top, replace the top with it and restore the heap. Then compare entry 102, and so on. At the end the heap holds the 100 most frequent entries. The complexity is O(n log m), where n is the number of counted entries (at most the 10 million addresses) and m is 100.
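The count-then-heap approach described above could look roughly like this in Python. It is a minimal sketch: the function name `top_k_frequent` is hypothetical, and `collections.Counter` stands in for the hashmap.

```python
import heapq
from collections import Counter


def top_k_frequent(items, k):
    """Return the k most frequent items as (count, item), most frequent first."""
    if k <= 0:
        return []
    counts = Counter(items)  # step 1: count occurrences
    heap = []                # step 2: min-heap that never grows past k entries
    for item, count in counts.items():
        if len(heap) < k:
            # The first k entries fill the heap unconditionally.
            heapq.heappush(heap, (count, item))
        elif count > heap[0][0]:
            # Larger than the smallest kept count: replace the heap top.
            heapq.heapreplace(heap, (count, item))
    # Ties on count are broken by comparing the items themselves,
    # so the items are assumed to be mutually comparable.
    return sorted(heap, reverse=True)
```

For example, `top_k_frequent(["a", "a", "a", "b", "b", "c"], 2)` keeps only two heap entries at any time and returns `[(3, "a"), (2, "b")]`.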
2. When memory is insufficient:
You are given 100 GB of text with two columns: one is the ip address, the other is the url. Find the top 100 urls by the number of different ips that visited them. You have a machine with 4 GB of memory and an unlimited hard disk.
1.2.3.4 baidu.com
1.2.3.4 taobao.com
1.2.3.3 taobao.com
1.2.2.2 baidu.com
1.2.2.2 baidu.com
Here baidu.com was visited by 2 different ips, and taobao.com also by 2.
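The counts for this sample can be checked with a short Python sketch. The name `distinct_ip_counts` is made up for illustration, and everything fits in memory here only because the sample is tiny.

```python
from collections import defaultdict

SAMPLE = """\
1.2.3.4 baidu.com
1.2.3.4 taobao.com
1.2.3.3 taobao.com
1.2.2.2 baidu.com
1.2.2.2 baidu.com
"""


def distinct_ip_counts(text):
    """Map each url to the number of different ips that visited it."""
    ips_by_url = defaultdict(set)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        ip, url = parts
        ips_by_url[url].add(ip)  # a set ignores repeated visits from one ip
    return {url: len(ips) for url, ips in ips_by_url.items()}
```

The repeated `1.2.2.2 baidu.com` line counts only once, which is why baidu.com ends up at 2 rather than 3.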
1. Ask whether more machines are available. If so, the data can be processed by multiple machines, each handling a portion.
2. Use existing software such as MySQL and find the answer with an SQL statement.
3. Use the hard disk. Create one file per domain name and write the ips that visited that domain into it. Or use a hash: hash the domain name and take the remainder modulo n, so that all records for the same domain land in the same file. During counting, read one file at a time, so there is no need to load all the files into memory, and count the distinct ips for each domain in that file. After the counting is complete, the heap from the first example can be used to find the top 100.
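Option 3 might be sketched as follows in Python. This is an illustration under assumptions, not a tuned implementation. The function names, the `bucket_<i>.txt` file naming, and the use of `zlib.crc32` as the hash are all choices made for this example. It also assumes each bucket file fits in memory once it is read back.

```python
import heapq
import os
import zlib


def partition_by_url(input_path, out_dir, n_buckets):
    """Split the input into n_buckets files so each url lands in exactly one."""
    os.makedirs(out_dir, exist_ok=True)
    paths = [os.path.join(out_dir, f"bucket_{i}.txt") for i in range(n_buckets)]
    outs = [open(p, "w", encoding="utf-8") for p in paths]
    try:
        with open(input_path, encoding="utf-8") as src:
            for line in src:  # streams the input, one line in memory at a time
                parts = line.split()
                if len(parts) != 2:
                    continue  # skip blank or malformed lines
                ip, url = parts
                bucket = zlib.crc32(url.encode("utf-8")) % n_buckets
                outs[bucket].write(f"{ip} {url}\n")
    finally:
        for f in outs:
            f.close()
    return paths


def top_urls_by_distinct_ips(bucket_paths, k):
    """Process one bucket at a time and keep the k best urls in a min-heap."""
    heap = []
    for path in bucket_paths:
        ips_by_url = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                ip, url = line.split()
                ips_by_url.setdefault(url, set()).add(ip)
        # A url never spans two buckets, so its count is final here.
        for url, ips in ips_by_url.items():
            entry = (len(ips), url)
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                heapq.heapreplace(heap, entry)
    return sorted(heap, reverse=True)
```

Hashing on the url rather than the ip is the key design choice. It guarantees that the distinct-ip set for a url can be completed from a single file. A very large `n_buckets` may hit the operating system's limit on open files, in which case the partitioning could be done in several passes.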