Top-K interview questions
2022-07-03 16:23:00 【Zuo Haifeng blog】
Top-K questions ask you to find the k largest or smallest items in a large amount of data.
First example: you have 4 GB of memory and 10 million IPv4 addresses, and must find the 100 addresses that appear most often. Take the IPv4 address 192.168.111.222 as the longest case: as a string it is 15 chars, and at two bytes per char each address takes 30 bytes of storage. Ten million addresses therefore need 300 million bytes, which is about 0.3 GB, so there is enough memory. If each IPv4 address is stored as an int, the memory needed is smaller still.
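The estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the constant names and the helper functions `ip_to_int` and `int_to_ip` are made up for illustration, and the 4-bytes-per-address figure assumes a plain 32-bit integer with no container overhead.

```python
import ipaddress

NUM_ADDRESSES = 10_000_000   # 10 million addresses, as in the example
CHARS_PER_ADDRESS = 15       # "192.168.111.222" is the longest dotted form
BYTES_PER_CHAR = 2           # the text's assumption of two bytes per char

# Storage if every address is kept as a string.
string_bytes = NUM_ADDRESSES * CHARS_PER_ADDRESS * BYTES_PER_CHAR

# Storage if every address is kept as a 32-bit integer (assumed: 4 bytes each).
int_bytes = NUM_ADDRESSES * 4


def ip_to_int(ip: str) -> int:
    """Convert dotted IPv4 text to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip))


def int_to_ip(value: int) -> str:
    """Convert a 32-bit integer back to dotted IPv4 text."""
    return str(ipaddress.IPv4Address(value))


print(f"as strings: {string_bytes / 1e9:.2f} GB")
print(f"as ints:    {int_bytes / 1e9:.2f} GB")
```

Either way the data fits comfortably in 4 GB, so the counting can be done entirely in memory.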
1. When memory is sufficient:
Count first, then sort.
Build a k,v structure that records how many times each ip appears. A hashmap or an array mapping both work. After counting, sort to find the largest or smallest entries. (The sorting here is different from heap sort.) For example, to find the top 100, assume the first 100 entries in the hashmap or array are the largest and record only the smallest of them. Then compare each new entry with that recorded one: if the new entry is larger it replaces the record, and if it is smaller it is thrown away.
A heap reduces the complexity of this. Use a min-heap (small top heap) to find the 100 largest and a max-heap (big top heap) to find the 100 smallest. To find the top 100, create a min-heap of 100 nodes. Maintaining this heap differs from heap sort: heap sort builds all nodes into one heap, then repeatedly swaps the top of the heap with the last node and removes the last node after the swap. Here we only maintain a heap of 100 nodes, and its size never changes.
Next comes the hashmap or array that holds our counts. Treat the first 100 entries as the provisional most frequent and build a min-heap from them. Then traverse from entry 101 onward, comparing each entry with the top of the heap. If the entry is larger than the top, it replaces the top and the min-heap is re-adjusted; after that, compare entry 102, and so on. In the end the heap holds the 100 most frequent entries. The complexity is O(n log m), where n stands for the 10 million and m stands for 100.
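A minimal sketch of the count-then-heap procedure described above, using Python's standard `collections.Counter` and `heapq` modules. The function name `top_k_frequent` is hypothetical, and the code assumes the items (for example ip strings) are hashable and comparable with each other.

```python
import heapq
from collections import Counter


def top_k_frequent(items, k):
    """Return up to k (count, item) pairs with the highest counts."""
    if k <= 0:
        return []

    # Step 1: count occurrences with a hashmap.
    counts = Counter(items)

    # Step 2: keep a min-heap of at most k entries.
    # heap[0] is always the smallest count currently kept.
    heap = []
    for item, count in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (count, item))
        elif count > heap[0][0]:
            # Larger than the heap top: replace the top and re-adjust.
            heapq.heapreplace(heap, (count, item))
        # Otherwise the entry is too small and is discarded.

    # Largest counts first.
    return sorted(heap, reverse=True)


ips = ["1.1.1.1"] * 3 + ["2.2.2.2"] * 2 + ["3.3.3.3"]
print(top_k_frequent(ips, 2))
```

Because the heap never grows past k entries, each push or replace costs O(log k), which is where the n log m bound in the text comes from.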
2. When memory is insufficient:
You are given 100 GB of text. The text has the following two columns: one is an ip address, the other is a url. Find the top 100 urls ranked by the number of different ips that visited them. You are given a machine with 4 GB of memory and unlimited hard disk.
1.2.3.4 baidu.com
1.2.3.4 taobao.com
1.2.3.3 taobao.com
1.2.2.2 baidu.com
1.2.2.2 baidu.com
Here baidu.com counts as 2 (1.2.3.4 and 1.2.2.2) and taobao.com also counts as 2 (1.2.3.4 and 1.2.3.3).
1. Are more machines available? If so, the data can be processed by multiple machines, with each machine handling a portion.
2. Borrow existing software such as mysql and find the answer with an sql statement.
3. Sort with the help of the hard disk. Name a file after each domain name and put the ips that visited that domain into it. Or use a hash: hash the domain name, take the remainder modulo n, and put all records for the same domain into the same file for counting. During counting, read one file at a time, so there is no need to hold all the files in memory, and count the distinct ips for each domain in that file. After the counting is complete, the heap method can be used to pick the top 100.
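Option 3 can be sketched roughly as follows. The function names, the bucket file naming and the use of `zlib.crc32` as the hash are all assumptions made for this illustration. The sketch also assumes every line has the form `ip url`, and that the distinct ips of any single bucket fit in memory; a very hot domain might need a larger bucket count or a second level of partitioning.

```python
import heapq
import os
import zlib
from collections import defaultdict


def partition_by_url(input_path, bucket_dir, num_buckets):
    """Split 'ip url' lines so that each url lands in exactly one bucket file."""
    buckets = [
        open(os.path.join(bucket_dir, f"bucket_{i}.txt"), "w")
        for i in range(num_buckets)
    ]
    try:
        with open(input_path) as src:
            for line in src:
                parts = line.split()
                if len(parts) != 2:
                    continue  # skip malformed lines
                ip, url = parts
                # Hash the url and take the remainder to choose a bucket.
                index = zlib.crc32(url.encode()) % num_buckets
                buckets[index].write(f"{ip} {url}\n")
    finally:
        for handle in buckets:
            handle.close()


def top_urls_by_distinct_ips(bucket_dir, num_buckets, k):
    """Read one bucket at a time and keep the k urls with the most distinct ips."""
    if k <= 0:
        return []
    heap = []  # min-heap of (distinct_ip_count, url), at most k entries
    for i in range(num_buckets):
        ips_by_url = defaultdict(set)
        with open(os.path.join(bucket_dir, f"bucket_{i}.txt")) as bucket:
            for line in bucket:
                ip, url = line.split()
                ips_by_url[url].add(ip)  # the set removes repeated ips
        for url, ips in ips_by_url.items():
            entry = (len(ips), url)
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                heapq.heapreplace(heap, entry)
    return sorted(heap, reverse=True)
```

Since all records for a given url end up in one bucket, each per-bucket count is already final, and only the small heap is carried from one file to the next.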