Top-K Interview Questions
2022-07-03 16:23:00 【Zuo Haifeng blog】
The task: find the k largest or smallest items in a large amount of data.
First example: you have 4 GB of memory and 10 million IPv4 addresses, and you need the 100 addresses that appear most often. Take an address such as 192.168.111.222. As a string it has 15 chars, and at two bytes per char each address takes 30 bytes. Ten million addresses therefore take 300 million bytes, which is about 0.3 GB, so memory is sufficient. If each IPv4 address is stored as an int (4 bytes), the footprint is smaller still, about 40 MB.
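The estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the constant names are made up for illustration, the 2 bytes per char figure is the text's own assumption, and `ipv4_to_int` is a hypothetical helper built on the standard `ipaddress` module.

```python
import ipaddress

NUM_ADDRESSES = 10_000_000   # 10 million addresses, as in the example
CHARS_PER_ADDRESS = 15       # "192.168.111.222" has 15 characters
BYTES_PER_CHAR = 2           # the text's assumption of 2 bytes per char

# Storage needed when every address is kept as a string.
string_bytes = NUM_ADDRESSES * CHARS_PER_ADDRESS * BYTES_PER_CHAR

# Storage needed when every address is kept as a 32-bit integer.
int_bytes = NUM_ADDRESSES * 4


def ipv4_to_int(address: str) -> int:
    """Convert a dotted IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(address))


print(string_bytes)                    # 300000000 bytes, about 0.3 GB
print(int_bytes)                       # 40000000 bytes, about 40 MB
print(ipv4_to_int("192.168.111.222"))  # 3232264158
```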
1. When memory is sufficient:
Count first, then select.
Build a key-value structure that records how many times each IP appears. A hashmap or an array mapping both work. After counting, find the largest or smallest counts. (The selection here is not the same as heap sort.)
For example, to find the top 100, assume the first 100 entries in the hashmap or array are the largest and record only the smallest of them. Then compare each new entry with that recorded minimum. If the new entry is larger, keep it and throw the recorded one away; if it is smaller, discard it.
A heap reduces the cost of this. Use a min-heap to find the 100 largest and a max-heap to find the 100 smallest. To find the 100 largest, create a min-heap of 100 nodes. Maintaining this heap differs from heap sort. Heap sort builds all nodes into one heap, then repeatedly swaps the top of the heap with the last node and removes that last node. Here we maintain a heap of exactly 100 nodes whose size never changes.
Then walk through the hashmap or array of counts. Treat the first 100 entries as the provisional most frequent and build a min-heap from them. Starting from the 101st entry, compare each entry with the top of the heap. If it is larger than the top, replace the top with it and restore the heap property, then move on to the 102nd entry, and so on. At the end the heap holds the 100 most frequent entries. The complexity is O(n log m), where n is the number of entries (10 million here) and m is 100.
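The count-then-select procedure described above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: `top_k_frequent` is a name chosen here, and the standard `collections.Counter` and `heapq` modules stand in for the hashmap and the min-heap.

```python
import heapq
from collections import Counter


def top_k_frequent(items, k):
    """Return the k most frequent items as (count, item) pairs, largest first."""
    counts = Counter(items)  # step 1: count how many times each item appears

    heap = []  # min-heap of (count, item); never grows beyond k entries
    for item, count in counts.items():
        if len(heap) < k:
            # The first k entries are taken as the provisional top k.
            heapq.heappush(heap, (count, item))
        elif count > heap[0][0]:
            # Larger than the smallest kept count: replace the heap top.
            heapq.heapreplace(heap, (count, item))
        # Otherwise the entry is too small and is discarded.

    return sorted(heap, reverse=True)


addresses = ["1.1.1.1", "2.2.2.2", "1.1.1.1", "3.3.3.3", "1.1.1.1", "2.2.2.2"]
print(top_k_frequent(addresses, 2))  # [(3, '1.1.1.1'), (2, '2.2.2.2')]
```

Because the heap never holds more than k entries, each push or replace costs O(log k), which gives the O(n log m) bound with m = k.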
2. When memory is insufficient:
You are given 100 GB of text with two columns: an IP address and a URL. Find the 100 URLs visited by the largest number of distinct IPs. You have one machine with 4 GB of memory and unlimited disk space.
1.2.3.4 baidu.com
1.2.3.4 taobao.com
1.2.3.3 taobao.com
1.2.2.2 baidu.com
1.2.2.2 baidu.com
Here baidu.com counts as 2 and taobao.com also counts as 2, because repeated visits from the same IP (1.2.2.2 to baidu.com) are counted only once.
1. Ask whether more machines are available. If so, the data can be split across several machines, each processing one portion.
2. Use existing software such as MySQL and get the answer with an SQL statement.
3. Partition the data on disk. Either create one file per domain name and write the IPs that visited that domain into it, or hash the domain name, take the remainder modulo n, and write all records for the same domain into the same file. During counting, read the files one at a time; there is no need to hold all of them in memory. For each file, count the number of distinct IPs per domain. Once the counts are available, use a heap to pick the top 100.
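The third approach (hash the domain, take the remainder modulo n, count each file separately, then select with a heap) can be sketched in Python as below. The function names, the bucket file names, and the use of `zlib.crc32` as the hash are assumptions made for illustration. The sketch also assumes that the records of any single bucket file fit in memory, which depends on choosing n large enough.

```python
import heapq
import os
import tempfile
import zlib


def partition_by_url(lines, directory, num_buckets):
    """Write each "ip url" record to the bucket file chosen by hashing the URL."""
    paths = [os.path.join(directory, f"bucket_{i}.txt") for i in range(num_buckets)]
    files = [open(path, "w", encoding="utf-8") for path in paths]
    try:
        for line in lines:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            ip, url = parts
            # The same URL always hashes to the same bucket file.
            bucket = zlib.crc32(url.encode("utf-8")) % num_buckets
            files[bucket].write(f"{ip} {url}\n")
    finally:
        for f in files:
            f.close()
    return paths


def top_urls_by_distinct_ips(paths, k):
    """Process one bucket file at a time and keep the k best URLs in a min-heap."""
    heap = []  # min-heap of (distinct_ip_count, url); at most k entries
    for path in paths:
        ips_per_url = {}  # only this bucket's data is held in memory
        with open(path, encoding="utf-8") as f:
            for line in f:
                ip, url = line.split()
                ips_per_url.setdefault(url, set()).add(ip)
        for url, ips in ips_per_url.items():
            entry = (len(ips), url)
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                heapq.heapreplace(heap, entry)
    return sorted(heap, reverse=True)


sample = [
    "1.2.3.4 baidu.com",
    "1.2.3.4 taobao.com",
    "1.2.3.3 taobao.com",
    "1.2.2.2 baidu.com",
    "1.2.2.2 baidu.com",
]
with tempfile.TemporaryDirectory() as tmp:
    bucket_paths = partition_by_url(sample, tmp, num_buckets=4)
    result = top_urls_by_distinct_ips(bucket_paths, k=2)
print(result)  # [(2, 'taobao.com'), (2, 'baidu.com')]
```

`zlib.crc32` is used instead of the built-in `hash` because Python randomizes string hashes per process, while the bucket assignment should stay stable if partitioning and counting run as separate jobs. Since every record for a given URL lands in one file, the per-file distinct counts are already final and need no merging across files.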