当前位置:网站首页>Bron filter Course Research Report
Bron filter Course Research Report
2022-06-28 09:33:00 【Zhang 2 childe】
Introduction : The content of this report is bron filter , A more comprehensive introduction to the bloan filter , From the background and significance 、 Algorithm description 、 Advantages and disadvantages 、 Describe the bloom filter from the perspective of application scenarios . And I also want to introduce the distributed bloom filter to you
1 The bloon filter
1.1 What is a bloon filter
Bloom filter is a space saving probabilistic data structure , It can be used for various problems . They have been successfully used in web cache 、 Fully decentralized computation of aggregate functions 、 Efficient memory genome assembly 、 Collection Coordination in distributed systems 、 More efficient block propagation in blockchain architecture and many other aspects . It's actually a very long binary vector and a series of random mapping functions , The bloom filter can be used to retrieve whether an element is in a collection .
Although there are many variants of the bloom filter , But most have the same core idea : Bloom filter can quickly verify whether an item exists in a set with the lowest space requirements . This is achieved by sacrificing some precision to get less space . for example , When checking the presence of elements in the bloom filter , You may get some false positive matches , But never get a false negative match . let me put it another way , The bloom filter can prove that an item is not in the set , Or it might be in a collection .
1.2 Background and significance
In real business , We will encounter many business scenarios to determine whether an element is in a collection , The common solution is to save all the elements in the collection , And then through comparison to determine . Linked list 、 Trees 、 Hash tables and other data structures are all based on this idea . But as the number of elements in the collection increases , The amount of storage we need will grow linearly , Finally, the bottleneck . At the same time, the retrieval speed is getting slower and slower , The retrieval time complexity of the above three structures are respectively O ( n ) O(n) O(n), O ( l o g n ) O(logn) O(logn), O ( 1 ) O(1) O(1), This is the time , The bloon filter (Bloom Filter) It came into being .
Its essence is to use multiple hash functions , Map a data into a bitmap structure . This method can not only improve the query efficiency , It can also save a lot of memory space .
1.3 Bloom filter algorithm principle
To better understand how this works , Let's look at the algorithm :
1、 Defined k A separate hash function ( among “k” Is the hash function used )
2、 Defined a m Bit long zero bit array .
3、 Direction bloom When a filter inserts an element , We hash the elements k Time , As defined in step 1 . Each hash value is used to point to the index of the zero bit array ( Steps in 2 Definition ). The bits at these indexes are then shifted from 0 Switch to 1, Pictured 1 Shown .
1.3.1 Insert search ( There is no false report )
Suppose we were bloom Several elements are inserted into the filter , Now let's check if there are any specific elements . So , We just hash the elements k Times and find the given index . If bloom In the filter k One of the indexes has a bit of zero , We can come to a conclusion , The given element was never inserted bloom filter . let me put it another way , We can always know if an element is not in the collection , namely bloom There will be no false positives in the filter .
1.3.2 Check if the collection exists ( It's possible to misreport )
When we want to check whether there are elements in the collection , It's a little different . This is a query that may lead to false positives in order to better understand why this happens , Let's look at the picture 2.
We have elements Y and Z And three hash functions to map values to bloom filter . Let's assume that Y The result hash of is {11,6,1},Z The result hash of is {5,2,9}. then , We switch all bits of a given index to one bit . If someone wants to verify Y or Z Whether it really exists in the set , You will get a positive result , This is because Y and Z All indexes of are one . However , There's a problem . If there is another element , We call it X, It happens to have {1,5,2} As a hash , Then you will get a false positive .{1,5,2} The position at is already 1, But we never will X Insert bloom filter . therefore , We don't know X Whether it is really inserted into bloom In the filter .
1.4 Advantages and disadvantages
The bloom filter has the following advantages : Because it stores binary data , So it takes up very little space ; Its insertion and query speed is very fast , The time complexity is O(K); And the confidentiality is very good , Because it doesn't store any raw data , Only binary data . People can use hash tables to do the same thing , Without the need for bloom Probability characteristics of the filter , but bloom The minimum space requirement provided by the filter makes it a useful data structure for many problems .
But there are also some disadvantages : Because this is a probabilistic data structure , May not be accurate , It can only be determined that an element must not exist or may exist , It is not certain that an element is really difficult to delete , Due to the mapping K Multiple elements may be shared in a point , In the process of deleting , Changes in these points may involve changes in other values , So it's not easy to delete
2 Distributed bloom filter
2.1 Unique bloom filter mapping
Distributed bloom filter is a probabilistic data structure , It is suitable for distributed systems that need to be synchronized quickly in a way that saves space and time . The way to do this is very simple : Each node interacts with another node , Will send a with a unique mapping bloom filter , Instead of sending the same... To different nodes bloom filter . When we do this , The probability of an element being copied to another node changes slightly
2.2 Probability of missing elements in distributed structured Bloom filters
Now? , We calculated N Probability of nodes not distinguishing missing elements :
N In this case, the number of other nodes being contacted . This little change has some interesting implications . our bloom The filter is suddenly allowed to have a high false positive rate , This means that relative to the set n Size ,bloom The size of the filter m Much smaller . When we use different mappings bloom Filter contacts others n When a node , The false positive hit rate of all nodes to elements decreases with each added node .
reference
[1] Lum Ramabaja,Arber Avdullahu The Distributed Bloom Filter 2019,October
[2] D. Guo, M. Li, Set Reconciliation via Counting Bloom Filters, IEEE Transactions on Knowledge and Data Engineering 25 (2013) 2367–2380.
[3] M. T. Goodrich, M. Mitzenmacher, Invertible Bloom Lookup Tables (2011).
边栏推荐
猜你喜欢

File operations in QT

1. Kimball dimension modeling of data warehouse: what is a fact table?

Key summary V of PMP examination - execution process group

Deployment of MySQL database in Linux Environment

P2394 yyy loves Chemistry I

A classic JVM class loaded interview question class singleton{static singleton instance = new singleton(); private singleton() {}

Full link service tracking implementation scheme

1182:合影效果

Ingersoll Rand panel maintenance IR Ingersoll Rand microcomputer controller maintenance xe-145m

为什么SELECT * 会导致查询效率低?
随机推荐
Xiaomi's payment company was fined 120000 yuan, involving the illegal opening of payment accounts, etc.: Lei Jun is the legal representative, and the products include MIUI wallet app
Abnormal occurrence and solution
Illustration of MySQL binlog, redo log and undo log
Full link service tracking implementation scheme
数据挖掘建模实战
Use of Jasper soft studio report tool and solution of thorny problems
大纲笔记软件 Workflowy 综合评测:优点、缺点和评价
Do static code blocks always execute first? The pattern is smaller!!!
Screen settings in the source code of OBS Live Room
理解IO模型
Learn how Alibaba manages the data indicator system
Why does select * lead to low query efficiency?
Common tools for interface testing --postman
Linux下安装redis 、Windows下安装redis(超详细图文教程)
Ingersoll Rand面板维修IR英格索兰微电脑控制器维修XE-145M
DolphinScheduler使用系统时间
Is it safe for Huatai Securities to open an account online? What is the handling process
Data modeling based on wide table
Edit the live broadcast source code with me. How to write the live broadcast app code
P2394 yyy loves Chemistry I