Bloom Filter Course Research Report

2022-06-28 09:33:00 【Zhang 2 childe】

Introduction: This report gives a fairly comprehensive introduction to the Bloom filter, covering its background and significance, the algorithm, its advantages and disadvantages, and its application scenarios. It also introduces the distributed Bloom filter.

1 The Bloom filter

1.1 What is a Bloom filter

A Bloom filter is a space-efficient probabilistic data structure that can be applied to a wide range of problems. Bloom filters have been used successfully in web caching, fully decentralized computation of aggregate functions, memory-efficient genome assembly, set reconciliation in distributed systems, more efficient block propagation in blockchain architectures, and many other areas. Concretely, a Bloom filter is a very long binary vector (a bit array) together with a series of random mapping (hash) functions, and it is used to test whether an element is in a set.
Although there are many variants of the Bloom filter, most share the same core idea: a Bloom filter can quickly check whether an item is in a set while using very little space. It achieves this by trading some accuracy for space. For example, when checking whether an element is in a Bloom filter, you may get a false positive, but you will never get a false negative. In other words, a Bloom filter can tell you that an item is definitely not in the set, or that it might be in the set.

1.2 Background and significance

In real business systems, we often need to determine whether an element is in a set. The common solution is to store all the elements of the set and decide by comparison. Linked lists, trees, and hash tables are all based on this idea. But as the number of elements grows, the storage required grows linearly and eventually becomes a bottleneck. Lookups can also become slower: the lookup time complexities of these three structures are $O(n)$, $O(\log n)$, and $O(1)$ respectively. This is the setting in which the Bloom filter came into being.
In essence, it uses multiple hash functions to map each item into a bitmap structure. This approach not only improves query efficiency but can also save a lot of memory.
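To give a rough sense of the memory savings, the sketch below uses the standard textbook approximations for sizing a Bloom filter: m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions for n elements and a target false positive rate p. These formulas and the function names `optimal_bits` and `optimal_hashes` are not part of this report; they are assumptions added for illustration.

```python
import math


def optimal_bits(n: int, p: float) -> int:
    """Approximate number of bits m for n elements and false positive rate p."""
    return math.ceil(-n * math.log(p) / (math.log(2) ** 2))


def optimal_hashes(m: int, n: int) -> int:
    """Approximate number of hash functions k for m bits and n elements."""
    return max(1, round(m / n * math.log(2)))


n = 1_000_000   # illustrative number of elements
p = 0.01        # illustrative target false positive rate
m = optimal_bits(n, p)
k = optimal_hashes(m, n)
print(f"bits: {m}, bytes: {m // 8}, hash functions: {k}")
```

With these illustrative numbers the filter needs a little under 10 bits per element (roughly 1.2 MB in total), regardless of how large each stored element is.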

1.3 Bloom filter algorithm principle

To better understand how this works, let's look at the algorithm:
1. Define k independent hash functions (where k is the number of hash functions used).
2. Define a bit array of length m with every bit set to 0.
3. To insert an element into the Bloom filter, hash the element k times, once with each hash function defined in step 1. Each hash value is used as an index into the bit array defined in step 2. The bits at these indexes are then switched from 0 to 1, as shown in Figure 1.
[Figure 1: inserting an element into a Bloom filter]
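The three steps above can be sketched in Python as follows. This is a minimal illustration, not a production implementation: the class name `BloomFilter`, the method names, and the way the k hash functions are derived from SHA-256 (by prefixing the item with the function number) are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array plus k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m              # number of bits in the array
        self.k = k              # number of hash functions
        self.bits = [0] * m     # step 2: an m-bit array of zeros

    def _indexes(self, item: str) -> list[int]:
        # Step 1: derive k hash functions by prefixing the item with the
        # function number i (an illustrative choice, not from the report).
        return [
            int.from_bytes(
                hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big"
            ) % self.m
            for i in range(self.k)
        ]

    def add(self, item: str) -> None:
        # Step 3: set the bit at each of the k indexes to 1.
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def might_contain(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[idx] for idx in self._indexes(item))


bf = BloomFilter(m=64, k=3)
bf.add("apple")
print(bf.might_contain("apple"))  # True: an inserted item is always reported
```

Both `add` and `might_contain` touch exactly k positions, so their cost does not depend on how many elements have been inserted.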

1.3.1 Checking that an element is absent (no false negatives)

Suppose we have inserted several elements into the Bloom filter and now want to check whether a specific element is present. To do so, we hash the element k times and look up the resulting indexes. If any of the k indexed bits in the Bloom filter is 0, we can conclude that the element was never inserted. In other words, we can always tell when an element is not in the set: a Bloom filter never produces false negatives.

1.3.2 Checking that an element is present (false positives are possible)

Checking whether an element is in the set is a little different, because this query can produce false positives. To see why, look at Figure 2.
[Figure 2: elements Y and Z mapped into a Bloom filter]

We have elements Y and Z and three hash functions that map values into the Bloom filter. Suppose the hashes of Y are {11, 6, 1} and the hashes of Z are {5, 2, 9}. We then set the bits at all of these indexes to 1. If someone checks whether Y or Z is in the set, the result is positive, because all of the bits indexed by Y and Z are 1. However, there is a problem. Suppose another element, call it X, happens to hash to {1, 5, 2}. The bits at {1, 5, 2} are already 1, even though we never inserted X into the Bloom filter, so the query for X returns a false positive. From the filter alone, we therefore cannot tell whether X was really inserted.
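The Y, Z, and X example can be replayed directly with the index sets given above. In this sketch the hash values are hard-coded rather than computed, the array length of 12 is an assumption (just large enough to hold index 11), and the extra element W with indexes {3, 6, 1} is hypothetical, added only to show a definite negative.

```python
M = 12            # assumed array length, large enough for index 11
bits = [0] * M

# Index sets from the text; W is a hypothetical extra element.
hashes = {
    "Y": [11, 6, 1],
    "Z": [5, 2, 9],
    "X": [1, 5, 2],   # never inserted
    "W": [3, 6, 1],   # never inserted
}


def insert(name: str) -> None:
    # Set every bit the element maps to.
    for idx in hashes[name]:
        bits[idx] = 1


def might_contain(name: str) -> bool:
    # Positive only if every bit the element maps to is 1.
    return all(bits[idx] == 1 for idx in hashes[name])


insert("Y")
insert("Z")

print(might_contain("Y"))  # True
print(might_contain("Z"))  # True
print(might_contain("X"))  # True, although X was never inserted (false positive)
print(might_contain("W"))  # False, bit 3 is still 0 (definitely absent)
```

X is reported as present because bit 1 was set by Y and bits 5 and 2 were set by Z.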

1.4 Advantages and disadvantages

The Bloom filter has the following advantages. Because it stores only bits, it takes up very little space. Insertion and lookup are very fast, with time complexity $O(k)$. It also offers good confidentiality, because it stores no raw data, only bits. A hash table can do the same job without the probabilistic behavior of a Bloom filter, but the very small space requirement of the Bloom filter makes it a useful data structure for many problems.
There are also some disadvantages. Because it is a probabilistic data structure, its answers may be inaccurate: it can only determine that an element is definitely absent or possibly present, never that an element is definitely present. Deletion is also difficult: the k bits that an element maps to may be shared with other elements, so changing those bits during deletion may affect other elements.
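A small sketch shows why deletion is a problem. The elements A and B and their index sets are hypothetical, chosen so that they share index 1, and `naive_delete` is an illustrative helper that simply clears an element's bits.

```python
bits = [0] * 12

A = [1, 5, 2]     # hypothetical indexes for element A
B = [11, 6, 1]    # hypothetical indexes for element B (shares index 1 with A)


def insert(indexes: list[int]) -> None:
    for idx in indexes:
        bits[idx] = 1


def might_contain(indexes: list[int]) -> bool:
    return all(bits[idx] == 1 for idx in indexes)


def naive_delete(indexes: list[int]) -> None:
    # Clearing bits is unsafe: other elements may rely on the same bits.
    for idx in indexes:
        bits[idx] = 0


insert(A)
insert(B)
naive_delete(B)

print(might_contain(A))  # False: A now looks absent, a false negative
```

Clearing the shared bit breaks the guarantee that an inserted element is always found. Variants such as the counting Bloom filter address this by storing a small counter at each position instead of a single bit.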

2 Distributed Bloom filter

2.1 Unique Bloom filter mappings

The distributed Bloom filter is a probabilistic data structure suited to distributed systems that need to synchronize quickly in a space- and time-efficient way. The idea is simple: whenever a node interacts with another node, it sends a Bloom filter with a unique mapping, instead of sending the same Bloom filter to every node. As a result, the probability that an element is replicated to another node changes slightly.
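One way to picture a "unique mapping" is to mix a per-peer seed into every hash, so that the same set of elements produces a different bit pattern for each peer. The sketch below does this with SHA-256. The seeding scheme, the peer names, and the function names are assumptions made for illustration and are not taken from the distributed Bloom filter design itself.

```python
import hashlib


def indexes(item: str, seed: str, m: int, k: int) -> list[int]:
    # Mixing the seed into each hash gives every seed its own mapping.
    return [
        int.from_bytes(
            hashlib.sha256(f"{seed}:{i}:{item}".encode()).digest()[:8], "big"
        ) % m
        for i in range(k)
    ]


def build_filter(items: list[str], seed: str, m: int, k: int) -> list[int]:
    bits = [0] * m
    for item in items:
        for idx in indexes(item, seed, m, k):
            bits[idx] = 1
    return bits


def might_contain(bits: list[int], item: str, seed: str, k: int) -> bool:
    return all(bits[idx] for idx in indexes(item, seed, len(bits), k))


local_items = ["a", "b", "c"]
peers = ["node-1", "node-2"]   # hypothetical peer identifiers used as seeds

# Build one filter per peer instead of reusing a single filter for all peers.
filters = {peer: build_filter(local_items, peer, m=32, k=2) for peer in peers}

for peer, bits in filters.items():
    # Every inserted item is still found, whichever mapping is used.
    print(peer, all(might_contain(bits, x, peer, 2) for x in local_items))
```

Because each peer receives a differently mapped filter, an element that happens to be a false positive under one mapping is unlikely to be a false positive under all of them.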

2.2 Probability of missing elements in distributed structured Bloom filters

Now we calculate the probability that N nodes all fail to detect a missing element:
[Equation image]

Here N is the number of other nodes being contacted. This small change has some interesting implications. The Bloom filter can now tolerate a high false positive rate, which means the filter size m can be much smaller relative to the set size n. When a node contacts N other nodes using differently mapped Bloom filters, the probability that an element is a false positive for all of them decreases with each added node.
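The effect can be illustrated numerically. The sketch below assumes the standard approximation p = (1 - e^(-kn/m))^k for the false positive rate of a single filter, and assumes that differently mapped filters fail independently, so that the probability of a false positive in all N filters is p^N. Both assumptions and the function names are added here for illustration and are not quoted from the source.

```python
import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Standard approximation for one filter with m bits, n items, k hashes."""
    return (1.0 - math.exp(-k * n / m)) ** k


def all_nodes_miss(p: float, num_nodes: int) -> float:
    """Probability that num_nodes independent filters all give a false positive."""
    return p ** num_nodes


# A deliberately small filter: one bit per element and a single hash function.
p = false_positive_rate(m=1000, n=1000, k=1)
print(f"single filter: {p:.3f}")

for num_nodes in (1, 2, 5, 10):
    print(num_nodes, f"{all_nodes_miss(p, num_nodes):.4f}")
```

Even though a single filter of this size is wrong about 63% of the time, under the independence assumption the chance that ten differently mapped filters are all wrong about the same element drops to roughly 1%.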

References

[1] L. Ramabaja, A. Avdullahu, The Distributed Bloom Filter (October 2019).
[2] D. Guo, M. Li, Set Reconciliation via Counting Bloom Filters, IEEE Transactions on Knowledge and Data Engineering 25 (2013) 2367–2380.
[3] M. T. Goodrich, M. Mitzenmacher, Invertible Bloom Lookup Tables (2011).

Copyright notice
This article was written by [Zhang 2 childe]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/179/202206280922396237.html