当前位置：网站首页>Redis bloom filter

Redis bloom filter

2022-07-02 12:37:00 【Squat in the corner and count the ants】

What is a bloon filter

In essence, the bloom filter is a data structure , A more ingenious probabilistic data structure （probabilistic data structure）, Features efficient insertion and query , It can be used to tell you “ Something must not exist or may exist ”.

Compared to traditional List、Set、Map And so on , It's more efficient 、 Take up less space , But the disadvantage is that the returned result is probabilistic , Not exactly .

Realization principle

HashMap The problem of

Before we talk about the principle of Bloom filter , Let's think about , Usually you judge whether an element exists by what ？ Many people should answer HashMap Well , You can actually map values to HashMap Of Key, Then you can go to O(1) Return results within the time complexity of , Extremely efficient . however HashMap The implementation also has disadvantages , For example, the proportion of storage capacity is high , Considering the existence of load factor , Usually space cannot be used up , And once you are worth a lot, such as hundreds of millions , that HashMap The amount of memory occupied becomes considerable .

For example, your data set is stored on a remote server , Local services accept input , The data set is too large to be read into memory at one time HashMap When , There will be problems .

Data structure of Bloom filter

The bloom filter is a bit Vector or bit Array , Long like this ：

If we want to map a value to a bloom filter , We need to use Multiple different hash functions Generate Multiple hash values , And for each generated hash value that points to bit Location 1, For example, for values “baidu” And three different hash functions generate hash values 1、4、7, Then the figure above changes to ：

Ok, Let's save another value now “tencent”, If the hash function returns 3、4、8 Words , Figure goes on to ：

It is worth noting that ,4 This bit Bit because the hash function for both values returns this bit position , So it's covered . Now if we want to check “dianping” Does this value exist , Hash function returned 1、5、8 Three values , As a result, we found that 5 This bit The value on bit is 0, It means that no value is mapped to this bit On a , So we can say for sure “dianping” This value does not exist . And when we need to query “baidu” If this value exists , The hash function must return 1、4、7, And then we checked and found these three bit All values on the bit are 1, So we can say “baidu” Does it exist ？ The answer is no , Can only be “baidu” This value may exist .

Why is that ？ Answer and simplicity , Because as the value increases, more and more , Be set to 1 Of bit More and more , Such a value “taobao” Even if it's not stored , But in case the hash function returns three bit Bits are set by other values 1 , Then the program will still judge “taobao” This value exists .

Do you support deletion

Thanks for the reminder in the comment area , Traditional Bloom filters don't support delete operations . But it's called Counting Bloom filter Can be used to test whether the number of element counts is absolutely less than a certain threshold , It supports element deletion . You can refer to the article Counting Bloom Filter Principle and implementation of

How to choose the number of hash functions and the length of Bloom filter

Obviously , Too small a bloon filter will soon all bit All positions are 1, Then any value queried will return “ Possible ”, Can't afford to filter . The length of the bloom filter directly affects the false alarm rate , The longer the bloom filter, the lower the false alarm rate .

in addition , The number of hash functions also needs to be weighed , The more the number, the more the bloom filter bit Position position 1 The faster , And the lower the efficiency of the bloom filter ; But if it's too little , Then our false alarm rate will be higher .

k Is the number of hash functions ,m Is the length of the bloom filter ,n Number of elements inserted for ,p Is the false positive rate

How to choose the right one for your business k and m Is it worth it , Here's a formula ：

How to deduce this formula is just a sentence , Because it doesn't make much sense for use , If you ask a high school student to push, you will push quickly .k One time hash function bit Bit is not set to 1 The probability of is ：

Insert n After three elements, it is still 0 The sum of the probabilities of is 1 The probabilities are ：

Indicates whether an element is required in the collection k Both positions are set to... As described above 1, However, this method may make the algorithm mistakenly think that an element that is not originally in the set is detected as in the set （False Positives）, The probability is determined by the following formula

Best practices

Common applications are , Use a bloom filter to reduce disks IO Or network request , Because once a value must not exist , We don't have to make subsequent expensive query requests .

in addition , Since you use the bloom filter to speed up the search and determine whether it exists , Then a hash function with low performance is not a good choice , recommend MurmurHash、Fnv these .

Big Value Split

Redis Because of its support setbit and getbit operation , And high pure memory performance , Therefore, natural can be used as Bloom filter . However, the improper use of Bloom filter is very easy to produce large Value, increase Redis Blocking risk , Therefore, it is recommended to split the bulky bloom filter in the generation environment .

There are many forms and methods of splitting , But the essence is not to Hash(Key) The subsequent requests are scattered in multiple small nodes of multiple nodes bitmap On , Instead, it should be split into multiple small bitmap after , To a Key All hash functions fall into this small bitmap On .

原网站

版权声明
本文为[Squat in the corner and count the ants]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/02/202202151557215302.html