当前位置:网站首页>Redis bloom filter
Redis bloom filter
2022-07-02 12:37:00 【Squat in the corner and count the ants】
What is a bloon filter
In essence, the bloom filter is a data structure , A more ingenious probabilistic data structure (probabilistic data structure), Features efficient insertion and query , It can be used to tell you “ Something must not exist or may exist ”.
Compared to traditional List、Set、Map And so on , It's more efficient 、 Take up less space , But the disadvantage is that the returned result is probabilistic , Not exactly .
Realization principle
HashMap The problem of
Before we talk about the principle of Bloom filter , Let's think about , Usually you judge whether an element exists by what ? Many people should answer HashMap Well , You can actually map values to HashMap Of Key, Then you can go to O(1) Return results within the time complexity of , Extremely efficient . however HashMap The implementation also has disadvantages , For example, the proportion of storage capacity is high , Considering the existence of load factor , Usually space cannot be used up , And once you are worth a lot, such as hundreds of millions , that HashMap The amount of memory occupied becomes considerable .
For example, your data set is stored on a remote server , Local services accept input , The data set is too large to be read into memory at one time HashMap When , There will be problems .
Data structure of Bloom filter
The bloom filter is a bit Vector or bit Array , Long like this :
If we want to map a value to a bloom filter , We need to use Multiple different hash functions Generate Multiple hash values , And for each generated hash value that points to bit Location 1, For example, for values “baidu” And three different hash functions generate hash values 1、4、7, Then the figure above changes to :
Ok, Let's save another value now “tencent”, If the hash function returns 3、4、8 Words , Figure goes on to :
It is worth noting that ,4 This bit Bit because the hash function for both values returns this bit position , So it's covered . Now if we want to check “dianping” Does this value exist , Hash function returned 1、5、8 Three values , As a result, we found that 5 This bit The value on bit is 0, It means that no value is mapped to this bit On a , So we can say for sure “dianping” This value does not exist . And when we need to query “baidu” If this value exists , The hash function must return 1、4、7, And then we checked and found these three bit All values on the bit are 1, So we can say “baidu” Does it exist ? The answer is no , Can only be “baidu” This value may exist .
Why is that ? Answer and simplicity , Because as the value increases, more and more , Be set to 1 Of bit More and more , Such a value “taobao” Even if it's not stored , But in case the hash function returns three bit Bits are set by other values 1 , Then the program will still judge “taobao” This value exists .
Do you support deletion
Thanks for the reminder in the comment area , Traditional Bloom filters don't support delete operations . But it's called Counting Bloom filter Can be used to test whether the number of element counts is absolutely less than a certain threshold , It supports element deletion . You can refer to the article Counting Bloom Filter Principle and implementation of
How to choose the number of hash functions and the length of Bloom filter
Obviously , Too small a bloon filter will soon all bit All positions are 1, Then any value queried will return “ Possible ”, Can't afford to filter . The length of the bloom filter directly affects the false alarm rate , The longer the bloom filter, the lower the false alarm rate .
in addition , The number of hash functions also needs to be weighed , The more the number, the more the bloom filter bit Position position 1 The faster , And the lower the efficiency of the bloom filter ; But if it's too little , Then our false alarm rate will be higher .
k Is the number of hash functions ,m Is the length of the bloom filter ,n Number of elements inserted for ,p Is the false positive rate
How to choose the right one for your business k and m Is it worth it , Here's a formula :
How to deduce this formula is just a sentence , Because it doesn't make much sense for use , If you ask a high school student to push, you will push quickly .k One time hash function bit Bit is not set to 1 The probability of is :
Insert n After three elements, it is still 0 The sum of the probabilities of is 1 The probabilities are :
Indicates whether an element is required in the collection k Both positions are set to... As described above 1, However, this method may make the algorithm mistakenly think that an element that is not originally in the set is detected as in the set (False Positives), The probability is determined by the following formula
Best practices
Common applications are , Use a bloom filter to reduce disks IO Or network request , Because once a value must not exist , We don't have to make subsequent expensive query requests .
in addition , Since you use the bloom filter to speed up the search and determine whether it exists , Then a hash function with low performance is not a good choice , recommend MurmurHash、Fnv these .
Big Value Split
Redis Because of its support setbit and getbit operation , And high pure memory performance , Therefore, natural can be used as Bloom filter . However, the improper use of Bloom filter is very easy to produce large Value, increase Redis Blocking risk , Therefore, it is recommended to split the bulky bloom filter in the generation environment .
There are many forms and methods of splitting , But the essence is not to Hash(Key) The subsequent requests are scattered in multiple small nodes of multiple nodes bitmap On , Instead, it should be split into multiple small bitmap after , To a Key All hash functions fall into this small bitmap On .
边栏推荐
- (C language) input a line of characters and count the number of English letters, spaces, numbers and other characters.
- 线性DP AcWing 895. 最长上升子序列
- 1380. Lucky numbers in the matrix [two-dimensional array, matrix]
- Sub thread get request
- MySQL and PostgreSQL methods to grab slow SQL
- PR 2021 quick start tutorial, learn about the and functions of the timeline panel
- Anti shake throttle
- kubeadm join时出现错误:[ERROR Port-10250]: Port 10250 is in use [ERROR FileAvailable--etc-kubernetes-pki
- 模块化 CommonJS ES Module
- drools中then部分的写法
猜你喜欢
随机推荐
Docker compose configuration mysql, redis, mongodb
Experiment of connecting mobile phone hotspot based on Arduino and esp8266 (successful)
线性DP AcWing 896. 最长上升子序列 II
FBX import under ue4/ue5 runtime
spfa AcWing 852. spfa判断负环
2.6 using recursion and stack - [tower of Hanoi problem]
AI中台技术调研
MySQL与PostgreSQL抓取慢sql的方法
Is the neural network (pinn) with embedded physical knowledge a pit?
MySQL indexes and transactions
Less than three months after the programmer was hired, the boss wanted to launch the app within one month. If he was dissatisfied, he was dismissed immediately
Fluent fluent library encapsulation
Why do programmers have the idea that code can run without moving? Is it poisonous? Or what?
Distributed machine learning framework and high-dimensional real-time recommendation system
Go学习笔记—多线程
深拷貝 事件總線
Redis avalanche, penetration, breakdown
In development, why do you find someone who is paid more than you but doesn't write any code?
How to write a pleasing English mathematical paper
OpenCV中cv2.VideoWriter_fourcc()函数和cv2.VideoWriter()函数的结合使用