当前位置:网站首页>Redis bloom filter
Redis bloom filter
2022-07-02 12:37:00 【Squat in the corner and count the ants】
What is a bloon filter
In essence, the bloom filter is a data structure , A more ingenious probabilistic data structure (probabilistic data structure), Features efficient insertion and query , It can be used to tell you “ Something must not exist or may exist ”.
Compared to traditional List、Set、Map And so on , It's more efficient 、 Take up less space , But the disadvantage is that the returned result is probabilistic , Not exactly .
Realization principle
HashMap The problem of
Before we talk about the principle of Bloom filter , Let's think about , Usually you judge whether an element exists by what ? Many people should answer HashMap Well , You can actually map values to HashMap Of Key, Then you can go to O(1) Return results within the time complexity of , Extremely efficient . however HashMap The implementation also has disadvantages , For example, the proportion of storage capacity is high , Considering the existence of load factor , Usually space cannot be used up , And once you are worth a lot, such as hundreds of millions , that HashMap The amount of memory occupied becomes considerable .
For example, your data set is stored on a remote server , Local services accept input , The data set is too large to be read into memory at one time HashMap When , There will be problems .
Data structure of Bloom filter
The bloom filter is a bit Vector or bit Array , Long like this :

If we want to map a value to a bloom filter , We need to use Multiple different hash functions Generate Multiple hash values , And for each generated hash value that points to bit Location 1, For example, for values “baidu” And three different hash functions generate hash values 1、4、7, Then the figure above changes to :

Ok, Let's save another value now “tencent”, If the hash function returns 3、4、8 Words , Figure goes on to :

It is worth noting that ,4 This bit Bit because the hash function for both values returns this bit position , So it's covered . Now if we want to check “dianping” Does this value exist , Hash function returned 1、5、8 Three values , As a result, we found that 5 This bit The value on bit is 0, It means that no value is mapped to this bit On a , So we can say for sure “dianping” This value does not exist . And when we need to query “baidu” If this value exists , The hash function must return 1、4、7, And then we checked and found these three bit All values on the bit are 1, So we can say “baidu” Does it exist ? The answer is no , Can only be “baidu” This value may exist .
Why is that ? Answer and simplicity , Because as the value increases, more and more , Be set to 1 Of bit More and more , Such a value “taobao” Even if it's not stored , But in case the hash function returns three bit Bits are set by other values 1 , Then the program will still judge “taobao” This value exists .
Do you support deletion
Thanks for the reminder in the comment area , Traditional Bloom filters don't support delete operations . But it's called Counting Bloom filter Can be used to test whether the number of element counts is absolutely less than a certain threshold , It supports element deletion . You can refer to the article Counting Bloom Filter Principle and implementation of
How to choose the number of hash functions and the length of Bloom filter
Obviously , Too small a bloon filter will soon all bit All positions are 1, Then any value queried will return “ Possible ”, Can't afford to filter . The length of the bloom filter directly affects the false alarm rate , The longer the bloom filter, the lower the false alarm rate .
in addition , The number of hash functions also needs to be weighed , The more the number, the more the bloom filter bit Position position 1 The faster , And the lower the efficiency of the bloom filter ; But if it's too little , Then our false alarm rate will be higher .

k Is the number of hash functions ,m Is the length of the bloom filter ,n Number of elements inserted for ,p Is the false positive rate
How to choose the right one for your business k and m Is it worth it , Here's a formula :

How to deduce this formula is just a sentence , Because it doesn't make much sense for use , If you ask a high school student to push, you will push quickly .k One time hash function bit Bit is not set to 1 The probability of is :

Insert n After three elements, it is still 0 The sum of the probabilities of is 1 The probabilities are :
Indicates whether an element is required in the collection k Both positions are set to... As described above 1, However, this method may make the algorithm mistakenly think that an element that is not originally in the set is detected as in the set (False Positives), The probability is determined by the following formula

Best practices
Common applications are , Use a bloom filter to reduce disks IO Or network request , Because once a value must not exist , We don't have to make subsequent expensive query requests .
in addition , Since you use the bloom filter to speed up the search and determine whether it exists , Then a hash function with low performance is not a good choice , recommend MurmurHash、Fnv these .
Big Value Split
Redis Because of its support setbit and getbit operation , And high pure memory performance , Therefore, natural can be used as Bloom filter . However, the improper use of Bloom filter is very easy to produce large Value, increase Redis Blocking risk , Therefore, it is recommended to split the bulky bloom filter in the generation environment .
There are many forms and methods of splitting , But the essence is not to Hash(Key) The subsequent requests are scattered in multiple small nodes of multiple nodes bitmap On , Instead, it should be split into multiple small bitmap after , To a Key All hash functions fall into this small bitmap On .
边栏推荐
- The second composition template of postgraduate entrance examination English / chart composition, English chart composition is enough
- 防抖 节流
- AI中台技术调研
- Docker-compose配置Mysql,Redis,MongoDB
- Win10 system OmniPeek wireless packet capturing network card driver failed to install due to digital signature problem solution
- Shuttle encapsulated AppBar
- 分布式机器学习框架与高维实时推荐系统
- mysql数据库基础
- Less than three months after the programmer was hired, the boss wanted to launch the app within one month. If he was dissatisfied, he was dismissed immediately
- Lekao.com: experience sharing of junior economists and previous candidates in customs clearance
猜你喜欢

使用Sqoop把ADS层数据导出到MySQL

Brush questions --- binary tree --2
![[C language] convert decimal numbers to binary numbers](/img/9b/1848b68b95d98389ed985c83f2e856.png)
[C language] convert decimal numbers to binary numbers

This "little routine" is set on the dough cake of instant noodles. No wonder programmers are always hungry

高性能纠删码编码

MySQL and PostgreSQL methods to grab slow SQL

Sweetheart leader: Wang Xinling

Initial JDBC programming

区间DP AcWing 282. 石子合并

Floyd AcWing 854. Floyd求最短路
随机推荐
Drools terminates the execution of other rules after executing one rule
2.6 using recursion and stack - [tower of Hanoi problem]
Go learning notes - go based interprocess communication
深拷贝 事件总线
kubeadm join时出现错误:[ERROR Port-10250]: Port 10250 is in use [ERROR FileAvailable--etc-kubernetes-pki
Deep understanding of P-R curve, ROC and AUC
考研英语二大作文模板/图表作文,英语图表作文这一篇就够了
Sort---
[old horse of industrial control] detailed explanation of Siemens PLC TCP protocol
Go学习笔记—多线程
线性DP AcWing 895. 最长上升子序列
Fluent fluent library encapsulation
VLAN experiment
高性能纠删码编码
Go学习笔记—基于Go的进程间通信
What data types does redis have and their application scenarios
使用Sqoop把ADS层数据导出到MySQL
LeetCode—剑指 Offer 51. 数组中的逆序对
线性DP AcWing 902. 最短编辑距离
Lekao.com: experience sharing of junior economists and previous candidates in customs clearance