当前位置:网站首页>Redis bloom filter
Redis bloom filter
2022-07-23 15:07:00 【Super code meow】
1. What is a bloon filter
The bloon filter (Bloom Filter) yes 1970 Proposed by bron in . It's actually a very long binary vector and a series of random mapping functions . The bloom filter can be used to retrieve whether an element is in a collection . Its advantage is that the spatial efficiency and query time are much better than the general algorithm , The disadvantage is that it has certain error recognition rate and deletion difficulty .
2. hash function
Bloom filter is inseparable from hash function , Therefore, it is necessary to introduce the concept of hash function , Properties of hash function :
- Classic hash functions have an infinite input range ( infinity ).
- The output fields of classic hash functions are fixed ( Poor and big , Suppose the output field is S).
- When the same value is passed into the hash function , The return value must be the same .
- When different input values are passed into the hash function , The return value may be the same , It may not be the same .
- The input values will be distributed as evenly as possible in S On .
The first three points are the basis of hash function , The fourth point describes the phenomenon of hash collision in hash function , Because the input field is infinite , The output field is finite , It's inevitable , There will be different values in the input field corresponding to the input field S in . The fifth thing is the key to evaluate a hash function , The better the hash function , The more uniform the distribution is, and it has nothing to do with the law of the occurrence of the input value . Like being “hash1”,“hash2”,“hash3” The three input values are similar , The result of hash function calculation should be very different , You can use the common MD5 and SHA1 Algorithm to verify these characteristics . If an excellent function can achieve different input values, the return values can be evenly distributed in S in , Return it to the value pair m Remainder (%m), The returned values obtained can also be considered to be evenly distributed in 0~m-1 position .
3. Principle of bloon filter
The bloom filter is a bit Vector or bit Array , As shown in the figure below :
If we want to map a value to a bloom filter , We need to use multiple different hash functions to generate multiple hash values , And for each generated hash value that points to bit Location 1, For example, for values “baidu” And three different hash functions generate hash values 1、4、7, Then the figure above changes to :
Ok, Let's save another value now “ali”, If the hash function returns 3、4、8 Words , Figure goes on to :
It is worth noting that ,4 This bit Bit because the hash function for both values returns this bit position , So it's covered . Now if we want to check “dianping” Does this value exist , Hash function returned 1、5、8 Three values , As a result, we found that 5 This bit The value on bit is 0, It means that no value is mapped to this bit On a , So we can say for sure “dianping” This value does not exist . And when we need to query “baidu” If this value exists , The hash function must return 1、4、7, And then we checked and found these three bit All values on the bit are 1, So we can say “baidu” Does it exist ? The answer is no , Can only be “baidu” This value may exist .
Why is that ? Answer and simplicity , Because as the value increases, more and more , Be set to 1 Of bit More and more , Such a value “taobao” Even if it's not stored , But in case the hash function returns three bit Bits are set by other values 1 , Then the program will still judge “taobao” This value exists .
4. Blum filter applications
For example, a website has 20 Billion url There's a blacklist , How to save this blacklist ? If you input any url, How do you quickly judge the url Is it on the blacklist ? And in a given memory space ( such as :500M) To judge quickly from within .
Maybe the first thing that many people think of is using HashSet, because HashSet be based on HashMap, Theoretically, the time complexity is :O(1). Achieved a quick goal , But the complexity of space ?URL Strings are passed through Hash Get one Integer Value ,Integer Occupy 4 Bytes , that 20 One hundred million URL In theory, we need :
20 Billion *4/1024/1024/1024=7.45G Of memory , Does not meet the requirements of space complexity . Here is the introduction of this article “ The bloon filter ”.
If you want to determine whether an element is in a set , The general idea is to save all the elements in the collection , And then through comparison to determine . Linked list 、 Trees 、 Hash table ( Also called hash table ,Hash table) And so on, the data structure is this way of thinking , The storage location is either disk , Or memory . Most of the time, it is either time for space , Or trade space for time .
In the case of strict response time requirements , If we exist inside , So as the number of elements in the set increases , We need more and more storage space , And the retrieval time is getting longer and longer , This leads to too much memory overhead 、 Time efficiency becomes low .
At this time, the problem to be considered and solved is , In the case of large amount of data , Both meet the time requirements , And meet the requirements of space . That is, we need a data structure and algorithm that consumes less time and space .Bloom Filter It's a solution .
5. Features of bloon filter
- Because of using hash judgment , Time efficiency is very high . Space efficiency is also a big advantage .
- There is a possibility of miscalculation , It needs to be used for specific scenarios .
- Because hash collisions cannot be resolved , So it's not very good to delete .
6. Use scenarios
The great use of the bloon filter is , Can quickly determine whether an element is in a set . Its common use scenarios are as follows :
- The blacklist : anti-spam , Judging whether a mailbox is spam or not from billions of spam lists ( Empathy , Spam messages )
- URL duplicate removal : Web crawler right URL De duplication of , Avoid climbing the same URL Address
- Word spell check
- Key-Value Cache system Key check ( Cache penetration ) : Cache penetration , Put all possible data cache in the bloom filter , When hackers access the nonexistent cache, they can quickly return to avoid cache and DB Hang up .
- ID check , For example, the order system queries an order ID Whether there is , If it doesn't exist, go straight back .
7. Best practices
Common applications are , Use a bloom filter to reduce disks IO Or network request , Because once a value must not exist , We don't have to make subsequent expensive query requests .
in addition , Since you use the bloom filter to speed up the search and determine whether it exists , Then a hash function with low performance is not a good choice , recommend MurmurHash、Fnv these .
8. Big Value Split
Redis Because of its support setbit and getbit operation , And high pure memory performance , Therefore, natural can be used as Bloom filter . However, the improper use of Bloom filter is very easy to produce large Value, increase Redis Blocking risk , Therefore, it is recommended to split the bulky bloom filter in the generation environment .
There are many forms and methods of splitting , But the essence is not to Hash(Key) The subsequent requests are scattered in multiple small nodes of multiple nodes bitmap On , Instead, it should be split into multiple small bitmap after , To a Key All hash functions fall into this small bitmap On .
边栏推荐
猜你喜欢

Live classroom system 03 supplement model class and entity

leetcode: 17. 电话号码的字母组合

@FeignClient使用详细教程(图解)

基于matlab的CBOC信号调制解调仿真,输出其相关性,功率谱以及频偏跟踪

Simulink simulation of ESP three-phase SVPWM controller

颜值爆表 Redis官方可视化工具来啦,针不戳

头部姿态估计原理及可视化_loveliuzz的博客-程序员宅基地_头部姿态估计

OpenCV计算外包矩形

Live classroom system 03 model class and entity

Supervisor installation and use
随机推荐
leetcode-设置交集大小至少为2
Zhongwang CAD professional 2022 software installation package download and installation tutorial
Regular expression common syntax parsing
粒子边界碰撞的处理
力扣-单调栈
基于PSO优化的多目标最优值求解matlab仿真
Russia hopes to effectively implement the "package" agreement on the export of agricultural products
【启发式分治】启发式合并的逆思想
uniapp实现横向点击滑动菜单
动态规划-力扣
turbo编译码误码率性能matlab仿真
读写锁ReadWriteLock还是不够快?再试试S…
Prometheus入门使用(三)
FastAPI应用加入Nacos
linux定时备份数据库脚本
【无标题】
Multiple backpacks!
QT document reading notes audio example analysis
MySQL 常用命令
Matlab simulation of solving multi-objective optimal value based on PSO optimization