当前位置:网站首页>Explain in simple terms the bloom filter
Explain in simple terms the bloom filter
2022-06-22 19:36:00 【JackieZhengChina】
I've seen such a passage on the Internet before
Data structures are nothing different. They are like the bookshelves of your application where you can organize your data. Different data structures will give you different facility and benefits. To properly use the power and accessibility of the data structures you need to know the trade-offs of using one.
The main idea of this passage is : The data structure is no different , In applications, they are like bookshelves that can organize data , Different data structures will provide you with different convenience and benefits , You need to weigh your needs carefully and use them properly . The bloon filter Is the representative of practicing this sentence , So today, let's talk about bloom filter in simple terms (Bloom Filter).
The bloon filter (Bloom Filter)
As usual , Before learning a new knowledge , We need to understand his basic concepts
The bloon filter (Bloom Filter) yes 1970 Proposed by bron in . It's actually a very long binary vector and a series of random mapping functions . The bloom filter can be used to retrieve whether an element is in a collection . Its advantage is that the spatial efficiency and query time are much better than the general algorithm , The disadvantage is that it has certain error recognition rate and deletion difficulty .
We can see this sentence in the basic concept of Bloom filter , The bloom filter can be used to retrieve whether an element is in a collection . Some partners may ask : We just use HashMap Don't you just search for elements , Why use a bloom filter ?HashMap It can really help us realize this function , and HashMap The position of an element can be determined by one operation , It can also help us check whether an element is in a collection . Then let's think about another question : If there is... In this element set One billion A random number , How do we judge whether a certain number exists ?
If the element set contains a billion random numbers , Then first of all, it will bring us problems in space , Occupy by a random number 4 If the sub section is calculated , This billion random numbers will take up nearly 4G Storage space , Space consumption is very large , At this time, you need to use bloom filter (Bloom Filter) To help us solve this problem .
The basic idea of Bloom filter
As we mentioned above, the element set contains a billion random numbers , If you store these billions of elements directly, you need to take up a lot of space , So we definitely can't store elements directly . So how do we store it ? The storage method is also very ingenious , You can use digit groups , Do not store elements directly , It's the state of whether the storage element exists , This can save a lot of storage space ~( I don't know how the great gods came up with such a clever way )

Let's take a look at how the bloom filter determines whether an element is in a set
① Now there is a set and an initial state of 0 The number group of

② The elements in the collection pass through N A hash function calculates the position of the element in the array , And the corresponding position in the array 0 Change to 1

③ If you need to judge the element at this time X Whether there is , So the element X Will also pass through this N The operation of a hash function to obtain several positions in the array , If the values in several positions are 1, Then it proves that the element X It is likely to exist in the set , On the contrary, it proves that the element X It must not exist in the set . As shown in the figure below , This element X after N The values stored at the calculated positions of the hash elements are not all 1, Then prove that the element X It doesn't exist in the set .

The characteristics of the bloom filter
Above, we talked about the idea of Bloom filter , In the ③ There is such a sentence in the point : If the values in several positions are 1, Then it proves that the element X Probably Exist in and set . Why is it all 1 The situation is likely to exist , Not necessarily ? This is related to the characteristics of hash function ...
We all know that hash function is a function that can convert input data of any size into output data of a specific size ( The converted data is called Hashi value ), Hash functions also have the following two features :
- If the hash value obtained by the same hash function is different , Then the original input values of the two hash values must be different .
- If two hash values obtained from the same hash function are equal , The original input values of two hash values may be equal , It may not be equal .
This is similar to Java Of two objects in HashCode equal , But the object itself is not necessarily equal . To put it bluntly , The values of the projection points of the bit group calculated by the hash function are all 1, It doesn't have to be set when the variable to be queried was saved in before , It may also be the point mapped by other elements . This leads to a characteristic of Bloom filter : There is a certain miscalculation .
In the League of Heroes , You can fully trust bloom , But you have to be careful when writing code
So can we delete the elements in the digit group ? Obviously not , Because the same point on the bit array may have multiple input value mappings , If deleted, it will affect the judgment results of other elements in the bloom filter . This is another feature of Bloom filter : You cannot delete elements in the bloom filter .
So we can sum up the advantages and disadvantages of Bloom filter
advantage : In space and time , All have great advantages . Because it doesn't store complete data , Is a binary vector , It can save a lot of memory space , In terms of time complexity , Because the query is calculated according to the hash function , So suppose there is N A hash function , So time complexity is O(N); At the same time, when storing an element, it is not the element itself , It's a binary vector , Therefore, it has certain advantages in some scenes with strict confidentiality requirements .
shortcoming : There is a certain miscalculation ( The more elements stored in the bloom filter , The higher the miscalculation rate ); You cannot delete elements in the bloom filter , As the time goes by , Because... Cannot be deleted , More and more elements are stored in it , As a result, more and more memory is occupied , The misjudgment rate is getting higher and higher , Finally, I had to reset .
Application of bloon filter
We've all heard of “ Cache penetration ”, How should we solve cache penetration ? you 're right , It is through the bloom filter to solve this problem .
The problem of cache penetration is mainly due to the incoming Key Values in Redis There is no such thing as , Then it will be directly typed on the database , So as to increase the pressure on the database . In this case , Can be in Redis Add a bloom filter before , Add the data of... To the database in advance , In the query Redis First judge through the bloom filter Key Whether the value exists , If it doesn't exist, go straight back , If Key If the value exists , Then continue to execute according to the original process .
The solution to cache penetration is If the judgment result of Bloom filter is that it does not exist, it must not exist This feature , However, due to some misjudgment of Bloom filter , Therefore, cache penetration cannot be completely solved , But it can greatly alleviate the problem of cache penetration .
Analog implementation of Bloom filter
Finally, paste a piece of code , To simulate the implementation of Bloom filter
import java.util.BitSet;
/**
* @description: MyBloomFilter
* @author: Zhuang Ba .liziye
* @create: 2022-04-01 16:50
**/
public class MyBloomFilter {
/**
* A length of 10 Billion bits
*/
private static final int DEFAULT_SIZE = 256 << 22;
/**
* In order to reduce the error rate , Use addition hash Algorithm , So define a 8 A prime array of elements
*/
private static final int[] seeds = {3, 5, 7, 11, 13, 31, 37, 61};
/**
* It's equivalent to building 8 Different hash Algorithm
*/
private static HashFunction[] functions = new HashFunction[seeds.length];
/**
* Initialize the bloom filter bitmap, Positional array
*/
private static BitSet bitset = new BitSet(DEFAULT_SIZE);
/**
* Add data
*
* @param value Value to add
*/
public static void add(String value) {
if (value != null) {
for (HashFunction f : functions) {
// Calculation hash Value and modify bitmap The corresponding position in is true
bitset.set(f.hash(value), true);
}
}
}
/**
* Determine whether the corresponding element exists
* @param value The elements that need to be judged
* @return result
*/
public static boolean contains(String value) {
if (value == null) {
return false;
}
boolean ret = true;
for (HashFunction f : functions) {
ret = bitset.get(f.hash(value));
// One hash The function returns false And then jump out of the loop
if (!ret) {
break;
}
}
return ret;
}
/**
* test
*/
public static void main(String[] args) {
for (int i = 0; i < seeds.length; i++) {
functions[i] = new HashFunction(DEFAULT_SIZE, seeds[i]);
}
// add to 1 100 million elements
for (int i = 0; i < 100000000; i++) {
add(String.valueOf(i));
}
String id = "123456789";
add(id);
System.out.println(" Elements 123456789 Whether there is :" + contains(id));
System.out.println(" Elements 234567890 Whether there is :" + contains("234567890"));
}
}
class HashFunction {
private int size;
private int seed;
public HashFunction(int size, int seed) {
this.size = size;
this.seed = seed;
}
public int hash(String value) {
int result = 0;
int len = value.length();
for (int i = 0; i < len; i++) {
result = seed * result + value.charAt(i);
}
int r = (size - 1) & result;
return (size - 1) & result;
}
}
Copy code

author : He refused to cross the river
link :https://juejin.cn/post/7083294020323000356
source : Rare earth digs gold
The copyright belongs to the author . Commercial reprint please contact the author for authorization , Non-commercial reprint please indicate the source .
---------------------
author : Chestnut Shao
source :CSDN
original text :https://blog.csdn.net/weixin_38556197/article/details/124111236
Copyright notice : This is the author's original article , Please attach a link to the blog post !
Content analysis By:CSDN,CNBLOG One click reprint plugin for blog posts
边栏推荐
- 泡泡玛特:空洞的灵魂需要故事
- Detailed explanation of session mechanism and related applications of session
- 什么?HomeKit、米家、Aqara等生态也能通过智汀与天猫精灵生态联动?
- Aiops intelligent operation and maintenance experience sharing
- 数商云:解析B2B2C多用户商城系统架构设计思路,开启智能商城新时代
- Shell script explanation (VII) -- regular expression, sort, uniq, tr
- 贪心之区间问题(3)
- 最长公共子序列
- Shell programming specification and variables
- Programmer's tool encyclopedia [continuous update]
猜你喜欢

Digital business cloud: build a digital supply chain system to enable enterprises to optimize and upgrade their logistics supply chain

Typescript (7) generic

Digital commerce cloud: analyze the design idea of B2B2C multi-user mall system architecture, and open a new era of intelligent mall

Experiment 7 trigger

Agent model of structured model
组合学笔记(五)分配格中的链

A homekit enabled camera? Zhiting IPC camera IC1 unpacking experience

小波变换db4进行四层分解及其信号重构—matlab分析及C语言实现

一款支持HomeKit的摄像头?智汀 IPC摄像头IC1开箱体验

5g short message solution
随机推荐
有效的括号
插槽里如何判断text为数组
深入浅出聊布隆过滤器(Bloom Filter)
Shell script explanation (II) -- conditional test, if statement and case branch statement
Problems of different renderers running on the web with flutter2.0
K个一组翻转链表[链表拆解/翻转/拼装]
知识蒸馏之Focal and Global Knowledge Distillation for Detectors
远程访问及控制——SSH远程管理及TCP Wrappers 访问控制
Thread pool: reading the source code of threadpoolexcutor
贪心之区间问题(1)
实现领域驱动设计 - 使用ABP框架 - 解决方案概览
codeup最长回文子串
mysql数据库设计
84.(cesium篇)cesium模型在地形上运动
Notes on Combinatorics (V) chains in distributive lattice
Xintang nuc980 usage record: basic description of development environment preparation and compilation configuration
贪心之区间问题(4)
STM32 control matrix key, Hal library, cubemx configuration
Agent model of structured model
Programmer's tool encyclopedia [continuous update]