当前位置:网站首页>Understand bloomfilter in one article
Understand bloomfilter in one article
2022-07-04 12:40:00 【Red hat spongebob】
1 What is a bloon filter
The bloon filter (Bloom Filter) It was by bloon (Burton Howard Bloom) stay 1970 Put forward in . It's actually a very long binary vector and a series of random mapping functions , The bloom filter essentially consists of a length of m The bit vector of .
The basic bloom filter supports two operations : test and add to .
purpose : The bloom filter can be used to retrieve whether an element is in a collection .
Its advantage is that the space efficiency and query time are far more than the general algorithm , The disadvantage is that it has certain error recognition rate and deletion difficulty .
( Misidentification : The element doesn't exist but is , There will be no element in and not in , That is, it must be , No, it can't be )
2 Bloon filter and HashSet What's the difference
The bloon filter :
The basic idea : Through one Hash Function maps an element to a Bit array (Bit Array) A point in . thus , We just need to see if this point is 1 We'll know if it's in the set . This is the basic idea of bloon filter .
advantage : Small footprint , Efficient .
shortcoming : There is an error rate , Cannot delete element .
HashSet:
- The basic idea : utilize Hash The function converts the HashCode Take out , Put it in an internal array or linked list , When adding, deleting, modifying and checking elements, they are generally based on the elements HashCode To operate .
- advantage : Easy to use , The principle of simple .
- shortcoming : Large memory consumption , Low performance .
Sum up , Bloom filter is mainly applicable to the de duplication of a large amount of data , and HashSet Such data structures are suitable for the de duplication and inspection of small and medium-sized data .
3 Using the bloon filter
There are many ways to implement Bloom filters , such as :
- Guava Realization
- Redis Realization
- Realize it by yourself
- …
Here we use Google Guava Demonstrate the built-in bloom filter :
rely on :
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>23.0</version>
</dependency>
Code :
public class Test {
/** * Filter default capacity */
private static final int DEFAULT_SIZE = 1000000;
/** * Custom error rate , Default 0.03, Section (0,1) */
private static final double FPP = 0.02;
public static void main(String[] args) {
// Initialize a store string Bloom filters for data , The default error rate is 0.03
BloomFilter<String> bf = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), DEFAULT_SIZE, FPP);
// Used to store all actual key, For the existence of
Set<String> sets = new HashSet<>(DEFAULT_SIZE);
// Used to store all actual key, Used for removal
List<String> lists = new ArrayList<>(DEFAULT_SIZE);
// Insert a random string
for (int i = 0; i < DEFAULT_SIZE; i++) {
String uuid = UUID.randomUUID().toString();
// Add data
bf.put(uuid);
sets.add(uuid);
lists.add(uuid);
}
int rightNum = 0;
int wrongNum = 0;
for (int i = 0; i < 10000; i++) {
// 0-10000 Between , Can be 100 The number of dividers is 100 individual (100 Multiple )
String data = i % 100 == 0 ? lists.get(i / 100) : UUID.randomUUID().toString();
// This is used here. might, It doesn't look very confident , So if the bloom filter determines that it exists , We still have to go. sets Zhongshi hammer
if (bf.mightContain(data)) {
if (sets.contains(data)) {
rightNum++;
continue;
}
wrongNum++;
}
}
BigDecimal percent = new BigDecimal(wrongNum).divide(new BigDecimal(9900), 2, RoundingMode.HALF_UP);
BigDecimal bingo = new BigDecimal(9900 - wrongNum).divide(new BigDecimal(9900), 2, RoundingMode.HALF_UP);
System.out.println(" stay 100W Of the elements , Judge 100 It's an actual element , The bloon filter thinks there is :" + rightNum);
System.out.println(" stay 100W Of the elements , Judge 9900 An element that doesn't actually exist , Mistakenly believed to exist :" + wrongNum + ", shooting :" + bingo + ", Miscalculation rate :" + percent);
}
}
Running results :
[ Failed to transfer the external chain picture , The origin station may have anti-theft chain mechanism , It is suggested to save the pictures and upload them directly (img-e97FaPon-1644722500048)( Understand bloom filter (BloomFilter)].assets/image-20220212151410175.png)
principle :
The core principle of Bloom filter is to use digit group , Use elements many times Hash Map to bit , When judging whether it exists , Map the searched elements in the same way , When an element may have multiple bits , When every bit is 1 It is proved that the element exists , On the contrary, it does not exist .
demonstration :
# Storage elements :
abc ---> {2,3,4} All for 1
def ---> {5,6,7} All for 1
# Look for the element :
abc ---> {2,3,4} All for 1 1==1&&1==1&&1==1 --->true There is
aaa ---> {2,3,3} All for 1 1==1&&1==1&&0==1 --->false non-existent
# There is an error :
acf ---> {2,3,6} All for 1 1==1&&1==1&&1==1 --->true There is
4 Handwritten simple bloom filter
public class MyBloomFilter {
/** * Use addition hash Algorithm , So it defines a 3 A prime array of elements */
private static final int[] primes = new int[]{
2, 3, 5};
/** * Use eight different prime numbers , It's equivalent to building 8 Different algorithms */
private Hash[] hashList = new Hash[primes.length];
/** * Create a length of 10 Billion bits */
private BitSet bits = new BitSet(1000000000);
public MyBloomFilter() {
for (int i = 0; i < primes.length; i++) {
// Use 3 A prime number , establish 3 Species algorithm
hashList[i] = new Hash(primes[i]);
}
}
/** * Additive elements * @param value */
public void add(String value) {
for (Hash f : hashList) {
bits.set(f.hash(value), true);
}
}
/** * Determine whether it is in the bloom filter * @param value * @return */
public boolean contains(String value) {
if (value == null) {
return false;
}
boolean ret = true;
for (Hash f : hashList) {
// see 3 Values on bits
ret = ret && bits.get(f.hash(value));
}
return ret;
}
/** * Add hash Algorithm */
public static class Hash {
private int prime;
public Hash(int prime) {
this.prime = prime;
}
public int hash(String key) {
int hash, i;
for (hash = key.length(), i = 0; i < key.length(); i++) {
hash += key.charAt(i);
}
return (hash % prime);
}
}
public static void main(String[] args) {
MyBloomFilter bloom = new MyBloomFilter();
System.out.println(bloom.contains("2222"));
bloom.add("2222");
// maintain 1 Billion online users
for (int i = 1; i < 100000000; i++) {
bloom.add(String.valueOf(i));
}
long begin = System.currentTimeMillis();
System.out.println(begin);
System.out.println(bloom.contains("2222"));
long end = System.currentTimeMillis();
System.out.println(end);
System.out.println(" Judge 2222 Is there any use :" + (end - begin));
}
}
Running results :
Reference article :
https://blog.csdn.net/wuzhiwei549/article/details/106714765
https://www.jasondavies.com/bloomfilter/
https://www.cnblogs.com/liyulong1982/p/6013002.html
https://www.jianshu.com/p/7634eaea3e26
边栏推荐
- JD home programmers delete databases and run away. Talk about binlog, the killer of MySQL data backup
- Lvs+kept highly available cluster
- Openssl3.0 learning 20 provider KDF
- Ml and NLP are still developing rapidly in 2021. Deepmind scientists recently summarized 15 bright research directions in the past year. Come and see which direction is suitable for your new pit
- Ultimate bug finding method - two points
- Abnormal mode of ARM processor
- [Yunju entrepreneurial foundation notes] Chapter II entrepreneur test 14
- C語言函數
- [Yunju entrepreneurial foundation notes] Chapter II entrepreneur test 5
- DVC use case (VI): Data Registry
猜你喜欢
Flet教程之 02 ElevatedButton高级功能(教程含源码)(教程含源码)
I want to talk about yesterday
[Yunju entrepreneurial foundation notes] Chapter II entrepreneur test 10
French Data Protection Agency: using Google Analytics or violating gdpr
[Yunju entrepreneurial foundation notes] Chapter II entrepreneur test 24
R language -- readr package reads and writes data
The frost peel off the purple dragon scale, and the xiariba people will talk about database SQL optimization and the principle of indexing (primary / secondary / clustered / non clustered)
Complementary knowledge of auto encoder
Awk getting started to proficient series - awk quick start
Servlet learning notes
随机推荐
BCD code Baidu Encyclopedia
Clion configuration of opencv
MySQL advanced review
Experiment 7. IPv6
'using an alias column in the where clause in PostgreSQL' - using an alias column in the where clause in PostgreSQL
Lecture 9
[Yunju entrepreneurial foundation notes] Chapter II entrepreneur test 7
[ES6] template string: `string`, a new symbol in es2015
Data communication and network: ch13 Ethernet
Globalsign's SSL certificate products
17.内存分区与分页
Review of week 278 of leetcode II
Entity framework calls Max on null on records - Entity Framework calling Max on null on records
World document to picture
Star leap plan | new projects are continuously being recruited! MSR Asia MSR Redmond joint research program invites you to apply!
16.内存使用与分段
Awk getting started to proficient series - awk quick start
Introduction to the button control elevatedbutton of the fleet tutorial (the tutorial includes the source code)
[Android reverse] function interception instance (③ refresh CPU cache | ④ process interception function | ⑤ return specific results)
Snowflake won the 2021 annual database