当前位置:网站首页>Understand bloomfilter in one article
Understand bloomfilter in one article
2022-07-04 12:40:00 【Red hat spongebob】
1 What is a bloon filter
The bloon filter (Bloom Filter) It was by bloon (Burton Howard Bloom) stay 1970 Put forward in . It's actually a very long binary vector and a series of random mapping functions , The bloom filter essentially consists of a length of m The bit vector of .
The basic bloom filter supports two operations : test and add to .
purpose : The bloom filter can be used to retrieve whether an element is in a collection .
Its advantage is that the space efficiency and query time are far more than the general algorithm , The disadvantage is that it has certain error recognition rate and deletion difficulty .
( Misidentification : The element doesn't exist but is , There will be no element in and not in , That is, it must be , No, it can't be )
2 Bloon filter and HashSet What's the difference
The bloon filter :
The basic idea : Through one Hash Function maps an element to a Bit array (Bit Array) A point in . thus , We just need to see if this point is 1 We'll know if it's in the set . This is the basic idea of bloon filter .
advantage : Small footprint , Efficient .
shortcoming : There is an error rate , Cannot delete element .
HashSet:
- The basic idea : utilize Hash The function converts the HashCode Take out , Put it in an internal array or linked list , When adding, deleting, modifying and checking elements, they are generally based on the elements HashCode To operate .
- advantage : Easy to use , The principle of simple .
- shortcoming : Large memory consumption , Low performance .
Sum up , Bloom filter is mainly applicable to the de duplication of a large amount of data , and HashSet Such data structures are suitable for the de duplication and inspection of small and medium-sized data .
3 Using the bloon filter
There are many ways to implement Bloom filters , such as :
- Guava Realization
- Redis Realization
- Realize it by yourself
- …
Here we use Google Guava Demonstrate the built-in bloom filter :
rely on :
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>23.0</version>
</dependency>
Code :
public class Test {
/** * Filter default capacity */
private static final int DEFAULT_SIZE = 1000000;
/** * Custom error rate , Default 0.03, Section (0,1) */
private static final double FPP = 0.02;
public static void main(String[] args) {
// Initialize a store string Bloom filters for data , The default error rate is 0.03
BloomFilter<String> bf = BloomFilter.create(Funnels.stringFunnel(Charsets.UTF_8), DEFAULT_SIZE, FPP);
// Used to store all actual key, For the existence of
Set<String> sets = new HashSet<>(DEFAULT_SIZE);
// Used to store all actual key, Used for removal
List<String> lists = new ArrayList<>(DEFAULT_SIZE);
// Insert a random string
for (int i = 0; i < DEFAULT_SIZE; i++) {
String uuid = UUID.randomUUID().toString();
// Add data
bf.put(uuid);
sets.add(uuid);
lists.add(uuid);
}
int rightNum = 0;
int wrongNum = 0;
for (int i = 0; i < 10000; i++) {
// 0-10000 Between , Can be 100 The number of dividers is 100 individual (100 Multiple )
String data = i % 100 == 0 ? lists.get(i / 100) : UUID.randomUUID().toString();
// This is used here. might, It doesn't look very confident , So if the bloom filter determines that it exists , We still have to go. sets Zhongshi hammer
if (bf.mightContain(data)) {
if (sets.contains(data)) {
rightNum++;
continue;
}
wrongNum++;
}
}
BigDecimal percent = new BigDecimal(wrongNum).divide(new BigDecimal(9900), 2, RoundingMode.HALF_UP);
BigDecimal bingo = new BigDecimal(9900 - wrongNum).divide(new BigDecimal(9900), 2, RoundingMode.HALF_UP);
System.out.println(" stay 100W Of the elements , Judge 100 It's an actual element , The bloon filter thinks there is :" + rightNum);
System.out.println(" stay 100W Of the elements , Judge 9900 An element that doesn't actually exist , Mistakenly believed to exist :" + wrongNum + ", shooting :" + bingo + ", Miscalculation rate :" + percent);
}
}
Running results :
[ Failed to transfer the external chain picture , The origin station may have anti-theft chain mechanism , It is suggested to save the pictures and upload them directly (img-e97FaPon-1644722500048)( Understand bloom filter (BloomFilter)].assets/image-20220212151410175.png)
principle :
The core principle of Bloom filter is to use digit group , Use elements many times Hash Map to bit , When judging whether it exists , Map the searched elements in the same way , When an element may have multiple bits , When every bit is 1 It is proved that the element exists , On the contrary, it does not exist .
demonstration :
# Storage elements :
abc ---> {2,3,4} All for 1
def ---> {5,6,7} All for 1
# Look for the element :
abc ---> {2,3,4} All for 1 1==1&&1==1&&1==1 --->true There is
aaa ---> {2,3,3} All for 1 1==1&&1==1&&0==1 --->false non-existent
# There is an error :
acf ---> {2,3,6} All for 1 1==1&&1==1&&1==1 --->true There is
4 Handwritten simple bloom filter
public class MyBloomFilter {
/** * Use addition hash Algorithm , So it defines a 3 A prime array of elements */
private static final int[] primes = new int[]{
2, 3, 5};
/** * Use eight different prime numbers , It's equivalent to building 8 Different algorithms */
private Hash[] hashList = new Hash[primes.length];
/** * Create a length of 10 Billion bits */
private BitSet bits = new BitSet(1000000000);
public MyBloomFilter() {
for (int i = 0; i < primes.length; i++) {
// Use 3 A prime number , establish 3 Species algorithm
hashList[i] = new Hash(primes[i]);
}
}
/** * Additive elements * @param value */
public void add(String value) {
for (Hash f : hashList) {
bits.set(f.hash(value), true);
}
}
/** * Determine whether it is in the bloom filter * @param value * @return */
public boolean contains(String value) {
if (value == null) {
return false;
}
boolean ret = true;
for (Hash f : hashList) {
// see 3 Values on bits
ret = ret && bits.get(f.hash(value));
}
return ret;
}
/** * Add hash Algorithm */
public static class Hash {
private int prime;
public Hash(int prime) {
this.prime = prime;
}
public int hash(String key) {
int hash, i;
for (hash = key.length(), i = 0; i < key.length(); i++) {
hash += key.charAt(i);
}
return (hash % prime);
}
}
public static void main(String[] args) {
MyBloomFilter bloom = new MyBloomFilter();
System.out.println(bloom.contains("2222"));
bloom.add("2222");
// maintain 1 Billion online users
for (int i = 1; i < 100000000; i++) {
bloom.add(String.valueOf(i));
}
long begin = System.currentTimeMillis();
System.out.println(begin);
System.out.println(bloom.contains("2222"));
long end = System.currentTimeMillis();
System.out.println(end);
System.out.println(" Judge 2222 Is there any use :" + (end - begin));
}
}
Running results :
Reference article :
https://blog.csdn.net/wuzhiwei549/article/details/106714765
https://www.jasondavies.com/bloomfilter/
https://www.cnblogs.com/liyulong1982/p/6013002.html
https://www.jianshu.com/p/7634eaea3e26
边栏推荐
- Communication tutorial | overview of the first, second and third generation can bus
- Fly tutorial 02 advanced functions of elevatedbutton (tutorial includes source code) (tutorial includes source code)
- [Yunju entrepreneurial foundation notes] Chapter II entrepreneur test 23
- Lecture 9
- Flet教程之 02 ElevatedButton高级功能(教程含源码)(教程含源码)
- In 2022, financial products are not guaranteed?
- How to use the mongodb ID array to get multiple documents- How to get multiple document using array of MongoDb id?
- Global and Chinese market of ice water machines 2022-2028: Research Report on technology, participants, trends, market size and share
- VIM, another program may be editing the same file If this is the solution of the case
- Langue C: trouver le nombre de palindromes dont 100 - 999 est un multiple de 7
猜你喜欢
[Yunju entrepreneurial foundation notes] Chapter II entrepreneur test 23
Flet教程之 按钮控件 ElevatedButton入门(教程含源码)
SAP ui5 date type sap ui. model. type. Analysis of the display format of date
Entitas learning [3] multi context system
Hongke case study on storm impact in coastal areas of North Carolina using lidar
VIM, another program may be editing the same file If this is the solution of the case
[Yunju entrepreneurial foundation notes] Chapter II entrepreneur test 19
C language array
The latest idea activation cracking tutorial, idea permanent activation code, the strongest in history
LVS load balancing cluster deployment - Dr direct routing mode
随机推荐
Detailed explanation of NPM installation and caching mechanism
Method of setting default items in C # ComboBox control code
[Yunju entrepreneurial foundation notes] Chapter II entrepreneur test 11
Googgle guava ImmutableCollections
Servlet learning notes
Wechat video Number launches "creator traffic package"
AI should take code agriculture? Deepmind offers a programming version of "Alpha dog" alphacode that surpasses nearly half of programmers!
[Yunju entrepreneurial foundation notes] Chapter II entrepreneur test 8
MySQL performance optimization index
Fastlane one click package / release app - usage record and stepping on pit
Guava ImmutableSet. Builder source code analysis, shift original code, complement code, reverse code review
17. Memory partition and paging
When synchronized encounters this thing, there is a big hole, pay attention!
How to use the mongodb ID array to get multiple documents- How to get multiple document using array of MongoDb id?
《天天数学》连载57:二月二十六日
C语言函数
Memory computing integration: AI chip architecture in the post Moorish Era
Global and Chinese market of cardiac monitoring 2022-2028: Research Report on technology, participants, trends, market size and share
[solve the error of this pointing in the applet] SetData of undefined
Map container