1. Filter usage scenarios
Consider a few requirements:
1. We already have 1 billion numbers, and now 100,000 new numbers arrive. We need to judge, quickly and accurately, whether those 100,000 numbers are among the 1 billion.
Solution one: store the 1 billion numbers in a database and query it. Accuracy is there, but it will be slow.
Solution two: keep the 1 billion numbers in memory, for example in a Redis cache. The memory cost: 1 billion × 8 bytes = 8 GB. Memory lookups give both accuracy and speed, but spending roughly 8 GB of memory is a waste of space.
2. If you have worked on web crawlers, you will have met this need: with thousands of sites to crawl, how do we judge, for a newly seen URL, whether we have already crawled it?
The candidate solutions are still the two above, and they are clearly not great here either.
3. Spam filtering is the same kind of problem.
So for problems like these, with very large data sets, how do we judge accurately and quickly whether some data is in the set, without eating memory? This is the need that filters were born to meet.
2. The Bloom filter (Bloom Filter)
A Bloom filter is a data structure (a bitmap): a long binary vector, which you can think of as a binary array. Being binary, each cell holds only 0 or 1, and every cell starts out as 0.
A Bloom filter answers membership queries through an exists()-style check. When it says a value exists, the value only possibly exists; when it says a value does not exist, the value definitely does not exist. With typical sizing, the false positive rate is around 1%.
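The 1% figure is not fixed; it depends entirely on how the filter is sized. For a filter with m bits, k hash functions, and n inserted elements, the standard false-positive estimate is:

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k}
```

For example, allocating about 10 bits per element (m/n = 10) with the near-optimal k = (m/n) ln 2 ≈ 7 gives p ≈ 0.8%, which is where the "about 1%" rule of thumb comes from.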
Workflow - Adding an element
A Bloom filter consists of a bit array and a series of hash functions; the bit array starts out as all zeros.
The Bloom filter's working process is briefly described below, as shown in the figure:
When a key is added to the Bloom filter, the stored element value is hashed with several different hash functions, producing several hash values. Each hash value is reduced to an integer index by taking the remainder against the bit array's length, which yields a position in the array, and the bit at that position is set to 1. Each hash function computes a different position, and every corresponding position in the array is set to 1. This completes the add operation.
Workflow - Checking whether an element exists
When we need to check whether an element exists, the process is as follows: hash the given element again with the same functions to get the same bit-array positions as when it was added, then check whether every resulting position is 1. If any of them is 0, the element definitely does not exist; if all of them are 1, the element may exist.
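The add and check workflows above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation; the class name, the SHA-256-based salted hashing, and the default sizes are all choices made here for the example:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus k hash functions."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [0] * size  # every position starts at 0

    def _positions(self, key):
        # Derive num_hashes index values by salting one base hash,
        # then taking the remainder against the bit array's length.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1  # set each computed position to 1

    def might_contain(self, key):
        # Any 0 -> definitely absent; all 1 -> possibly present.
        return all(self.bits[pos] == 1 for pos in self._positions(key))
```

After `bf.add("some-url")`, `bf.might_contain("some-url")` is always True, while a never-added key almost certainly tests False (barring a false positive).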
Why only "may exist"
You may ask: why only "may exist"? The reason is simple: a position set to 1 may also have been set by the insertion of other elements. For example, element 1 and element 2 may both set the same position to 1 (as shown in figure 1). In that case we cannot conclude that element 1 is definitely present; this is the root cause of Bloom filter false positives.
Run new data through the same hash functions and check whether all of the corresponding positions are 1. If any one of them is not 1, we can say for certain that the new data is not in this Bloom filter.
Conversely, if every position computed by the hash functions is 1, can we be certain that the data is in this Bloom filter?
The answer is no, because many different inputs produce colliding hash results, so some of those positions may have been set to 1 by other data.
The conclusion: a Bloom filter can judge that some data is definitely absent, but it has no way to judge that data is definitely present.
Advantages and disadvantages of the Bloom filter
Advantages: they are obvious. It is a binary array, so it uses very little memory, and insertion and query are both fast.
Disadvantages: as the data grows, the false positive rate rises; it cannot tell you that data definitely exists; and, another important drawback, it cannot delete data.
There are three common deployment layouts:
1. Both the algorithm and the bitmap data live in the client.
2. The algorithm runs in the client; the bitmap data lives in Redis.
3. Both the algorithm and the bitmap data live in Redis.
The upgraded Bloom filter (Counting Bloom Filter)
The principle is to upgrade each bit of the bitmap to a counter (Counter). To add an element, increment each of its corresponding counters by 1; to remove an element, decrement each of its corresponding counters by 1. At the cost of several times more storage space, this buys the ability to delete.
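A minimal sketch of that upgrade, with illustrative names and sizes: each slot holds an integer counter instead of a bit, so removal is just a decrement.

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter sketch: counters instead of bits, so
    elements can be removed by decrementing."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, key):
        # Same salted-hash indexing as a plain Bloom filter.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.counters[pos] += 1  # +1 instead of setting a bit

    def remove(self, key):
        # Only decrement if the key currently tests as present;
        # otherwise counters could go negative for never-added keys.
        if self.might_contain(key):
            for pos in self._positions(key):
                self.counters[pos] -= 1

    def might_contain(self, key):
        return all(self.counters[pos] > 0 for pos in self._positions(key))
```

Note that a shared counter bumped by two elements sits at 2, so removing one of them still leaves the other reporting "possibly present".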
3. The cuckoo filter (Cuckoo filter)
To solve the Bloom filter's inability to delete elements, the authors of the paper "Cuckoo Filter: Practically Better Than Bloom" proposed the cuckoo filter. Compared with the cuckoo filter, the Bloom filter has the following shortcomings: weaker query performance, lower space efficiency, no support for the reverse operation (deletion), and no support for counting.
1. Query performance is weaker because a Bloom filter has to use multiple hash functions to probe multiple different points in the bitmap. Those points span a wide range of memory, which leads to a low CPU cache-line hit rate.
2. Space efficiency is lower because, at the same false positive rate, a cuckoo filter's space utilization is significantly higher than a Bloom filter's, saving roughly 40% of the space. On the other hand, a Bloom filter does not require the bitmap length to be a power of two, while a cuckoo filter does; from that angle, the Bloom filter's space scales more flexibly.
3. The lack of reverse deletion really hits the Bloom filter's weak spot. In a dynamic system, elements are always coming and going. A Bloom filter is like an imprint: whatever arrives leaves a trace, and even after it leaves, the trace cannot be wiped away. Suppose your system currently holds only 10 million elements, but hundreds of millions have flowed through it overall; the Bloom filter is helpless here, storing the imprints of those departed elements forever. As time passes the filter gets more and more crowded, until one day you find its false positive rate is too high and you have to rebuild it.
Cuckoo hashing
The simplest cuckoo hash structure is a one-dimensional array, with two hash functions mapping each new element to two positions in the array. If either of the two positions is empty, the element can go straight in. If both positions are full, it randomly kicks one occupant out and takes that position for itself.

p1 = hash1(x) % l
p2 = hash2(x) % l

Unlike real cuckoos, the cuckoo hash algorithm helps the victims (the squeezed-out eggs) find another nest. Since every element can live in either of two places, any free one of the two will do. So the evicted egg checks whether its other position is free; if it is empty, it moves over and everyone is happy. But what if that position is occupied by someone else? Fine, it plays "dog in the manger" once more and passes the victim role on. The new victim then repeats the process until every egg finds its nest.
The cuckoo hashing problem
But there is a problem: if the array is too crowded, the kicking can go on for hundreds of rounds without stopping, which seriously hurts insertion efficiency. Cuckoo hashing therefore sets a threshold: once a chain of consecutive evictions exceeds it, the array is considered nearly full. At that point it needs to be grown, and all elements repositioned.
There is another possible problem: an eviction cycle. For example, two different elements hash to exactly the same two positions; the two of them have no problem, one per position. But then a third element arrives whose hashed positions are that same pair, and clearly an eviction cycle appears. Still, the probability that three different elements land on the same two positions after both hashes is not high, unless your hash algorithm is truly bad.
Cuckoo hashing's attitude toward such a cycle is to treat it as "the array is too crowded" and grow the array (even though that is not actually the cause).
Optimization
The cuckoo hash algorithm above has a low average space utilization, perhaps only 50%. At that load, chains of consecutive evictions quickly exceed the threshold, so a hash algorithm like this has limited value and needs improvement.
One improvement is to add hash functions, so that each element has more than two nests: three nests, or four. This greatly reduces the probability of collision and raises space utilization to around 95%.
Another improvement is to hang multiple seats on each position of the array, so that even if two elements hash to the same position, neither has to play "dog in the manger": there are several seats here, sit wherever you like. Eviction is needed only once all the seats are taken, which also clearly reduces the number of evictions. This scheme reaches only about 85% space utilization, but query efficiency is very high: multiple seats share the same stretch of memory, which makes effective use of the CPU cache.
So a more efficient scheme integrates the two improvements above, for example 4 hash functions with 2 seats per position. This gains time efficiency and space efficiency at once; such a combination can even push space utilization up to 99%, which is excellent space efficiency.
The cuckoo filter
A cuckoo filter has the same structure as a cuckoo hash table: a one-dimensional array. But unlike a cuckoo hash table, which stores entire elements, the cuckoo filter stores only each element's fingerprint (a few bits, similar in spirit to a Bloom filter). Here the filter sacrifices data precision for space efficiency. And precisely because only fingerprints are stored, there is a false positive rate, just as with a Bloom filter.
First, a cuckoo filter uses only two hash functions, but each position can hold multiple seats. The choice of these two hash functions is special, because the filter can store only fingerprint information: when a fingerprint at some position is evicted, the filter must compute the element's dual position, and that computation would seem to require the element itself. Recall the earlier position formulas.
fp = fingerprint(x)
p1 = hash1(x) % l
p2 = hash2(x) % l

Knowing p1 and x's fingerprint, there is no way to compute p2 directly.
The special hash function
The clever part of the cuckoo filter is that it designs a unique hash function, so that p2 can be computed directly from p1 and the element's fingerprint, without needing the complete element x.
fp = fingerprint(x)
p1 = hash(x)
p2 = p1 ^ hash(fp)  // exclusive or

From the formula above, once we know fp and p1, we can compute p2 directly. Symmetrically, if we know p2 and fp, we can also compute p1 directly: p1 = p2 ^ hash(fp). This is the duality.
So we never need to know whether the current position is p1 or p2: XOR-ing the current position with hash(fp) yields the dual position. We only need to ensure hash(fp) != 0 to guarantee p1 != p2, so there is no dead loop where an element keeps kicking itself.
You might ask why this hash function does not take the result modulo the array length. Actually it does: the cuckoo filter forces the array length to be a power of two, so taking a value modulo the array length is the same as keeping the low n bits of the hash value. In the XOR computation, simply ignore all bits other than the low n bits; keeping the low n bits of the computed position p gives the final dual position.
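A small demonstration of the duality under these rules (power-of-two length, low-n-bit masking). The SHA-256/MD5 hash choices and the 8-bit fingerprint width are illustrative; the point is that XOR-ing with hash(fp) twice always lands back on the original position.

```python
import hashlib

# The table length is forced to be a power of two, so "mod length"
# is just keeping the low n bits (here n = 16).
L = 1 << 16
MASK = L - 1

def fingerprint(x, bits=8):
    # Short fingerprint of the element (hash choice is illustrative).
    return int(hashlib.sha256(x.encode()).hexdigest(), 16) & ((1 << bits) - 1)

def h(v):
    # Hash applied to both the element and its fingerprint.
    return int(hashlib.md5(str(v).encode()).hexdigest(), 16)

x = "https://example.com/page"
fp = fingerprint(x)
p1 = h(x) & MASK                  # primary position
p2 = (p1 ^ h(fp)) & MASK          # dual position: needs only p1 and fp
recovered = (p2 ^ h(fp)) & MASK   # XOR again recovers p1 (duality)
assert recovered == p1
```

Because XOR is its own inverse and the mask only drops high bits, the round trip holds for any key, which is exactly why the filter never needs to remember whether a stored fingerprint sits at p1 or p2.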









