当前位置：网站首页>Bitmap, bloom filter and hash sharding

Bitmap, bloom filter and hash sharding

2022-06-12 12:49:00 【Hello, classmate Feng】

List of articles

1、 Bitmap
2、 The bloon filter
3、 Hash shard

1、 Bitmap

1.1 The basic concept of bitmap

Bitmap is a very common data structure , In essence, it is a binary array . Bitmaps are similar to hash tables , Are mapped , But it's different . Each bit of the bitmap is used to represent a certain state of the data , For example, it exists or does not exist , It does not mean that the data itself . The hash table is used to store keywords key. Bitmap is more suitable for massive data processing and analysis .
The bitmap determines whether the data exists , There are two states , Being and not being , Then you can. Use a binary bit to represent information about whether data exists , If the binary bit is 1, On behalf of there , by 0 nonexistence , It is equivalent to the direct addressing method of hash table . such as ：
Insert picture description here

1.2 The practical application of bitmap

Quickly find out whether a data is in a collection
Sort
Find the intersection of two sets 、 And set etc.
Disk block mark in the operating system

to 40 Hundreds of millions of unsigned integers that don't repeat , No preorder . Give an unsigned integer , How to quickly judge whether a number is here 40 Of the billion
Method 1： Traverse , Time complexity O(N)
Method 2： Sort (O(Nlog₂N)), Using binary search : log₂N
Method 3: Bitmap resolution
Method 1 And methods 2 It's simple , At this point, I will not repeat too much .
Let's mainly look at the methods 3： First, we need to know that the maximum value of unsigned integer expression is 4294967295, Although there are 40 Million number , But most of them are repetitive , It will not exceed this maximum . In order to ensure that every number in the range of unsigned integers does not conflict , Bitmap needs 4294967295 bits , The final conversion is equivalent to need 511M Memory space , For today's ordinary computers , It is completely bearable . After the space is developed , For this 40 One by one mapping of 100 million numbers , Set the corresponding bit position to 1 that will do . As for how to operate , Look at the following .

1.3 Implementation of bitmap

template<size_t N>
class Biteset
{
    
public:

	Biteset()
	{
    
		_bits.resize(N / 8 + 1, 0);// Every time I open one more , For example, we need to express 0-16, need 17 A bit 
	}
	
	// take n The corresponding bit position is 1
	void set(size_t n)
	{
    
		size_t i = n / 8;
		size_t j = n % 8;
		_bits[i] |= (1 << j);// The corresponding position bit 1
	}

	// take n The corresponding bit position is 0
	void reset(size_t n)
	{
    
		size_t i = n / 8;
		size_t j = n % 8;
		_bits[i] &= (~(1 << j));// The corresponding position bit 0
	}

	// Judge whether a number exists 
	bool exist(size_t n)
	{
    
		size_t i = n / 8;
		size_t j = n % 8;
		return (_bits[i] & (1 << j));
	}


	std::vector<char> _bits;
};

because C++ The application space cannot be applied according to the number of bits , Only integer multiples of a single byte can be applied . So we apply for a type of char Of vector, every last char There are 8 A bit , That means 8 The existence state of the number .
In each member function ,i All means vector The number one in char( byte ),j Express this char( byte ) Bit number of the . This can be analyzed by drawing ：
Insert picture description here
Code testing ：

void test_bitset()
{
    
	Biteset<100> Bset;
	Bset.set(1);
	Bset.set(5);
	printf("%d\n", Bset._bits[0]);
	std::cout << Bset.exist(1) << std::endl;
	Bset.reset(2);
	std::cout << Bset.exist(2) << std::endl;
}

Insert picture description here
The above bitmap only applies to the presence or absence of two states , How does that mean multiple states ？ There is such a problem ： Given 100 Million integers , Design algorithm to find an integer that only appears once . So a certain number may appear 2 Time 、3 Times or more , At this point, you need to extend the bit .

template<size_t N>
class TowBiteSet
{
    
public:
	void Set(size_t n)
	{
    
		//00 ---- 01
		if (!_bit1.exist(n) && !_bit2.exist(n))
		{
    
			_bit2.set(n);
		}

		//01 ---- 10
		else if (!_bit1.exist(n) && _bit2.exist(n))
		{
    
			_bit1.set(n);
			_bit2.reset(n);
		}

		//10 It means two or more times , No need to deal with 
	}

	void PrintOnceNum()
	{
    
		for (size_t i = 0; i < N; ++i)
		{
    
			if (!_bit1.exist(i) && _bit2.exist(i))//01
				cout << i << " ";
		}
		cout << endl;
	}

private:
	fl::Biteset<N> _bit1;// Indicates the second digit 
	fl::Biteset<N> _bit2;// Means first 
};

If a number corresponds to 01, So when it reappears , You need to change the corresponding bit to 10, In fact, it is similar to carry operation .
Code testing ：

void test_towbitset()
{
    
	TowBiteSet<100> towset;
	int arr[] = {
     1, 5, 9, 6, 3, 25, 7, 8, 7, 8, 7, 9 };
	for (auto& e : arr)
	{
    
		towset.Set(e);
	}
	towset.PrintOnceNum();
}

Insert picture description here

2、 The bloon filter

2.1 What is a bloon filter

The bloom filter is made of bloom （Burton Howard Bloom） stay 1970 Put forward in A compact 、 A more ingenious probabilistic data structure , Features efficient insertion and query , It can be used to tell you “ Something must not exist or may exist ”, It uses multiple hash functions , Map a data into a bitmap structure . This way can not only Improve query efficiency , It can also save a lot of memory space .
Bitmaps can only be used to determine the status of integers , If you need to judge the status of decimals or strings ？ At this point, you need to use the bloom filter . Then how to achieve it ？ Roughly speaking, data is mapped to a single or multiple bits through a single or multiple hash functions .

2.2 Advantages and disadvantages of Bloom filter

advantage ：

The time complexity of adding and querying elements is :O(K), (K Is the number of hash functions , Generally small ), It has nothing to do with the amount of data
Hash functions have nothing to do with each other , Convenient hardware parallel operation
The bloom filter does not need to store the elements themselves , It has great advantages in some occasions with strict confidentiality requirements
When you can withstand a certain miscarriage of justice , Bloom filter has this great spatial advantage over other data structures
When the data volume is large , The bloom filter can be used to represent the complete set , Other data structures cannot
A bloom filter that uses the same set of hash functions can be interleaved 、 and 、 Difference operation

shortcoming ：

Misjudgment rate , That is, there is a false positive (False Position), That is, you can't accurately judge whether the element is in the set ( remedy ： Create another white list , Store data that may be misjudged )
Cannot get the element itself
In general, you can't remove elements from a bloom filter
If you delete by counting , There may be a counting loopback problem

2.3 The use of the bloon filter

When we use the news client to watch news , It will constantly recommend new content to us , Every time it's recommended, it has to be heavy , Get rid of what you've seen . The problem is coming. , News client recommendation system how to achieve push to heavy ？ The server records all the historical records that the user has seen , When the recommendation system recommends news, it will filter from each user's history , Filter out existing records , For such a scenario, the bloom filter is used . Similar to that ：

Email filtering , Use bloom filter to filter email blacklist
Filter the web address of the crawler , Those who have climbed no longer climb
Tiktok 、 Short Kwai , The video that has been swiped will not be swiped again
Registration time , You need to fill in the nickname , Determine whether the nickname is used by other users
Some databases have built-in Bloom filters , Used to determine whether data exists , It can reduce the of database IO request , To improve performance

2.4 The principle of Bloom filter

The bloom filter is actually a binary array plus multiple random hash functions . The data is mapped to the corresponding bits through hash function calculation . If the corresponding bit has been set to 1, Then the mapping to another bit is calculated by another hash function , But the number of hash functions is limited , therefore There will be some misjudgment when judging the status of data .
The following is a simple schematic diagram of Bloom filter ：
Insert picture description here

2.5 Misjudgment of Bloom filter

Suppose you now need to judge a data a And data b Whether there is , hypothesis a By hash function a,b,c Mapped to location 3、 Location 5 And location 13, The state of these three positions is existence , At this point, we think that the data a There is , But it doesn't really exist , This is the misjudgment of Bloom filter . hypothesis b By hash function a,b,c Mapped to location 7、 Location 8 And location 9, The status of these three positions is nonexistent , So the data b There must be no .
therefore ： There may be misjudgment about the existence of data , For data does not exist , There must be no miscarriage of justice

The misjudgment of Bloom filter is inevitable , However, the misjudgment rate can be reduced through a variety of means , For example, the length of the bloom filter , Increase the number of hash functions to map multiple locations . So how to balance the length of Bloom filter and the number of hash functions ？ Someone has given such a formula through calculation ：
Insert picture description here
k Is the number of hash functions ,m Is the length of the bloom filter ,n Number of elements inserted for ,p Is the false positive rate .
Although the factors that affect the misjudgment rate are 3 individual , But I think the most influential factor is the length of the bloan filter
Why is that , I have little knowledge , I don't know the secret .

2.6 Implementation of the bloom filter

#pragma once 
#include<bitset>
#include<string>
#include<iostream>
using namespace std;

// hash function 1
struct BKDRHash
{
    
	size_t operator()(const string& s)
	{
    
		//BKDR Hash 
		size_t value = 0;
		for (auto ch : s)
		{
    
			value *= 31;
			value += ch;
		}
		return value;
	}
};

// hash function 2
struct APHash{
    
	size_t operator()(const string& s)
	{
    
		size_t hash = 0;
		for (long i = 0; i < s.size(); i++)
		{
    
			if ((i & 1) == 0)
			{
    
				hash ^= ((hash << 7) ^ s[i] ^ (hash >> 3));
			}
			else
			{
    
				hash ^= (~((hash << 11) ^ s[i] ^ (hash >> 5)));
			}
		}
		return hash;
	}
};

// hash function 3
struct DJBHash{
    
	size_t operator()(const string& s)
	{
    
		size_t hash = 5381;
		for (auto ch : s)
		{
    
			hash += (hash << 5) + ch;
		}
		return hash;
	}
};

template<size_t N, 
class K = string,
class HashFunc1 = BKDRHash,
class HashFunc2 = APHash,
class HashFunc3 = DJBHash>
class BloomFilter
{
    
public:
	void Set(const K& key)
	{
    
		size_t len = 4 * N;
		size_t index1 = HashFunc1()(key) % len;
		size_t index1 = HashFunc2()(key) % len;
		size_t index1 = HashFunc3()(key) % len;
		_bs.set(index1);
		_bs.set(index2);
		_bs.set(index3);
	}

	// Determine if the data exists , There may be misjudgment 
	void Test(const K& key)
	{
    
		size_t len = 4 * N;
		size_t index1 = HashFunc1()(key) % len;
		if (_bs.test(index1) == false)
			return false;

		size_t index2 = HashFunc2()(key) % len;
		if (_bs.test(index2) == false)
			return false;

		size_t index3 = HashFunc3()(key) % len;
		if (_bs.test(index3) == false)
			return false;
		
		// When 3 Only when all the positions exist , But it may be misjudged 
		return true;
	}

private:
	bitset<4*N> _bs;// According to the previous formula, we can calculate 1 Two data correspond to 4 A bit 
};

We didn't write delete here , Why? ？ Please look at the chart below. ：

Insert picture description here

If you want to delete, it is not impossible . Each tag bit of the bloom filter is represented by multiple bits , Used to calculate how many data are mapped to this bit , Equivalent to reference count . But on the whole , It consumes more space , The advantages of the bloan filter are also reduced .
Insert picture description here

3、 Hash shard

Given two files A and B, There were 100 One hundred million query, We only have 1G Memory , How to find the intersection of two files ？ Please give the exact algorithm
If this 100 One hundred million query The size is 100 individual G, The two files add up to 200 individual G

Method 1： Divide each file into equal sized 200 Small file , Each small file accounts for 500M, take A
The small files in are and respectively B One by one to find the intersection of small files in , Then will ask 20100(200+199+…+2+1) Time , The amount of calculation is also quite large , The main reason is the same query It may be in different small files , For example, one in A1 in , In a B10 in .

Method 2： For the method 1 The situation of , You can use hash sharding . Will be similar to query In a small file with the same number , It just needs A1 and B1 Compare ,A2 and B2 Compare … We only need to compare 200 Next time .
Insert picture description here
But there is still a problem , A single file is too large , May exceed 1G, At this point, you need to change a hash function to segment , Until satisfaction no longer exceeds 1G until .