当前位置:网站首页>Bitmap, bloom filter and hash sharding
Bitmap, bloom filter and hash sharding
2022-06-12 12:49:00 【Hello, classmate Feng】
List of articles
1、 Bitmap
1.1 The basic concept of bitmap
Bitmap is a very common data structure , In essence, it is a binary array . Bitmaps are similar to hash tables , Are mapped , But it's different . Each bit of the bitmap is used to represent a certain state of the data , For example, it exists or does not exist , It does not mean that the data itself . The hash table is used to store keywords key. Bitmap is more suitable for massive data processing and analysis .
The bitmap determines whether the data exists , There are two states , Being and not being , Then you can. Use a binary bit to represent information about whether data exists , If the binary bit is 1, On behalf of there , by 0 nonexistence , It is equivalent to the direct addressing method of hash table . such as :
1.2 The practical application of bitmap
- Quickly find out whether a data is in a collection
- Sort
- Find the intersection of two sets 、 And set etc.
- Disk block mark in the operating system
to 40 Hundreds of millions of unsigned integers that don't repeat , No preorder . Give an unsigned integer , How to quickly judge whether a number is here 40 Of the billion Method 1: Traverse , Time complexity O(N) Method 2: Sort (O(Nlog2N)), Using binary search : log2N Method 3: Bitmap resolution
Method 1 And methods 2 It's simple , At this point, I will not repeat too much .
Let's mainly look at the methods 3: First, we need to know that the maximum value of unsigned integer expression is 4294967295, Although there are 40 Million number , But most of them are repetitive , It will not exceed this maximum . In order to ensure that every number in the range of unsigned integers does not conflict , Bitmap needs 4294967295 bits , The final conversion is equivalent to need 511M Memory space , For today's ordinary computers , It is completely bearable . After the space is developed , For this 40 One by one mapping of 100 million numbers , Set the corresponding bit position to 1 that will do . As for how to operate , Look at the following .
1.3 Implementation of bitmap
template<size_t N>
class Biteset
{
public:
Biteset()
{
_bits.resize(N / 8 + 1, 0);// Every time I open one more , For example, we need to express 0-16, need 17 A bit
}
// take n The corresponding bit position is 1
void set(size_t n)
{
size_t i = n / 8;
size_t j = n % 8;
_bits[i] |= (1 << j);// The corresponding position bit 1
}
// take n The corresponding bit position is 0
void reset(size_t n)
{
size_t i = n / 8;
size_t j = n % 8;
_bits[i] &= (~(1 << j));// The corresponding position bit 0
}
// Judge whether a number exists
bool exist(size_t n)
{
size_t i = n / 8;
size_t j = n % 8;
return (_bits[i] & (1 << j));
}
std::vector<char> _bits;
};
because C++ The application space cannot be applied according to the number of bits , Only integer multiples of a single byte can be applied . So we apply for a type of char Of vector, every last char There are 8 A bit , That means 8 The existence state of the number .
In each member function ,i All means vector The number one in char( byte ),j Express this char( byte ) Bit number of the . This can be analyzed by drawing :
Code testing :
void test_bitset()
{
Biteset<100> Bset;
Bset.set(1);
Bset.set(5);
printf("%d\n", Bset._bits[0]);
std::cout << Bset.exist(1) << std::endl;
Bset.reset(2);
std::cout << Bset.exist(2) << std::endl;
}

The above bitmap only applies to the presence or absence of two states , How does that mean multiple states ? There is such a problem : Given 100 Million integers , Design algorithm to find an integer that only appears once . So a certain number may appear 2 Time 、3 Times or more , At this point, you need to extend the bit .
template<size_t N>
class TowBiteSet
{
public:
void Set(size_t n)
{
//00 ---- 01
if (!_bit1.exist(n) && !_bit2.exist(n))
{
_bit2.set(n);
}
//01 ---- 10
else if (!_bit1.exist(n) && _bit2.exist(n))
{
_bit1.set(n);
_bit2.reset(n);
}
//10 It means two or more times , No need to deal with
}
void PrintOnceNum()
{
for (size_t i = 0; i < N; ++i)
{
if (!_bit1.exist(i) && _bit2.exist(i))//01
cout << i << " ";
}
cout << endl;
}
private:
fl::Biteset<N> _bit1;// Indicates the second digit
fl::Biteset<N> _bit2;// Means first
};
If a number corresponds to 01, So when it reappears , You need to change the corresponding bit to 10, In fact, it is similar to carry operation .
Code testing :
void test_towbitset()
{
TowBiteSet<100> towset;
int arr[] = {
1, 5, 9, 6, 3, 25, 7, 8, 7, 8, 7, 9 };
for (auto& e : arr)
{
towset.Set(e);
}
towset.PrintOnceNum();
}

2、 The bloon filter
2.1 What is a bloon filter
The bloom filter is made of bloom (Burton Howard Bloom) stay 1970 Put forward in A compact 、 A more ingenious probabilistic data structure , Features efficient insertion and query , It can be used to tell you “ Something must not exist or may exist ”, It uses multiple hash functions , Map a data into a bitmap structure . This way can not only Improve query efficiency , It can also save a lot of memory space .
Bitmaps can only be used to determine the status of integers , If you need to judge the status of decimals or strings ? At this point, you need to use the bloom filter . Then how to achieve it ? Roughly speaking, data is mapped to a single or multiple bits through a single or multiple hash functions .
2.2 Advantages and disadvantages of Bloom filter
advantage :
- The time complexity of adding and querying elements is :O(K), (K Is the number of hash functions , Generally small ), It has nothing to do with the amount of data
- Hash functions have nothing to do with each other , Convenient hardware parallel operation
- The bloom filter does not need to store the elements themselves , It has great advantages in some occasions with strict confidentiality requirements
- When you can withstand a certain miscarriage of justice , Bloom filter has this great spatial advantage over other data structures
- When the data volume is large , The bloom filter can be used to represent the complete set , Other data structures cannot
- A bloom filter that uses the same set of hash functions can be interleaved 、 and 、 Difference operation
shortcoming :
- Misjudgment rate , That is, there is a false positive (False Position), That is, you can't accurately judge whether the element is in the set ( remedy : Create another white list , Store data that may be misjudged )
- Cannot get the element itself
- In general, you can't remove elements from a bloom filter
- If you delete by counting , There may be a counting loopback problem
2.3 The use of the bloon filter
When we use the news client to watch news , It will constantly recommend new content to us , Every time it's recommended, it has to be heavy , Get rid of what you've seen . The problem is coming. , News client recommendation system how to achieve push to heavy ? The server records all the historical records that the user has seen , When the recommendation system recommends news, it will filter from each user's history , Filter out existing records , For such a scenario, the bloom filter is used . Similar to that :
- Email filtering , Use bloom filter to filter email blacklist
- Filter the web address of the crawler , Those who have climbed no longer climb
- Tiktok 、 Short Kwai , The video that has been swiped will not be swiped again
- Registration time , You need to fill in the nickname , Determine whether the nickname is used by other users
- Some databases have built-in Bloom filters , Used to determine whether data exists , It can reduce the of database IO request , To improve performance
2.4 The principle of Bloom filter
The bloom filter is actually a binary array plus multiple random hash functions . The data is mapped to the corresponding bits through hash function calculation . If the corresponding bit has been set to 1, Then the mapping to another bit is calculated by another hash function , But the number of hash functions is limited , therefore There will be some misjudgment when judging the status of data .
The following is a simple schematic diagram of Bloom filter :
2.5 Misjudgment of Bloom filter
Suppose you now need to judge a data a And data b Whether there is , hypothesis a By hash function a,b,c Mapped to location 3、 Location 5 And location 13, The state of these three positions is existence , At this point, we think that the data a There is , But it doesn't really exist , This is the misjudgment of Bloom filter . hypothesis b By hash function a,b,c Mapped to location 7、 Location 8 And location 9, The status of these three positions is nonexistent , So the data b There must be no .
therefore : There may be misjudgment about the existence of data , For data does not exist , There must be no miscarriage of justice
The misjudgment of Bloom filter is inevitable , However, the misjudgment rate can be reduced through a variety of means , For example, the length of the bloom filter , Increase the number of hash functions to map multiple locations . So how to balance the length of Bloom filter and the number of hash functions ? Someone has given such a formula through calculation :
k Is the number of hash functions ,m Is the length of the bloom filter ,n Number of elements inserted for ,p Is the false positive rate .
Although the factors that affect the misjudgment rate are 3 individual , But I think the most influential factor is the length of the bloan filter
Why is that , I have little knowledge , I don't know the secret .
2.6 Implementation of the bloom filter
#pragma once
#include<bitset>
#include<string>
#include<iostream>
using namespace std;
// hash function 1
struct BKDRHash
{
size_t operator()(const string& s)
{
//BKDR Hash
size_t value = 0;
for (auto ch : s)
{
value *= 31;
value += ch;
}
return value;
}
};
// hash function 2
struct APHash{
size_t operator()(const string& s)
{
size_t hash = 0;
for (long i = 0; i < s.size(); i++)
{
if ((i & 1) == 0)
{
hash ^= ((hash << 7) ^ s[i] ^ (hash >> 3));
}
else
{
hash ^= (~((hash << 11) ^ s[i] ^ (hash >> 5)));
}
}
return hash;
}
};
// hash function 3
struct DJBHash{
size_t operator()(const string& s)
{
size_t hash = 5381;
for (auto ch : s)
{
hash += (hash << 5) + ch;
}
return hash;
}
};
template<size_t N,
class K = string,
class HashFunc1 = BKDRHash,
class HashFunc2 = APHash,
class HashFunc3 = DJBHash>
class BloomFilter
{
public:
void Set(const K& key)
{
size_t len = 4 * N;
size_t index1 = HashFunc1()(key) % len;
size_t index1 = HashFunc2()(key) % len;
size_t index1 = HashFunc3()(key) % len;
_bs.set(index1);
_bs.set(index2);
_bs.set(index3);
}
// Determine if the data exists , There may be misjudgment
void Test(const K& key)
{
size_t len = 4 * N;
size_t index1 = HashFunc1()(key) % len;
if (_bs.test(index1) == false)
return false;
size_t index2 = HashFunc2()(key) % len;
if (_bs.test(index2) == false)
return false;
size_t index3 = HashFunc3()(key) % len;
if (_bs.test(index3) == false)
return false;
// When 3 Only when all the positions exist , But it may be misjudged
return true;
}
private:
bitset<4*N> _bs;// According to the previous formula, we can calculate 1 Two data correspond to 4 A bit
};
We didn't write delete here , Why? ? Please look at the chart below. :

If you want to delete, it is not impossible . Each tag bit of the bloom filter is represented by multiple bits , Used to calculate how many data are mapped to this bit , Equivalent to reference count . But on the whole , It consumes more space , The advantages of the bloan filter are also reduced .
3、 Hash shard
Given two files A and B, There were 100 One hundred million query, We only have 1G Memory , How to find the intersection of two files ? Please give the exact algorithm
If this 100 One hundred million query The size is 100 individual G, The two files add up to 200 individual G
Method 1: Divide each file into equal sized 200 Small file , Each small file accounts for 500M, take A
The small files in are and respectively B One by one to find the intersection of small files in , Then will ask 20100(200+199+…+2+1) Time , The amount of calculation is also quite large , The main reason is the same query It may be in different small files , For example, one in A1 in , In a B10 in .
Method 2: For the method 1 The situation of , You can use hash sharding . Will be similar to query In a small file with the same number , It just needs A1 and B1 Compare ,A2 and B2 Compare … We only need to compare 200 Next time .
But there is still a problem , A single file is too large , May exceed 1G, At this point, you need to change a hash function to segment , Until satisfaction no longer exceeds 1G until .
边栏推荐
- Pytorch官方Faster R-CNN源代码解析(一)——特征提取
- MUI登录数据库完善与AJAX异步处理【MUI+Flask+MongoDB+HBuilderX】
- Object value taking method in JS And []
- VNCTF2022 [WEB]
- this.$ How to solve the problem when refs is undefined?
- ITK multi-stage registration
- vtk 三视图
- JS convert string to array object
- Summary of knowledge points of ES6, ES7, es8, es9, ES10, es11 and ES12 (interview)
- 2021-11-16
猜你喜欢

Influxdb2.x benchmark tool - influxdb comparisons

Attack and defense world re (New 1 hand zone) questions 1-12

嵌入式系统概述3-嵌入式系统的开发流程和学习基础、方法

Iterator, generator generator details

Buu question brushing record - 4

Introduction and test of MySQL partition table

About paiwen

关系代数笛卡尔积和自然连接的例子

Object. Detailed explanation of assign()

Pytorch官方Faster R-CNN源代码解析(一)——特征提取
随机推荐
三维坐标点拟合球(matlab and C )
数组——双指针技巧秒杀七道数组题目
Microsoft Word tutorial, how to insert a header or footer in word?
A "murder case" caused by ES setting operation
二叉树(构造篇)
Time series database - incluxdb2 docker installation
Advanced chapter of C language -- ten thousand words explanation pointer and qsort function
Reasons for college students' leave
嵌入式系统硬件构成-嵌入式系统硬件体系结构
二叉树(序列化篇)
机械臂雅可比矩阵IK
Overview of embedded system 3- development process, learning basis and methods of embedded system
itk::SymmetricForcesDemonsRegistrationFilter
Newton method for solving roots of polynomials
【C语言】关键字static&&多文件&&猜字游戏
itkMultiResolutionImageRegistrationMethod
SWI-Prolog的下载与使用
vant 标签栏+上拉加载+下拉刷新demo van-tabs+van-pull-refresh+van-list demo
Promise+ handwritten promise
数组——二维数组的花式遍历技巧