当前位置：网站首页>Principle understanding and application of hash table and consistent hash

Principle understanding and application of hash table and consistent hash

2022-07-27 03:43:00 【Aries_ Ro】

hash Table consistency hash

Here's the catalog title

hash Table consistency hash
Hashtable
Distributed consistency hash

Hashtable

Hashtable （hash） Also known as hash table, it is a data structure , According to key Calculation key Position in the table ; yes key Mapping relationship with the storage address （ Such as ：Hash(key)=address） .

In the node of the hash table key and value Is stored together ; As shown below

struct node {
    
void *key;
void *val;
struct node *next;
};

The hash table has several factors ：hash function 、 Load factor

hash function

hash A function is a mapping relationship of addresses Hash(key)=address ;

however hash The function may put two or more different key Map to the same address , This situation is called Hash Collisions （ perhaps hash Collision ）;

Several classic hash function ：
murmurhash1,murmurhash2,murmurhash3,siphash（ redis6.0 Use of ,rust And most languages siphash Algorithm to achieve hashmap ）,cityhash All have Strong random distribution ;

Load factor

Load factor Is to describe hash Conflict The intensity of .

Load factor =（ The number of elements stored in the array / Data length ）; Used to describe the storage density of hash tables ;
The smaller the load factor , The less conflict , The greater the load factor , The greater the conflict ;

Conflict handling

The linked list method

Linked list method is to use conflict elements Linked list link ; This is also a common way to deal with conflicts .

But there may be an extreme case , There are many conflict elements , The The conflict list is too long , The time complexity of finding data also becomes higher .

At this time, you can convert this linked list into Red and black trees 、 The smallest pile ; From the time complexity of the original linked list to the time complexity of the red black tree ;

Open addressing

Open addressing method is to store all elements in the array of hash table ,** No additional data structures ;** Generally used Linear probe How to solve this problem ;

When inserting a new element , Use hash function to locate element position in hash table ;
Check whether there is an element in the slot index in the array . If the slot is empty , The insert , otherwise 3;
stay The first 2 Add a certain step length to the slot index of step detection, and then check ; Add a certain step length and divide into the following ：
i+1,i+2,i+3,i+4, … ,i+n
i- ${1}^{2}$ ,i+ ${2}^{2}$ ,i- ${3}^{2}$ ,1+ ${4}^{2}$ , … These two methods can temporarily resolve hash conflicts , But it will lead to similar hash Gather ; That is, the approximate value of its hash The value is also approximate , Then its array slot is also close to , formation hash Gather ; The first kind of congeneric aggregation conflict comes first , The second is simply to postpone the gathering conflict ; have access to Double hash To solve the problems above hash Aggregation phenomenon ：

STL Application of hash table in

stay STL in unordered_map 、unordered_set 、unordered_multimap
、unordered_multiset The underlying implementation of the four data structures is Hashtable ;

Different from ordinary hash tables ：

Ordinary hash table ： There will be a linked list on a hash value

Insert picture description here

STL Medium “ Hashtable ”

Insert picture description here

STL The open chain method is also used to solve hash Conflict , But it is not a hash value, a linked list , It is Whole hash Table is A linked list （ It should be for better implementation STL Iterators in , Only then has this kind of structure ）. Inserting a new data will adopt the chain header interpolation , If the sending hash conflicts, the linked list will be used to insert data into the node .

Hashtable VS Binary tree

Both hash table and balanced binary tree can query whether a string exists from massive data .

The time complexity of adding, deleting, modifying and checking balanced binary tree is O(logn) ; namely 100 Ten thousand nodes , Compare at most 20 Time ;10 Billion nodes , Compare at most 30 Time ;

The purpose of balance is to add, delete and modify , Ensure that the next search can stably eliminate half of the data ;
summary ： Balanced binary tree through node comparison , Both guarantee Orderly , You can also exclude half of the elements each time to achieve the purpose of rapid indexing ; The lookup time complexity of hash table is O(1), But the order of data cannot be guaranteed . So both have their own advantages and disadvantages .

Distributed consistency hash

background ： Distributed consistency hash The algorithm organizes the hash space into a Virtual ring , The number of nodes of the ring is ${2}^{32}$ ;
Algorithm for ： hash(ip) % ${2}^{32}$ , In the end, you'll get one [0, ${2}^{32}$ -1] An unsigned integer between , This integer represents the number of the server ; Multiple servers are running in this way hash Map a point on the ring to identify the location of the server ; When a user operates a key, Generate a value by the same algorithm , Position a server clockwise along the ring , Then the key In this server .

As shown in the figure ：

Insert picture description here

In the figure, we first pass hash(ip) %( ${2}^{32}$ ) Calculate the location of four servers , Re pass hash(ip) %( ${2}^{32}$ ) Calculate the user's location . For example, users 1 Where “ Yellow dot ”, Look for the nearest server clockwise 3, Then the user 1 Of key There are servers 3 in . Similarly, users 2 There are servers 2 in .

Application scenarios

Distribute data evenly among different servers , To share the pressure of the cache server ;
The change in the number of cache servers should not affect cache invalidation ;

Hash offset

The result of hash algorithm is random , There is no guarantee that the server nodes are evenly distributed over the hash ring , This is it. Hash offset ;

Uneven distribution will cause uneven request access , This leads to uneven pressure on the server ; As shown in the figure ：

Insert picture description here

Many users are saved on the server 1 in , The server 1 The pressure . And the server 3 It may be easier .

To solve the problem of hash offset , Introduced Virtual node The concept of .

Virtual node

Theoretically , The more nodes on the hash ring , The more evenly distributed the data ; To solve the problem of hash offset , Multiple hash nodes can be calculated for each service node （ The new hash node is the virtual node ）;

The usual way is , hash(“IP:PORT:seqno”)%（ ${2}^{32}$ ） ; For example, put the above server 3 Add several virtual nodes （ The new virtual nodes will also be randomly distributed on the hash ring , Not necessarily with service 3 adjacent ）, Add servers 3 The amount of work , Thus, the server 1 The amount of work .

Delete and add nodes

Suppose you delete the server 2, It was originally stored on the server 2 Users of 2, It will be taken out and found the server clockwise again 1 node , Save on the server 1 in .

Insert picture description here

If you add a server node , The added server node will be found counterclockwise to all user nodes between the previous server and saved in the new server node .

原网站

版权声明
本文为[Aries_ Ro]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/208/202207270010248305.html