
Resolving Hash Conflicts and Implementing a Simple Hash Table

2022-06-26 18:11:00 A goblin

1. Concept

In a sequential structure or a balanced tree, there is no fixed relationship between an element's key and its storage location, so searching for an element requires multiple key comparisons. Sequential search has time complexity O(N); in a balanced tree the cost is proportional to the height of the tree, i.e. O(log₂N). In both cases, search efficiency depends on the number of element comparisons performed during the search.

Ideal search method: obtain the element being searched for directly from the table in one step, without any comparisons. If we can construct a storage structure in which some function (hashFunc) establishes a one-to-one mapping between an element's key and its storage location, then the element can be found quickly during a search.

With this structure:

  • Inserting an element

    Compute the storage location of the element from its key with the hash function, and store the element at that location.

  • Searching for an element

    Perform the same computation on the element's key and treat the function value as the element's storage location. Compare the element at that position in the structure with the target; if
    the keys are equal, the search succeeds.

This approach is called the hash (hashing) method; the conversion function used in it is called the hash function, and the structure built this way is called a hash table (Hash Table).

For example, take the data set {1,7,6,4,5,9};
let the hash function be hash(key) = key % capacity, where capacity is the total size of the underlying storage array.
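A minimal sketch of this mapping (the capacity of 10 and the class and variable names are assumptions chosen to match the example):

```java
// Sketch: hash(key) = key % capacity maps each key of {1,7,6,4,5,9}
// to an array slot; with capacity 10 every key in this data set
// happens to land at the index equal to its own value.
public class ModHashDemo {
    static final int CAPACITY = 10;

    static int hash(int key) {
        return key % CAPACITY;
    }

    public static void main(String[] args) {
        Integer[] table = new Integer[CAPACITY];
        for (int key : new int[]{1, 7, 6, 4, 5, 9}) {
            table[hash(key)] = key;   // no collisions for this data set
        }
        System.out.println(java.util.Arrays.toString(table));
    }
}
```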

Searching with this method does not require comparing the key multiple times, so lookups are fast.

Question: with the hash method above, what happens if we insert the element 44 into the set?

Clearly 44 % 10 = 4, but an element is already stored at array index 4. This is the problem we face: hash collisions (hash conflicts).

2. Conflict — Concept

Suppose two data elements have keys k_i and k_j (i != j) with k_i != k_j, yet Hash(k_i) == Hash(k_j): different keys produce the same hash address through the same hash function. This phenomenon is called a hash collision (hash conflict).
Data elements that have different keys but the same hash address are called "synonyms".

3. Conflict — Avoidance

First, we need to be clear about one thing: because the capacity of the hash table's underlying array is usually smaller than the actual number of keys to store, collisions are unavoidable. What we can do is try our best to reduce the collision rate.

4. Conflict — Avoidance — Hash function design

One possible cause of hash collisions is a poorly designed hash function. Design principles for hash functions:

  • The domain of the hash function must include all keys to be stored, and if the hash table allows m addresses, its range must fall between 0 and m-1
  • The addresses computed by the hash function should be evenly distributed over the whole space
  • The hash function should be as simple as possible

Common hash functions

1). Direct addressing method (commonly used)
Take a linear function of the key as the hash address: Hash(Key) = A*Key + B. Advantages: simple and uniform. Disadvantages: the distribution of the keys must be known in advance. Use case: suitable when the keys are relatively small and continuous.

2). Division-remainder method (commonly used)
Let the number of addresses allowed in the hash table be m. Take a prime p that is not greater than m but close (or equal) to m as the divisor, and convert the key into a hash address with the hash function Hash(key) = key % p (p <= m).

3). Mid-square method (for reference)
Suppose the key is 1234; its square is 1522756, and we extract the middle 3 digits, 227, as the hash address. Likewise, for the key 4321 the square is 18671041, and we extract the middle 3 digits, 671 (or 710), as the hash address. The mid-square method is suitable when the distribution of the keys is unknown and the number of digits is not very large.

4). Folding method (for reference)
Folding splits the key from left to right into several parts with an equal number of digits (the last part may be shorter), adds these parts together, and then, according to the hash table length, takes the last few digits as the hash address. The folding method is suitable when the distribution of the keys is not known in advance and the keys have many digits.

5). Random number method (for reference)
Choose a random function and take the random-function value of the key as its hash address, i.e. H(key) = random(key), where random is a random-number function. This method is usually used when key lengths differ.

6). Digit analysis method (for reference)
Suppose there are n keys of d digits each, where each digit position can hold one of r different symbols. These r symbols do not necessarily appear with the same frequency in every position: in some positions they are evenly distributed, with each symbol equally likely to appear, while in other positions the distribution is uneven and only some symbols appear often. Based on the size of the hash table, select as the hash address those digit positions in which the symbols are evenly distributed.

The digit analysis method is usually suitable when there are many keys, the distribution of the keys is known in advance, and several digit positions of the keys are fairly evenly distributed.
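Three of the methods above can be sketched directly in code (a hedged illustration: the class and method names, the 3-digit window, and the chunk length are assumptions, not part of a fixed standard):

```java
// Sketches of the division-remainder, mid-square, and folding hash functions.
public class HashFunctions {
    // 2) Division-remainder: p is a prime no greater than the table size m.
    static int divisionHash(int key, int p) {
        return key % p;
    }

    // 3) Mid-square: square the key and take the middle r decimal digits.
    static int midSquare(long key, int r) {
        String sq = Long.toString(key * key);
        int start = (sq.length() - r) / 2;               // center the r-digit window
        return Integer.parseInt(sq.substring(start, start + r));
    }

    // 4) Folding: split the decimal key into partLen-digit chunks and sum them.
    static int folding(long key, int partLen, int tableLen) {
        String s = Long.toString(key);
        int sum = 0;
        for (int i = 0; i < s.length(); i += partLen) {
            int end = Math.min(i + partLen, s.length()); // last part may be shorter
            sum += Integer.parseInt(s.substring(i, end));
        }
        return sum % tableLen;                           // keep the address in range
    }
}
```

For instance, midSquare(1234, 3) reproduces the article's example: 1234² = 1522756, whose middle three digits are 227.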

Note: the more carefully the hash function is designed, the lower the likelihood of collisions, but collisions can never be avoided entirely.

5. Conflict — Avoidance — Load factor regulation (focus on mastering)

The load factor of a hash table is defined as: a = number of elements in the table / length of the hash table.

a is a measure of how full the hash table is. Since the table length is a fixed value, a is proportional to the number of elements in the table: the larger a is, the more elements the table holds and the more likely collisions are; conversely, the smaller a is, the fewer elements the table holds and the less likely collisions are. In fact, the average search length of a hash table is a function of the load factor a; different collision-resolution methods simply give different functions.

For open addressing, the load factor is a particularly important parameter and should be strictly kept below 0.7-0.8. Above 0.8, CPU cache misses during table lookups rise along an exponential curve. For this reason, some hash libraries that use open addressing, such as the Java system libraries, limit the load factor to 0.75 and resize the hash table when this value is exceeded.

The collision rate grows with the load factor, so when the collision rate becomes unacceptable, we need to lower it by reducing the load factor.

Since the number of keys already stored in the hash table cannot change, all we can adjust is the size of the hash table's underlying array.
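The rule above reduces to one comparison (a minimal sketch; the class and method names are assumptions, and 0.75 mirrors the Java threshold mentioned earlier):

```java
// Minimal sketch: decide whether to grow the table based on the load factor.
public class LoadFactorCheck {
    static double loadFactor(int size, int capacity) {
        return (double) size / capacity;   // note: integer division would always yield 0
    }

    static boolean needsResize(int size, int capacity, double maxLoad) {
        return loadFactor(size, capacity) >= maxLoad;
    }
}
```

The cast to double matters: with int operands, 75 / 100 evaluates to 0 and the table would never resize.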

6. Conflict — Resolution

Two common ways to resolve hash collisions are closed hashing (open addressing) and open hashing (separate chaining).

7. Conflict — Resolution — Closed hashing

Closed hashing is also called open addressing. When a hash collision occurs, as long as the hash table is not full, there must be an empty slot somewhere in it, so the key can be stored in the "next" empty position after the slot where the collision occurred. How do we find that next empty position?

1) Linear probing
Take the scenario above: we now need to insert the element 44. The hash function first gives subscript 4, so 44 should in theory be placed at that position, but the slot is already occupied by the element 4; that is, a hash collision occurs.

Linear probing: starting from the position where the collision occurred, probe backward slot by slot until the next empty position is found.

Insertion:

  • Use the hash function to compute the position of the element to insert in the hash table
  • If there is no element at that position, insert the new element directly; if the position is occupied (a hash collision has occurred), use linear probing to find the next empty position and insert the new element there

  • When closed hashing is used to handle collisions, existing elements in the hash table cannot simply be deleted, because removing an element outright would break the search for other
    elements. For example, if element 4 were deleted directly, the subsequent search for 44 could fail. Linear probing therefore uses a marker, the so-called pseudo-deletion method, to delete an element.
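The insert/search/pseudo-delete steps above can be sketched as follows (a minimal illustration, not the article's implementation in section 10; the class name, the fixed capacity of 10, and the tombstone sentinel are assumptions):

```java
// Closed hashing (open addressing) with linear probing and pseudo-deletion.
public class LinearProbeSet {
    private static final Integer TOMBSTONE = Integer.MIN_VALUE; // marks a deleted slot
    private final Integer[] slots = new Integer[10];

    // Assumes the table is not full (a real table would resize first).
    public void insert(int key) {
        int i = key % slots.length;
        while (slots[i] != null && !slots[i].equals(TOMBSTONE)) {
            i = (i + 1) % slots.length;      // probe the next slot
        }
        slots[i] = key;
    }

    public boolean contains(int key) {
        int i = key % slots.length;
        for (int probes = 0; probes < slots.length && slots[i] != null; probes++) {
            if (slots[i].equals(key)) return true;
            i = (i + 1) % slots.length;      // keep probing past tombstones and mismatches
        }
        return false;                        // hit an empty slot: key is absent
    }

    public void remove(int key) {
        int i = key % slots.length;
        for (int probes = 0; probes < slots.length && slots[i] != null; probes++) {
            if (slots[i].equals(key)) { slots[i] = TOMBSTONE; return; } // pseudo-delete
            i = (i + 1) % slots.length;
        }
    }
}
```

The tombstone is what keeps the probe chain intact: after removing 4, a search for 44 still walks past slot 4 instead of stopping at an empty slot.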

2) Quadratic probing

The drawback of linear probing is that colliding data pile up together, which follows from how the next empty position is found: by checking the following slots one by one. To avoid this problem, quadratic probing computes the next candidate position as H_i = (H_0 + i^2) % m, or H_i = (H_0 - i^2) % m, where i = 1,2,3…, H_0 is the position computed from the element's key by the hash function Hash(x), and m is the size of the table. In the earlier example, inserting 44 collides at H_0 = 4; probing with i = 1 gives (4 + 1) % 10 = 5, which is occupied by 5, and i = 2 gives (4 + 4) % 10 = 8, which is empty, so 44 is stored at index 8.

Research has shown that when the table length is a prime and the load factor a does not exceed 0.5, a new entry can always be inserted, and no position is probed twice. So as long as half the positions in the table are empty, the table-full problem never arises. The table-full condition can be ignored during search, but during insertion the load factor a must be kept at or below 0.5; once it is exceeded, the table must be grown.
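The probe sequence H_i = (H_0 + i²) % m can be computed directly (a small sketch; the class and method names are assumptions):

```java
// Quadratic probing: the i-th candidate slot for home position h0 in a table of size m.
public class QuadraticProbe {
    static int probe(int h0, int i, int m) {
        return (h0 + i * i) % m;   // i = 0 returns the home slot h0 itself
    }
}
```

For h0 = 4 and m = 10 this yields the sequence 4, 5, 8, 3, … — spreading the probes out instead of clustering them in adjacent slots.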

Therefore, the biggest drawback of closed hashing is its relatively low space utilization; this is also an inherent defect of the approach.

8. Conflict — Resolution — Open hashing / hash buckets (focus on mastering)

Open hashing is also called the chained-address method (separate chaining). First, the hash function computes a hash address for each key in the set; keys with the same address belong to the same sub-collection, called a bucket. The elements in each bucket are linked together in a singly linked list, and the head node of each list is stored in the hash table array.

Each bucket in the hash table holds the elements whose keys collide at that address.
Open hashing can be seen as turning a search problem over one large set into search problems over many small sets.

9. Handling severe collisions

As just mentioned, hash buckets effectively turn the search problem of one large set into the search problems of many small sets. If collisions are severe, those small-set searches themselves perform poorly. In that case, we can transform the small-set search problems further, for example:
1). attach another hash table behind each bucket
2). attach a search tree behind each bucket

10. Implementing a hash table

I use open hashing (separate chaining) to resolve hash collisions,
with the division-remainder method (commonly used) as the hash function.
Three hash table operations are implemented here:
1) insertion (with resizing)
2) lookup
3) deletion

// Hash collisions are handled with separate chaining (open hashing)
public class MyHashMap {

    static class Node {
        public int key;
        public int value;
        public Node next;

        public Node(int key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    private static final double LOAD_FACTOR = 0.75; // load factor constant, used to decide when to resize
    // array is the hash table body; each array element is the head node of a linked list
    private Node[] array = new Node[101];
    private int size = 0; // current number of elements in the hash table

    private int hashFunc(int key) {
        // Map the key to its storage position in the hash table;
        // real implementations are much more involved than this
        return key % array.length;
    }

    // If the key already exists in the hash table, update its value;
    // otherwise insert a new key-value pair
    public void put(int key, int value) {
        // 1. Map the key to an array index
        int index = hashFunc(key);
        // Get the node at this position (the head of the linked list)
        Node list = array[index];
        // Check whether the key already exists in the list
        for (Node cur = list; cur != null; cur = cur.next) {
            if (cur.key == key) {
                // If it exists, update the value of the current node
                cur.value = value;
                return;
            }
        }
        // The loop ended without finding the key, so insert at the corresponding index.
        // Head insertion is used here; tail insertion would also work
        Node newNode = new Node(key, value);
        newNode.next = list;
        array[index] = newNode;
        size++;
        if ((double) size / array.length >= LOAD_FACTOR) {
            resize(); // grow the table once the load factor is reached
        }
    }

    private void resize() {
        // Grow the table and rehash every node into the new array
        Node[] newArray = new Node[array.length * 2];
        for (int i = 0; i < array.length; i++) {
            for (Node cur = array[i]; cur != null; cur = cur.next) {
                int index = cur.key % newArray.length;
                Node newNode = new Node(cur.key, cur.value);
                newNode.next = newArray[index];
                newArray[index] = newNode;
            }
        }
        array = newArray;
    }

    // Lookup: return the value for the key if found, otherwise null
    public Integer get(int key) {
        // First map the key to an array index
        int index = hashFunc(key);
        // Take the linked list at that position
        Node list = array[index];
        // Traverse the list looking for the key
        for (Node cur = list; cur != null; cur = cur.next) {
            if (cur.key == key) {
                return cur.value; // found it
            }
        }
        // Not found
        return null;
    }

    // Delete the node with the given key
    public void remove(int key) {
        int index = hashFunc(key);
        Node list = array[index];
        if (list == null) {
            return; // nothing stored at this index
        }
        if (list.key == key) {
            // Handle deletion of the head node
            array[index] = list.next;
            size--;
            return;
        }
        // For other positions, find the previous node in order to unlink the target
        Node prevNode = searchPrev(key);
        if (prevNode == null) {
            return; // key not present
        }
        Node toRemove = prevNode.next;
        prevNode.next = toRemove.next;
        size--;
    }

    // Find the node immediately before the node to delete
    private Node searchPrev(int key) {
        int index = hashFunc(key);
        Node list = array[index];
        for (Node cur = list; cur != null && cur.next != null; cur = cur.next) {
            if (cur.next.key == key) {
                return cur;
            }
        }
        return null;
    }
}

11. Performance analysis

Although hash tables constantly battle collisions, in actual use the collision rate of a hash table is not high and the number of collisions is controllable, i.e. the length of the list in each bucket is effectively a constant. Therefore, in the usual sense, the time complexity of hash table insertion/deletion/search is O(1).

Original site

Copyright notice
This article was written by [A goblin]; please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/177/202206261801123387.html