当前位置:网站首页>Redis data type (VII) -- hyperloglog

Redis data type (VII) -- hyperloglog

2022-06-12 06:18:00 leo_ messi94

Preface

At work , The functional requirements related to statistics are often met , For example, Statistics website PV(pageview Page visits ), have access to reids Of incr、incrby Make it easy .

But like UV(uniquevistor, Independent visitor )、 Independent IP Count 、 How to solve the problems that need to be de duplicated and counted, such as the number of search records ? This problem of finding the number of non repeating elements in a set becomes a technical problem

There are many solutions to technical problems :

  1. The data is stored in mysql In the table , Use distinct count Calculate the number of non duplicates
  2. Use Redis Provided hash、set、bitmaps And other data structures

The results of the above scheme are accurate , But as the data increases , Resulting in more and more occupied space , It is impractical for very large data sets .

Can you reduce the accuracy to balance the storage space ?Redis Push it out HyperLogLog

brief introduction

Redis HyperLogLog Is used to do technical statistics algorithm , Advantage is , When the number or volume of input elements is very, very large , The space needed to calculate the cardinality is always fixed , And it's very small .

stay redis in , Every hyperLogLog Keys only cost 12KB Memory , Note that proximity can be calculated 2^64 Cardinality of different elements . This is the same as calculating the cardinality , The more elements, the more memory consuming collections form a sharp contrast .

however , because HyperLogLog Only the input elements will be used to calculate the cardinality , Instead of storing the input elements of this book , therefore HyperLogLog It can't be like a collection , Return the various elements of the input .

What is the cardinality

Like data sets {1,3,5,7,5,7,8} The cardinality set of is {1,3,5,7,8}, base ( Number of non repeating elements ) by 5. The cardinality is within the acceptable range of error , Fast base calculation .

command :

1.pfadd: Store parameters other than the first parameter in a variable named with the first parameter HyperLogLog In structure

  • Format :PFADD key element [element …]
  • One side effect of this command is that it may change this HyperLogLog To reflect the cardinality estimated when each unique object is added ( The cardinality of a set ).
  • If one HyperLogLog The approximate cardinality of the estimate changes during the execution of the command , PFADD return 1, Otherwise return to 0, If specified key non-existent , This command will automatically create an empty HyperLogLog structure ( Specify the length and encoding of the string ).
  • It is also possible to provide only the variable name without specifying the element when calling the command , If this variable name exists , There will be no operation , If it doesn't exist , Then a data structure will be created ( return 1)
  • Return value : If HyperLogLog The interior of the has been modified , Then the return 1, Otherwise return to 0
  • Complexity :O(1)
redis> PFADD hll a b c d e f g
(integer) 1
redis> PFCOUNT hll
(integer) 7

2.pfcount: Return stored in HyperLogLog The approximate cardinality of the variable of the structure

  • Format :PFCOUNT key [key …]
  • When the parameter is a key when , Return stored in HyperLogLog The approximate cardinality of the variable of the structure , If the variable does not exist , Then return to 0.
  • When Parameters are multiple key when , Back to these HyperLogLog Approximate cardinality of Union , This value will be given to all key Of HyperLoglog The structure is merged into a temporary HyperLogLog Calculated in the structure .
  • HyperLogLog Can use fixed and very little memory ( Every HyperLogLog The structure requires 12K Bytes plus key A few bytes of itself ) To store the only elements of the collection .
  • The returned visible set cardinality is not an exact value , But one with 0.81% The standard error (standard error) Approximate value .
  • For example, to record how many different search queries are executed in a day , A program can call once each time a search query is executed PFADD, And by calling PFCOUNT Command to get the approximate result of this record .
  • Be careful : One side effect of this command is that it may cause HyperLogLog Internal changed , For caching purposes , It will use 8 Bytes to record the cardinality of the latest calculation , therefore PFCOUNT The command is technically a write command
  • Return value : Approximate number of unique elements added
  • Complexity :O(1) with every small average constant times when called with a single key. O(N) with N being the number of keys, and much bigger constant times, when called with multiple keys
redis> PFADD hll foo bar zap
(integer) 1
redis> PFADD hll zap zap zap
(integer) 0
redis> PFADD hll foo bar
(integer) 0
redis> PFCOUNT hll
(integer) 3
redis> PFADD some-other-hll 1 2 3
(integer) 1
redis> PFCOUNT hll some-other-hll
(integer) 6

performance

When calling PFCOUNT Command to specify a key Is the parameter , Good performance , Even with dealing with one HyperLogLog It takes as little time . This may be related to PFCOUNT The command can directly use the estimated cardinality of the cache , Most of the PFADD Nor will any registers be updated , So this value is rarely changed . Theoretically, it can achieve hundreds of operations per second .

When calling PFCOUNT Command to specify multiple key, Because there are many HperLogLog Structure to perform a slower merge operation , Moreover, the cardinality obtained by union calculation cannot be cached , PFCOUNT The command also takes milliseconds to execute multiple key Union operation of , It will take a long time , So don't abuse this multiple key The way .

The user needs to understand this command to handle 1 individual key And multiple key The semantics of execution are different , And the performance of the execution is also different .

For more information, please refer to This article . Source code hyperloglog.c The document is also easy to understand , Including sparse and dense implementation of the code .

3.pfmerge: Will be multiple HyperLogLog Merge (merge) For one HyperLogLog

  • Format :PFMERGE destkey sourcekey [sourcekey …]
  • After the merger HyperLogLog The cardinality of is close to all inputs HyperLogLog The visible set of (observed set) Union .
  • It's a combination of HyperLogLog Will be stored in the target variable ( The first parameter ) Inside , If the key doesn't exist , So before the command is executed , An empty... Will be created for the key first
  • Return value :OK
  • Complexity :O(N) to merge N HyperLogLogs, but with high constant times
redis> PFADD hll1 foo bar zap a
(integer) 1
redis> PFADD hll2 a b c foo
(integer) 1
redis> PFMERGE hll3 hll1 hll2
OK
redis> PFCOUNT hll3
(integer) 6
原网站

版权声明
本文为[leo_ messi94]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/03/202203010611073190.html