当前位置：网站首页>Redis data type (VII) -- hyperloglog

Redis data type (VII) -- hyperloglog

2022-06-12 06:18:00 【leo_ messi94】

Preface

At work , The functional requirements related to statistics are often met , For example, Statistics website PV（pageview Page visits ）, have access to reids Of incr、incrby Make it easy .

But like UV（uniquevistor, Independent visitor ）、 Independent IP Count 、 How to solve the problems that need to be de duplicated and counted, such as the number of search records ？ This problem of finding the number of non repeating elements in a set becomes a technical problem

There are many solutions to technical problems ：

The data is stored in mysql In the table , Use distinct count Calculate the number of non duplicates
Use Redis Provided hash、set、bitmaps And other data structures

The results of the above scheme are accurate , But as the data increases , Resulting in more and more occupied space , It is impractical for very large data sets .

Can you reduce the accuracy to balance the storage space ？Redis Push it out HyperLogLog

brief introduction

Redis HyperLogLog Is used to do technical statistics algorithm , Advantage is , When the number or volume of input elements is very, very large , The space needed to calculate the cardinality is always fixed , And it's very small .

stay redis in , Every hyperLogLog Keys only cost 12KB Memory , Note that proximity can be calculated 2^64 Cardinality of different elements . This is the same as calculating the cardinality , The more elements, the more memory consuming collections form a sharp contrast .

however , because HyperLogLog Only the input elements will be used to calculate the cardinality , Instead of storing the input elements of this book , therefore HyperLogLog It can't be like a collection , Return the various elements of the input .

What is the cardinality

Like data sets {1,3,5,7,5,7,8} The cardinality set of is {1,3,5,7,8}, base （ Number of non repeating elements ） by 5. The cardinality is within the acceptable range of error , Fast base calculation .

command ：

1.pfadd： Store parameters other than the first parameter in a variable named with the first parameter HyperLogLog In structure

Format ：PFADD key element [element …]
One side effect of this command is that it may change this HyperLogLog To reflect the cardinality estimated when each unique object is added ( The cardinality of a set ).
If one HyperLogLog The approximate cardinality of the estimate changes during the execution of the command , PFADD return 1, Otherwise return to 0, If specified key non-existent , This command will automatically create an empty HyperLogLog structure （ Specify the length and encoding of the string ）.
It is also possible to provide only the variable name without specifying the element when calling the command , If this variable name exists , There will be no operation , If it doesn't exist , Then a data structure will be created （ return 1）
Return value ： If HyperLogLog The interior of the has been modified , Then the return 1, Otherwise return to 0
Complexity ：O(1)

redis> PFADD hll a b c d e f g
(integer) 1
redis> PFCOUNT hll
(integer) 7

2.pfcount： Return stored in HyperLogLog The approximate cardinality of the variable of the structure

Format ：PFCOUNT key [key …]
When the parameter is a key when , Return stored in HyperLogLog The approximate cardinality of the variable of the structure , If the variable does not exist , Then return to 0.
When Parameters are multiple key when , Back to these HyperLogLog Approximate cardinality of Union , This value will be given to all key Of HyperLoglog The structure is merged into a temporary HyperLogLog Calculated in the structure .
HyperLogLog Can use fixed and very little memory （ Every HyperLogLog The structure requires 12K Bytes plus key A few bytes of itself ） To store the only elements of the collection .
The returned visible set cardinality is not an exact value , But one with 0.81% The standard error （standard error） Approximate value .
For example, to record how many different search queries are executed in a day , A program can call once each time a search query is executed PFADD, And by calling PFCOUNT Command to get the approximate result of this record .
Be careful : One side effect of this command is that it may cause HyperLogLog Internal changed , For caching purposes , It will use 8 Bytes to record the cardinality of the latest calculation , therefore PFCOUNT The command is technically a write command
Return value ： Approximate number of unique elements added
Complexity ：O(1) with every small average constant times when called with a single key. O(N) with N being the number of keys, and much bigger constant times, when called with multiple keys

redis> PFADD hll foo bar zap
(integer) 1
redis> PFADD hll zap zap zap
(integer) 0
redis> PFADD hll foo bar
(integer) 0
redis> PFCOUNT hll
(integer) 3
redis> PFADD some-other-hll 1 2 3
(integer) 1
redis> PFCOUNT hll some-other-hll
(integer) 6

performance

When calling PFCOUNT Command to specify a key Is the parameter , Good performance , Even with dealing with one HyperLogLog It takes as little time . This may be related to PFCOUNT The command can directly use the estimated cardinality of the cache , Most of the PFADD Nor will any registers be updated , So this value is rarely changed . Theoretically, it can achieve hundreds of operations per second .

When calling PFCOUNT Command to specify multiple key, Because there are many HperLogLog Structure to perform a slower merge operation , Moreover, the cardinality obtained by union calculation cannot be cached , PFCOUNT The command also takes milliseconds to execute multiple key Union operation of , It will take a long time , So don't abuse this multiple key The way .

The user needs to understand this command to handle 1 individual key And multiple key The semantics of execution are different , And the performance of the execution is also different .

For more information, please refer to This article . Source code hyperloglog.c The document is also easy to understand , Including sparse and dense implementation of the code .

3.pfmerge： Will be multiple HyperLogLog Merge （merge） For one HyperLogLog

Format ：PFMERGE destkey sourcekey [sourcekey …]
After the merger HyperLogLog The cardinality of is close to all inputs HyperLogLog The visible set of （observed set） Union .
It's a combination of HyperLogLog Will be stored in the target variable （ The first parameter ） Inside , If the key doesn't exist , So before the command is executed , An empty... Will be created for the key first
Return value ：OK
Complexity ：O(N) to merge N HyperLogLogs, but with high constant times

redis> PFADD hll1 foo bar zap a
(integer) 1
redis> PFADD hll2 a b c foo
(integer) 1
redis> PFMERGE hll3 hll1 hll2
OK
redis> PFCOUNT hll3
(integer) 6

原网站

版权声明
本文为[leo_ messi94]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/03/202203010611073190.html