当前位置:网站首页>A complete tutorial for getting started with redis: hyperloglog
A complete tutorial for getting started with redis: hyperloglog
2022-07-04 22:53:00 【Gu Ge academic】
HyperLogLog It's not a new data structure ( The actual type is string type ), and
Is a cardinality algorithm , adopt HyperLogLog You can use very little memory space to complete the independent total number
The statistics of , Data sets can be IP、Email、ID etc. .HyperLogLog Provides 3 An order :
pfadd、pfcount、pfmerge. for example 2016-03-06 The access user is uuid-1、uuid-2、
uuid-3、uuid-4,2016-03-05 The access user is uuid-4、uuid-5、uuid-6、uuid-7, Such as
chart 3-15 Shown .

Be careful
HyperLogLog The algorithm is composed of Philippe
Flajolet(https://en.wikipedia.org/wiki/Philippe_Flajolet) stay The analysis of a
near-optimal cardinality estimation algorithm In this paper , If the reader is interested
You can read it yourself .
1. add to
pfadd key element [element … ]
pfadd Used to direct to HyperLogLog Additive elements , If the addition is successful, return 1:
127.0.0.1:6379> pfadd 2016_03_06:unique:ids "uuid-1" "uuid-2" "uuid-3" "uuid-4"
(integer) 1
2. Count the number of independent users
pfcount key [key … ]
pfcount Used to calculate one or more HyperLogLog Independent total of , for example
2016_03_06:unique:ids The total number of independents is 4:
127.0.0.1:6379> pfcount 2016_03_06:unique:ids
(integer) 4
If at this time to 2016_03_06:unique:ids Insert uuid-1、uuid-2、uuid-3、uuid-
90, The result is 5( newly added uuid-90):
127.0.0.1:6379> pfadd 2016_03_06:unique:ids "uuid-1" "uuid-2" "uuid-3" "uuid-90"
(integer) 1
127.0.0.1:6379> pfcount 2016_03_06:unique:ids
(integer) 5
At present, the effect of memory saving in this example is not very obvious , Now use the script to
HyperLogLog Insert 100 m id, Record before inserting info memory:
127.0.0.1:6379> info memory
# Memory
used_memory:835144
used_memory_human:815.57K
... towards 2016_05_01:unique:ids Insert 100 Million users , Insert... Every time 1000 strip :
elements=""
key="2016_05_01:unique:ids"
for i in `seq 1 1000000`
do
elements="${elements} uuid-"${i}
if [[ $((i%1000)) == 0 ]];
then
redis-cli pfadd ${key} ${elements}
elements=""
fi
done
When the above code is executed , You can see that the memory is only increased 15K about :
127.0.0.1:6379> info memory
# Memory
used_memory:850616
used_memory_human:830.68K
however , And you can see that pfcount The result is not 100 ten thousand :
127.0.0.1:6379> pfcount 2016_05_01:unique:ids
(integer) 1009838
It can be done to 100 m uuid Use the collection type to test , The code is as follows :
elements=""
key="2016_05_01:unique:ids:set"
for i in `seq 1 1000000`
do
elements="${elements} "${i}
if [[ $((i%1000)) == 0 ]];
then
redis-cli sadd ${key} ${elements}
elements=""
fi
done
You can see the memory usage 84MB:
127.0.0.1:6379> info memory
# Memory
used_memory:88702680
used_memory_human:84.59M
But the number of independent users is 100 ten thousand :
127.0.0.1:6379> scard 2016_05_01:unique:ids:set
(integer) 1000000
surface 3-6 Lists the use of set types and HperLogLog Count the occupied space of millions of users
Than .

You can see ,HyperLogLog Memory footprint is surprisingly small , But with such a small space
Such a huge amount of data , It must not be 100% Right , There must be an error rate .Redis Officer,
The number given by Fang is 0.81% The error rate of .
3. Merge
pfmerge destkey sourcekey [sourcekey ...]
pfmerge You can find multiple HyperLogLog And assign to destkey, For example, to calculate
2016 year 3 month 5 Day and 3 month 6 Number of independent users per day , This can be done as follows , Sure
See that the number of end independent users is 7:
127.0.0.1:6379> pfadd 2016_03_06:unique:ids "uuid-1" "uuid-2" "uuid-3" "uuid-4"
(integer) 1
127.0.0.1:6379> pfadd 2016_03_05:unique:ids "uuid-4" "uuid-5" "uuid-6" "uuid-7"
(integer) 1
127.0.0.1:6379> pfmerge 2016_03_05_06:unique:ids 2016_03_05:unique:ids
2016_03_06:unique:ids
OK
127.0.0.1:6379> pfcount 2016_03_05_06:unique:ids
(integer) 7
HyperLogLog Memory footprint is very small , But there is an error rate , Developers are working on data
During structure selection, only the following two items need to be confirmed :
· Just to calculate the independent total , No need to get a single piece of data .
· Can tolerate a certain error rate , After all HyperLogLog It has a great advantage in the occupation of memory
potential .
边栏推荐
- Analysis of the self increasing and self decreasing of C language function parameters
- About stack area, heap area, global area, text constant area and program code area
- [the 2023 autumn recruitment of MIHA tour] open [the only exclusive internal push code of school recruitment eytuc]
- Sobel filter
- 攻防世界 MISC 进阶区 Ditf
- Why is Dameng data called the "first share" of domestic databases?
- 测试必会:BUG的分类及推进解决
- The Sandbox 和数字好莱坞达成合作,通过人力资源开发加速创作者经济的发展
- 繁华落尽、物是人非:个人站长该何去何从
- POM in idea XML dependency cannot be imported
猜你喜欢

UML diagram memory skills

新版判断PC和手机端代码,手机端跳转手机端,PC跳转PC端最新有效代码

The overview and definition of clusters can be seen at a glance

Google Earth engine (GEE) - tasks upgrade enables run all to download all images in task types with one click

攻防世界 misc 高手进阶区 a_good_idea

共创软硬件协同生态:Graphcore IPU与百度飞桨的“联合提交”亮相MLPerf

The Sandbox 和数字好莱坞达成合作,通过人力资源开发加速创作者经济的发展

Google Earth engine (GEE) - globfire daily fire data set based on mcd64a1
![[machine learning] handwritten digit recognition](/img/26/cabdc5c92035181d82f6f809e6df0f.png)
[machine learning] handwritten digit recognition

Logo Camp d'entraînement section 3 techniques créatives initiales
随机推荐
High school physics: linear motion
About stack area, heap area, global area, text constant area and program code area
Redis入门完整教程:Bitmaps
Google Earth engine (GEE) - tasks upgrade enables run all to download all images in task types with one click
【lua】int64的支持
Logo special training camp Section IV importance of font design
Business is too busy. Is there really no reason to have time for automation?
Redis introduction complete tutorial: slow query analysis
PMO: compare the sample efficiency of 25 molecular optimization methods
The overview and definition of clusters can be seen at a glance
Redis入门完整教程:客户端通信协议
Lost in the lock world of MySQL
Attack and defense world misc advanced area Hong
Taobao commodity review API interface (item_review get Taobao commodity review API interface), tmall commodity review API interface
质量体系建设之路的分分合合
Hit the core in the advanced area of misc in the attack and defense world
Unity Xiuxian mobile game | Lua dynamic sliding function (specific implementation of three source codes)
攻防世界 MISC 进阶区 Ditf
Why is Dameng data called the "first share" of domestic databases?
Test will: bug classification and promotion solution