Understanding the Spark operator aggregateByKey
2022-07-28 06:18:00 【hzp666】

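Example
The aggregateByKey operator is essentially a map + reduce aggregation applied separately to the values of each key.
Here is a simple piece of code taken from a production environment.
Some cleaned-up log fields were processed into an RDD of type (String, (String, String)), built here from a List, where the strings stand for (user name, (access time, page URL)).
The same user may visit different pages, or the same page, at different times. To merge all visits of the same user, the following code uses aggregateByKey.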
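// Sample records: (user, (access date, URL)); sc is the SparkContext, as in spark-shell
val data = sc.parallelize(
  List(
    ("13909029812", ("20170507", "http://www.baidu.com")),
    ("18089376778", ("20170401", "http://www.google.com")),
    ("18089376778", ("20170508", "http://www.taobao.com")),
    ("13909029812", ("20170507", "http://www.51cto.com"))
  )
)
// Zero value: an empty mutable set; 200 is the number of partitions of the result
data.aggregateByKey(scala.collection.mutable.Set[(String, String)](), 200)(
  (set, item) => set += item,      // add one value of a key to that key's set
  (set1, set2) => set1 union set2  // merge two partial sets of the same key
).mapValues(x => x.toIterable).collect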
Result:
res12: Array[(String, Iterable[(String, String)])] = Array((18089376778,Set((20170401,http://www.google.com), (20170508,http://www.taobao.com))), (13909029812,Set((20170507,http://www.51cto.com), (20170507,http://www.baidu.com))))
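The grouping shown in this result can be cross-checked without a cluster. The following is only a sketch on plain Scala collections, not Spark code: the object name LocalGroupingCheck and the helper visitsByUser are made up for this illustration, and the four records are the ones from the example above.

```scala
object LocalGroupingCheck {
  // The same four records as in the example, held in a plain List (no Spark needed)
  val records: List[(String, (String, String))] = List(
    ("13909029812", ("20170507", "http://www.baidu.com")),
    ("18089376778", ("20170401", "http://www.google.com")),
    ("18089376778", ("20170508", "http://www.taobao.com")),
    ("13909029812", ("20170507", "http://www.51cto.com"))
  )

  // Hypothetical helper: group by user, then keep the distinct (date, url) pairs per user
  def visitsByUser(rs: List[(String, (String, String))]): Map[String, Set[(String, String)]] =
    rs.groupBy(_._1).map { case (user, pairs) => user -> pairs.map(_._2).toSet }

  def main(args: Array[String]): Unit =
    visitsByUser(records).foreach { case (user, visits) => println(s"$user -> $visits") }
}
```

Each user should map to a set of two (date, URL) pairs, matching the Spark output above up to ordering.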
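Breakdown:
aggregateByKey(arg1)(arg2, arg3)
Process: for a given key in data, arg1 is the initial value. The function in arg2 folds each value of that key into an accumulator that starts from the initial value, and the accumulators produced this way are then reduced by the function in arg3. (The 200 passed next to arg1 in the example is the number of partitions of the result.)
- Argument 1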
scala.collection.mutable.Set[(String, String)]()
Creates an empty set to serve as the initial value.
- Argument 2
(set, item) => {
set += item
}
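A map-like function: each value of the key (in this example, (access time, page URL)) is passed in as item, added to the set, and the set is returned.
So all values of a key within one partition end up collected in a single set.
- Argument 3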
(set1, set2) => set1 union set2
The sets built for this key are reduced with union, and the merged set is returned.
Final result: for each user, the pages visited across all access times.
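To make the two steps concrete, here is a minimal sketch that imitates them on plain Scala collections. It is not Spark's implementation: aggregateByKeyLocal, the object name, and the split of the four records into two "partitions" are assumptions made for illustration. The initial value is passed by name so that each key in each partition starts from a fresh set, which roughly mirrors how Spark hands out a separate copy of the zero value.

```scala
import scala.collection.mutable

object AggregateByKeySketch {
  // Hypothetical local stand-in for aggregateByKey; `partitions` plays the role of RDD partitions.
  // `zero` is by-name, so every (partition, key) pair starts from a fresh accumulator.
  def aggregateByKeyLocal[K, V, U](partitions: Seq[Seq[(K, V)]])(zero: => U)(
      seqOp: (U, V) => U, combOp: (U, U) => U): Map[K, U] = {
    // Step 1 (arg2): inside each partition, fold the values of each key into an accumulator
    val partials: Seq[Map[K, U]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zero), v))
      }
    }
    // Step 2 (arg3): merge the per-partition accumulators of the same key
    partials.flatten.foldLeft(Map.empty[K, U]) { case (acc, (k, u)) =>
      acc.updated(k, acc.get(k).map(combOp(_, u)).getOrElse(u))
    }
  }

  // Assumed layout: the records of each user are spread over two partitions
  val partitions: Seq[Seq[(String, (String, String))]] = Seq(
    Seq(("13909029812", ("20170507", "http://www.baidu.com")),
        ("18089376778", ("20170401", "http://www.google.com"))),
    Seq(("18089376778", ("20170508", "http://www.taobao.com")),
        ("13909029812", ("20170507", "http://www.51cto.com")))
  )

  // Same three arguments as in the article: empty set, add item, union
  def visits: Map[String, mutable.Set[(String, String)]] =
    aggregateByKeyLocal(partitions)(mutable.Set.empty[(String, String)])(
      (set, item) => set += item,
      (set1, set2) => set1 union set2
    )

  def main(args: Array[String]): Unit =
    visits.foreach { case (user, set) => println(s"$user -> $set") }
}
```

With this layout, step 1 produces one single-element set per user in each partition, and step 2 is what brings the two sets of a user together.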
Original article: https://www.jianshu.com/p/09912beb1350