Spark Accumulators and Broadcast Variables
2022-06-24 06:39:00 【Angryshark_128】
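Accumulators
Accumulators are somewhat like Redis counters but more capable: besides counting, they can accumulate sums, merge elements into a collection, and so on.
Suppose we have a text file word.txt and want to count the lines that contain the word "sheep". We can read the file, filter it, and count: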
sc.textFile("word.txt").filter(_.contains("sheep")).count()
Now suppose we want separate line counts for "sheep" and "wolf". With the approach above, the file has to be scanned twice:
sc.textFile("word.txt").filter(_.contains("sheep")).count()
sc.textFile("word.txt").filter(_.contains("wolf")).count()
Counting lines for 100 different words would take 100 passes.
With accumulators, a single pass over the file is enough:
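// sc.longAccumulator is the Spark 2.0+ API; the older sc.accumulator(0) with += is deprecated.
val count1 = sc.longAccumulator("sheep")
val count2 = sc.longAccumulator("wolf")
...
def processLine(line: String): Unit = {
  if (line.contains("sheep")) {
    count1.add(1)
  }
  if (line.contains("wolf")) {
    count2.add(1)
  }
  ...
}
sc.textFile("word.txt").foreach(processLine(_))
// Read the totals on the driver once the action has finished.
println(s"sheep: ${count1.value}, wolf: ${count2.value}")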
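Accumulation is not limited to Int: Long, Double, and collection accumulators are available, and custom accumulators can be defined as well. A named accumulator is also visible in the Spark web UI.
Note: an accumulator can only be defined and read on the driver. Executors can add to it but cannot read its value.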
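As a minimal sketch of the built-in accumulator types mentioned above, the following uses the Spark 2.0+ methods `longAccumulator`, `doubleAccumulator`, and `collectionAccumulator`. The object name, the accumulator names, the `local[2]` master, and the in-memory input standing in for word.txt are all assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.JavaConverters._

object AccumulatorTypesSketch {
  // Returns (lines containing "sheep", total characters, lines containing "wolf").
  def run(sc: SparkContext): (Long, Double, List[String]) = {
    // Giving an accumulator a name is what makes it appear in the web UI.
    val sheepLines = sc.longAccumulator("sheepLines")
    val totalChars = sc.doubleAccumulator("totalChars")
    val wolfLines  = sc.collectionAccumulator[String]("wolfLines")

    // Hypothetical in-memory input standing in for word.txt.
    val lines = sc.parallelize(Seq("a sheep", "a wolf", "sheep and wolf", "grass"))

    // Tasks running on executors only add to the accumulators.
    lines.foreach { line =>
      if (line.contains("sheep")) sheepLines.add(1)
      if (line.contains("wolf")) wolfLines.add(line)
      totalChars.add(line.length.toDouble)
    }

    // Values are read on the driver, after the action has finished.
    // The collection is sorted because the merge order across tasks is not fixed.
    (sheepLines.value.longValue, totalChars.value.doubleValue,
      wolfLines.value.asScala.toList.sorted)
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch can run without a cluster.
    val conf = new SparkConf().setAppName("accumulator-types-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val (sheep, chars, wolves) = run(sc)
    println(s"sheep lines: $sheep, total chars: $chars, wolf lines: ${wolves.mkString("; ")}")
    sc.stop()
  }
}
```

Because the updates happen inside `foreach`, which is an action, each task's contribution is applied once. Updates made inside a transformation such as `map` may be applied more than once if a task is re-executed.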
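Broadcast Variables
A broadcast variable caches a read-only variable on every machine (worker) instead of keeping a copy per task. It is a more efficient way to hand each node a copy of a large input dataset.
Broadcast variables make data sharing more efficient in two ways:
(1) Each node (physical machine) in the cluster holds a single copy, whereas an ordinary closure ships one copy per task;
(2) Broadcast data is transferred in BitTorrent style, that is, peer to peer, which can greatly speed up distribution when the cluster has many nodes. A change made to a broadcast variable is not propagated to other nodes.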
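// Broadcast a plain local collection, not an RDD.
val list = (0 to 10).toList
val brdList = sc.broadcast(list)
// Keep only the lines whose integer value appears in the broadcast list.
sc.textFile("test.txt").filter(line => brdList.value.contains(line.toInt)).foreach(println)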
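Points to note when using them:
(1) Broadcasting is meant for read-only data small enough to fit on every node. For a variable that easily reaches tens of megabytes, sending a copy with every task costs memory and wastes time, which is the overhead broadcasting avoids.
(2) A broadcast variable can only be defined on the driver and read on executors. Executors cannot modify it.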
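The two points above can be sketched as follows: the value is defined once on the driver, read inside a task through `.value`, and never modified by executors. The object and method names, the `local[2]` master, and the in-memory input standing in for test.txt are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookupSketch {
  // Keeps only the lines whose integer value is in the broadcast set.
  def keepAllowed(sc: SparkContext, lines: Seq[String], allowed: Set[Int]): Array[String] = {
    // Defined once on the driver; tasks read it through .value.
    val allowedBc = sc.broadcast(allowed)
    val kept = sc.parallelize(lines)
      // Executors only read the broadcast value; they never modify it.
      .filter(line => allowedBc.value.contains(line.trim.toInt))
      .collect()
    // Release the cached copies on the executors once the job is done.
    allowedBc.unpersist()
    kept
  }

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch can run without a cluster.
    val conf = new SparkConf().setAppName("broadcast-lookup-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    // Hypothetical input standing in for test.txt, one integer per line.
    val kept = keepAllowed(sc, Seq("3", "42", "7", "11"), (0 to 10).toSet)
    kept.foreach(println)
    sc.stop()
  }
}
```

A `Set` is used here instead of a `List` so that each membership check is a hash lookup rather than a linear scan. This is a design choice of the sketch, not something the original example requires.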