Spark accumulators and broadcast variables
2022-06-24 07:00:00 【Angryshark_ one hundred and twenty-eight】
Accumulators
An accumulator is somewhat like a Redis counter, but more powerful: besides counting, it can also accumulate sums, merge elements into a collection, and so on.
Suppose we have a text file word.txt and want to count the lines that contain the word "sheep". We can read the file, filter it, and count the result directly.
sc.textFile("word.txt").filter(_.contains("sheep")).count()
Now suppose we want separate counts of the lines containing "sheep" and the lines containing "wolf". With the approach above, the file has to be scanned twice:
sc.textFile("word.txt").filter(_.contains("sheep")).count()
sc.textFile("word.txt").filter(_.contains("wolf")).count()
If we wanted separate counts for 100 different words, the file would have to be scanned 100 times.
With accumulators, a single pass over the file is enough:
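// Spark 2.0+ API; the older sc.accumulator(0) with += is deprecated since 2.0
val count1 = sc.longAccumulator("sheep")
val count2 = sc.longAccumulator("wolf")
...
def processLine(line: String): Unit = {
  if (line.contains("sheep")) {
    count1.add(1)
  }
  if (line.contains("wolf")) {
    count2.add(1)
  }
  ...
}
sc.textFile("word.txt").foreach(processLine(_))
// Read the totals on the driver once the action has finished
println(s"sheep: ${count1.value}, wolf: ${count2.value}")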
Accumulators are not limited to Int: Long, Double, and collection values can be accumulated as well, and you can define your own accumulator types. A named accumulator can also be inspected in the Spark Web UI.
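As a minimal sketch of the built-in accumulator types mentioned above, the snippet below uses `longAccumulator`, `doubleAccumulator`, and `collectionAccumulator` from the Spark 2.0+ `SparkContext` API. The object name, the accumulator names, the `local[*]` master, and the in-memory stand-in for word.txt are all assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BuiltInAccumulators {
  // Returns (number of lines, total characters, number of lines containing "sheep")
  def run(sc: SparkContext): (Long, Double, Int) = {
    // Giving an accumulator a name makes it visible in the Web UI
    val lineCount = sc.longAccumulator("lineCount")
    val totalLength = sc.doubleAccumulator("totalLength")
    val sheepLines = sc.collectionAccumulator[String]("sheepLines")

    // Hypothetical in-memory stand-in for word.txt
    val lines = sc.parallelize(Seq("sheep and wolf", "only wolf", "sheep again"))

    // Executors only call add(); they never read the value
    lines.foreach { line =>
      lineCount.add(1)
      totalLength.add(line.length.toDouble)
      if (line.contains("sheep")) sheepLines.add(line)
    }

    // Values are read on the driver after the action has completed
    (lineCount.sum, totalLength.sum, sheepLines.value.size)
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a self-contained run
    val conf = new SparkConf().setAppName("builtin-accumulators").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

Note that the collection accumulator gathers elements in no guaranteed order, so it is safer to rely on its size or contents than on element positions.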
Note: an accumulator can only be defined and read on the Driver. Executors can only add to it; they cannot read its value.
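A custom accumulator can be sketched by extending `AccumulatorV2` and registering it with `sc.register`. The class below is a hypothetical example, not something from the original post: `KeywordCountAccumulator` counts, per keyword, how many lines contain it, so one accumulator replaces a separate counter per word. It is created and read on the Driver and only updated on the Executors.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

// Hypothetical custom accumulator: input is a line, output is keyword -> line count
class KeywordCountAccumulator(keywords: Seq[String])
    extends AccumulatorV2[String, Map[String, Long]] {

  private val counts = mutable.Map.empty[String, Long]

  override def isZero: Boolean = counts.isEmpty

  override def copy(): KeywordCountAccumulator = {
    val acc = new KeywordCountAccumulator(keywords)
    acc.counts ++= counts
    acc
  }

  override def reset(): Unit = counts.clear()

  // Called inside tasks: bump every keyword that the line contains
  override def add(line: String): Unit =
    keywords.foreach { k =>
      if (line.contains(k)) counts(k) = counts.getOrElse(k, 0L) + 1L
    }

  // Called on the driver to fold a task's partial result into this one
  override def merge(other: AccumulatorV2[String, Map[String, Long]]): Unit =
    other.value.foreach { case (k, n) =>
      counts(k) = counts.getOrElse(k, 0L) + n
    }

  override def value: Map[String, Long] = counts.toMap
}

object CustomAccumulatorExample {
  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a self-contained run
    val conf = new SparkConf().setAppName("custom-accumulator").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val acc = new KeywordCountAccumulator(Seq("sheep", "wolf"))
      sc.register(acc, "keywordCounts") // must be registered before use in a job

      // Hypothetical in-memory stand-in for word.txt
      sc.parallelize(Seq("sheep and wolf", "only wolf", "sheep again"))
        .foreach(line => acc.add(line))

      println(acc.value) // read on the driver only
    } finally sc.stop()
  }
}
```

Because updates made inside transformations can be re-applied when a task is retried, accumulators are most reliable when updated inside an action such as `foreach`, as done here.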
Broadcast variables
A broadcast variable lets you cache a read-only variable on each machine (worker) instead of shipping a copy with every task. This makes it possible to give every node a copy of a large input dataset in a more efficient way.
Broadcast variables make data sharing more efficient in two ways:
(1) Each node (physical machine) in the cluster holds only one copy (in practice, one per executor), whereas a variable captured in a closure is copied once per task by default;
(2) Broadcast data is transferred BitTorrent-style, that is, peer-to-peer, which can greatly speed up distribution when the cluster has many nodes. If a broadcast variable is modified on one node, the change is not propagated to the other nodes.
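// Broadcast a plain local collection, not an RDD
val list = (0 to 10).toList
val brdList = sc.broadcast(list)
// Assumes every line of test.txt holds a single integer
sc.textFile("test.txt").filter(line => brdList.value.contains(line.trim.toInt)).foreach(println)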
Points to note when using broadcast variables:
(1) They are meant for distributing relatively small variables. For a variable that easily reaches tens of MB, sending a copy with every task instead would consume memory and waste time.
(2) A broadcast variable can only be defined on the Driver and read on the Executors; Executors must not modify it.
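The two points above can be sketched with a small lookup table that is built on the Driver, broadcast once, and only read inside tasks. The object name, the sample data, and the `local[*]` master are assumptions made for illustration; `sc.broadcast`, `value`, and `destroy` are the standard Spark calls.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastLookup {
  // Tags each word with a category looked up from a broadcast map
  def run(sc: SparkContext): Array[(String, String)] = {
    // Hypothetical small lookup table, defined on the driver
    val animalKind = Map("sheep" -> "prey", "wolf" -> "predator")
    val kindBc = sc.broadcast(animalKind)

    val words = sc.parallelize(Seq("sheep", "wolf", "grass"))

    // Tasks only read kindBc.value; they never modify it
    val tagged = words
      .map(w => (w, kindBc.value.getOrElse(w, "unknown")))
      .collect()

    // Release the broadcast data once no further job needs it
    kindBc.destroy()
    tagged
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a self-contained run
    val conf = new SparkConf().setAppName("broadcast-lookup").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try run(sc).foreach(println) finally sc.stop()
  }
}
```

Referencing `animalKind` directly inside `map` would also work, but the map would then be serialized into the closure of every task rather than fetched once and reused.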