当前位置:网站首页>spark operator - map vs mapPartitions operator
spark operator - map vs mapPartitions operator
2022-08-05 06:11:00 【zdaiqing】
map vs mapPartitions
1.源码
1.1.map算子源码
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}
1.2.mapPartitions算子源码
def mapPartitions[U: ClassTag](
f: Iterator[T] => Iterator[U],
preservesPartitioning: Boolean = false): RDD[U] = withScope {
val cleanedF = sc.clean(f)
new MapPartitionsRDD(
this,
(context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
preservesPartitioning)
}
1.3.对比
相似
map算子和mapPartitionsThe bottom layer of the operator is the componentMapPartitionsRDD
区别
functional aspects
mapThe function of the function passed in by the operator,Is to process one element and return another element;
mapPartitionsThe function of the function passed in by the operator,is an iterator(一批元素)Returns another iterator after processing(一批元素);
function execution
map算子中,An iterator contains all the elements that the operator needs to process,有多少个元素,The incoming function is executed as many times as possible;
mapPartitions算子中,An iterator contains all elements in a partition,The function processes the data one iterator at a time,That is, one partition calls the function once;
1.4.Validation of execution times
代码
val lineSeq = Seq(
"hello me you her",
"hello you her",
"hello her",
"hello"
)
val rdd = sc.parallelize(lineSeq, 2)
.flatMap(_.split(" "))
println("===========mapPartitions.start==============")
rdd.mapPartitions(iter => {
println("mp+1")
iter.map(x =>
(x, 1)
)
}).collect()
println("===========map.start==============")
rdd.map(x => {
println("mp+2")
(x, 1)
}).collect()
执行结果
2.特点
map算子
有多少元素,How many times the function is executed
mapPartitions算子
有多少分区,How many times the function is executed
The function parameter is an iterator,返回的也是一个迭代器
3.使用场景分析
When performing simple element representation transformation operations,建议使用map算子,避免使用mapPartitions算子:
mapPartitionsThe function needs to return an iterator,When dealing with transformation operations on simple element representations,An intermediate cache is required to store the processing results,It is then converted to an iterator cache;这个情况下,The intermediate cache is stored in memory,If there are more elements to be processed in the iterator,容易引起OOM;
In scenarios of resource initialization overhead and batch processing in the case of large datasets,建议使用mapPartitions算子:
基于sparkCharacteristics of distributed execution operators,Each partition requires a separate resource initialization;mapPartitionsThe advantage of executing a function only once for a partition can realize that a partition needs only one resource initialization(eg:Scenarios that require database linking);
4.参考资料
Spark系列——关于 mapPartitions的误区
Spark—算子调优之MapPartitions提升Map类操作性能
Learning Spark——Spark连接Mysql、mapPartitions高效连接HBase
mapPartition
边栏推荐
- The spark operator - coalesce operator
- 【机器学习】1单变量线性回归
- Call the TensorFlow Objection Detection API for object detection and save the detection results locally
- [Paper Intensive Reading] Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation (R-CNN)
- Getting Started Documentation 12 webserve + Hot Updates
- 添加新硬盘为什么扫描不上?如何解决?
- 交换机原理
- Why can't I add a new hard disk to scan?How to solve?
- 网站ICP备案是什么呢?
- IJCAI 2022|Boundary-Guided Camouflage Object Detection Model BGNet
猜你喜欢

dsf5.0 弹框点确定没有返回值的问题

错误类型:reflection.ReflectionException: Could not set property ‘xxx‘ of ‘class ‘xxx‘ with value ‘xxx‘

Unity物理引擎中的碰撞、角色控制器、Cloth组件(布料)、关节 Joint

Getting Started Doc 08 Conditional Plugins

入门文档09 独立的watch

spark源码-任务提交流程之-1-sparkSubmit

Contextual non-local alignment of full-scale representations

Getting Started Document 01 series in order

入门文档11 自动添加版本号

NIO工作方式浅析
随机推荐
D41_缓冲池
什么?CDN缓存加速只适用于加速静态内容?
传输层协议(TCP3次握手)
spark源码-任务提交流程之-4-container中启动executor
【Day8】RAID Disk Array
VRRP principle and command
【机器学习】1单变量线性回归
D39_向量
spark算子-wholeTextFiles算子
虚幻引擎5都有哪些重要新功能?
spark operator-wholeTextFiles operator
Spark source code - task submission process - 4-container to start executor
硬盘分区和永久挂载
成功的独立开发者应对失败&冒名顶替综
Hard Disk Partitioning and Permanent Mounting
网络布线与数制转换
URP渲染管线实战教程系列 之URP渲染管线实战解密(一)
入门文档10 资源映射
dsf5.0 弹框点确定没有返回值的问题
2020年手机上最好的25种免费游戏