当前位置:网站首页>记一次Spark foreachPartition导致OOM
记一次Spark foreachPartition导致OOM
2022-07-25 15:10:00 【南风知我意丿】
问题描述
spark streaming 程序线上报错日志如下:
org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$blockifyObject$1(TorrentBroadcast.scala:306)
at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$blockifyObject$1$adapted(TorrentBroadcast.scala:306)
at org.apache.spark.broadcast.TorrentBroadcast$$$Lambda$2411/66155661.apply(Unknown Source)
at org.apache.spark.util.io.ChunkedByteBufferOutputStream.toChunkedByteBuffer(ChunkedByteBufferOutputStream.scala:114)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:315)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:137)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:91)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:35)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:77)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1479)
at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1223)
at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1118)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1061)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2196)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
原因分析:
根据日志信息 定位到代码里的
dataFrame.foreachPartition
1.foreachPartition 介绍

2.用了foreachPartition算子之后,好处在哪里?
1、对于我们写的function函数,就调用一次,一次传入一个partition所有的数据
2、主要创建或者获取一个数据库连接就可以
3、只要向数据库发送一次SQL语句和多组参数即可
在实际生产环境中,清一色,都是使用foreachPartition操作;但是有个问题,跟mapPartitions操作一样,如果一个partition的数量真的特别特别大,比如真的是100万,那基本上就不太靠谱了。
一下子进来,很有可能会发生OOM,内存溢出的问题。
解决方案:
1.程序加内存
边栏推荐
- dpdk 收发包问题案例:使用不匹配的收发包函数触发的不收包问题定位
- ESXI6.7.0 升级到7.0U3f(2022年7月12 更新)
- 35 quick format code
- Detailed explanation of lio-sam operation process and code
- oracle_ 12505 error resolution
- Solve the error caused by too large file when uploading file by asp.net
- Sublimetext-win10 cursor following problem
- 37 element mode (inline element, block element, inline block element)
- SublimeText-win10光标跟随问题
- Bridge NF call ip6tables is an unknown key exception handling
猜你喜欢
随机推荐
Client error: invalid param endpoint is blank
[C topic] Li Kou 88. merge two ordered arrays
pkg_resources动态加载插件
什么是物联网
Award winning interaction | 7.19 database upgrade plan practical Summit: industry leaders gather, why do they come?
Visual Studio 2022 查看类关系图
Universal smart JS form verification
LeetCode_因式分解_简单_263.丑数
Scala110-combineByKey
Vs2010添加wap移动窗体模板
41 picture background synthesis - colorful navigation map
golang复习总结
基于OpenCV和YOLOv3的目标检测实例应用
41 图片背景综合-五彩导航图
简易轮播图和打地鼠
Object.prototype.hasOwnProperty() 和 in
任务、微任务、队列和调度(动画展示每一步调用)
如何解决Visual Studio中scanf编译报错的问题
打开虚拟机时出现VMware Workstation 未能启动 VMware Authorization Service
(original) customize a scrolling recyclerview









