当前位置:网站首页>记一次Spark foreachPartition导致OOM
记一次Spark foreachPartition导致OOM
2022-07-25 15:10:00 【南风知我意丿】
问题描述
spark streaming 程序线上报错日志如下:
org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$blockifyObject$1(TorrentBroadcast.scala:306)
at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$blockifyObject$1$adapted(TorrentBroadcast.scala:306)
at org.apache.spark.broadcast.TorrentBroadcast$$$Lambda$2411/66155661.apply(Unknown Source)
at org.apache.spark.util.io.ChunkedByteBufferOutputStream.toChunkedByteBuffer(ChunkedByteBufferOutputStream.scala:114)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:315)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:137)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:91)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:35)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:77)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1479)
at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1223)
at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1118)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1061)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2196)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
原因分析:
根据日志信息 定位到代码里的
dataFrame.foreachPartition
1.foreachPartition 介绍

2.用了foreachPartition算子之后,好处在哪里?
1、对于我们写的function函数,就调用一次,一次传入一个partition所有的数据
2、主要创建或者获取一个数据库连接就可以
3、只要向数据库发送一次SQL语句和多组参数即可
在实际生产环境中,清一色,都是使用foreachPartition操作;但是有个问题,跟mapPartitions操作一样,如果一个partition的数量真的特别特别大,比如真的是100万,那基本上就不太靠谱了。
一下子进来,很有可能会发生OOM,内存溢出的问题。
解决方案:
1.程序加内存
边栏推荐
- Scala111-map、flatten、flatMap
- System. Accessviolationexception: an attempt was made to read or write to protected memory. This usually indicates that other memory is corrupted
- 一个程序最多可以使用多少内存?
- C语言函数复习(传值传址【二分查找】,递归【阶乘,汉诺塔等】)
- 继承的实现过程及ES5和ES6实现的区别
- 图片的懒加载
- pkg_ Resources dynamic loading plug-in
- TypeScript学习2——接口
- 简易轮播图和打地鼠
- 【JS高级】js之正则相关函数以及正则对象_02
猜你喜欢
随机推荐
Leetcode combination sum + pruning
ESXI6.7.0 升级到7.0U3f(2022年7月12 更新)
Add the jar package under lib directory to the project in idea
C#,C/S升级更新
js URLEncode函数
树莓派入门:树莓派的初始设置
"Ask every day" briefly talk about JMM / talk about your understanding of JMM
Spark002 --- spark task submission, pass JSON as a parameter
Splice a field of the list set into a single string
Scala110-combineByKey
EDA chip design solution based on AMD epyc server
处理ORACLE死锁
【JS高级】js之正则相关函数以及正则对象_02
PHP 通过原生CURL实现非阻塞(并发)请求模式
Scala111-map、flatten、flatMap
Share a department design method that avoids recursion
图片的懒加载
安装EntityFramework方法
《三子棋》C语言数组应用 --n皇后问题雏形
我的创作纪念日









