
A Case of OOM Caused by Spark foreachPartition

2022-07-25 15:10:00 【南风知我意丿】

Contents

Problem description
Cause analysis:
    - 1. Introduction to foreachPartition
    - 2. What are the benefits of using the foreachPartition operator?
Solution:

Problem description

A Spark Streaming job failed in production.
The log was as follows:

org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
	at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
	at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$blockifyObject$1(TorrentBroadcast.scala:306)
	at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$blockifyObject$1$adapted(TorrentBroadcast.scala:306)
	at org.apache.spark.broadcast.TorrentBroadcast$$$Lambda$2411/66155661.apply(Unknown Source)
	at org.apache.spark.util.io.ChunkedByteBufferOutputStream.toChunkedByteBuffer(ChunkedByteBufferOutputStream.scala:114)
	at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:315)
	at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:137)
	at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:91)
	at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:35)
	at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:77)
	at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1479)
	at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1223)
	at org.apache.spark.scheduler.DAGScheduler.submitStage(DAGScheduler.scala:1118)
	at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:1061)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2196)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)

Cause analysis:

Based on the log, the failure was traced to a dataFrame.foreachPartition call in the code.

1. Introduction to foreachPartition

[Image omitted]

2. What are the benefits of using the foreachPartition operator?

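1. The function we write is invoked only once per partition, and that single call receives all of the partition's data.
2. Only one database connection has to be created or acquired per partition.
3. Only one SQL statement, together with multiple sets of parameters, has to be sent to the database.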
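The three points above can be sketched in plain Python. This is only an illustration under stated assumptions: `write_partition`, the `events` table and `db_path` are hypothetical names, and the standard-library `sqlite3` module stands in for whatever database the real job writes to. The function has the shape of a per-partition handler: it takes an iterator of rows, opens one connection, and sends one INSERT statement with many parameter sets.

```python
import sqlite3
from typing import Iterable, Tuple


def write_partition(rows: Iterable[Tuple[int, str]], db_path: str) -> None:
    """Handle one partition: one call, one connection, one batched INSERT.

    Hypothetical helper; sqlite3 is only a stand-in for the real database.
    """
    conn = sqlite3.connect(db_path)  # opened once for the whole partition
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS events (id INTEGER, name TEXT)"
        )
        # One SQL statement, many parameter sets.
        conn.executemany("INSERT INTO events (id, name) VALUES (?, ?)", rows)
        conn.commit()
    finally:
        conn.close()  # released once, after the whole partition is written
```

In a Spark job, a function of this shape would be what gets passed to foreachPartition; here it can be exercised with any ordinary iterator of tuples.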
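In real production environments, foreachPartition is used almost without exception. There is a catch, though, and it is the same one that affects mapPartitions: if a partition holds a very large number of records, say 1,000,000, this approach becomes unreliable.
With that much data arriving in one go, an OOM (out-of-memory) error is quite likely.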
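One common way to limit this risk is to consume the partition iterator in fixed-size batches instead of collecting every row first. The sketch below is plain Python and makes several assumptions: `batched`, `process_partition`, `flush` and the batch size of 1000 are hypothetical names and values, not taken from the original job. Whether this removes the OOM in a given job depends on where the memory is actually being held.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most batch_size items without materializing rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def process_partition(
    rows: Iterable[T],
    flush: Callable[[List[T]], None],
    batch_size: int = 1000,  # assumed value; tune for the real workload
) -> int:
    """Send a partition to flush() batch by batch; return the row count."""
    total = 0
    for batch in batched(rows, batch_size):
        flush(batch)  # e.g. one batched INSERT per call (hypothetical sink)
        total += len(batch)
    return total
```

With this shape, at most `batch_size` rows from the partition are held in a list at any moment, so peak memory for the write path depends on the batch size rather than on the partition size.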