Troubleshooting notes: an AsyncEventQueue "Dropping event from queue shared" error that led to an OOM
2022-07-29 02:24:00 【The south wind knows what I mean】
Problem description
Log:
2022-07-23 01:03:40 ERROR scheduler.AsyncEventQueue: Dropping event from queue shared. This likely means one of the listeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
2022-07-23 01:03:40 WARN scheduler.AsyncEventQueue: Dropped 1 events from shared since the application started.
2022-07-23 01:04:41 WARN scheduler.AsyncEventQueue: Dropped 2335 events from shared since Sat Jul 23 01:03:40 CST 2022.
2022-07-23 01:05:42 WARN scheduler.AsyncEventQueue: Dropped 2252 events from shared since Sat Jul 23 01:04:41 CST 2022.
2022-07-23 01:06:42 WARN scheduler.AsyncEventQueue: Dropped 1658 events from shared since Sat Jul 23 01:05:42 CST 2022.
2022-07-23 01:07:42 WARN scheduler.AsyncEventQueue: Dropped 1405 events from shared since Sat Jul 23 01:06:42 CST 2022.
2022-07-23 01:08:43 WARN scheduler.AsyncEventQueue: Dropped 1651 events from shared since Sat Jul 23 01:07:42 CST 2022.
2022-07-23 01:09:43 WARN scheduler.AsyncEventQueue: Dropped 1983 events from shared since Sat Jul 23 01:08:43 CST 2022.
2022-07-23 01:10:43 WARN scheduler.AsyncEventQueue: Dropped 1680 events from shared since Sat Jul 23 01:09:43 CST 2022.
2022-07-23 01:11:43 WARN scheduler.AsyncEventQueue: Dropped 1643 events from shared since Sat Jul 23 01:10:43 CST 2022.
2022-07-23 01:12:44 WARN scheduler.AsyncEventQueue: Dropped 1959 events from shared since Sat Jul 23 01:11:43 CST 2022.
2022-07-23 01:13:45 WARN scheduler.AsyncEventQueue: Dropped 2315 events from shared since Sat Jul 23 01:12:44 CST 2022.
2022-07-23 01:14:47 WARN scheduler.AsyncEventQueue: Dropped 2473 events from shared since Sat Jul 23 01:13:45 CST 2022.
2022-07-23 01:15:47 WARN scheduler.AsyncEventQueue: Dropped 1962 events from shared since Sat Jul 23 01:14:47 CST 2022.
2022-07-23 01:16:48 WARN scheduler.AsyncEventQueue: Dropped 1645 events from shared since Sat Jul 23 01:15:47 CST 2022.
2022-07-23 01:17:48 WARN scheduler.AsyncEventQueue: Dropped 1885 events from shared since Sat Jul 23 01:16:48 CST 2022.
2022-07-23 01:18:48 WARN scheduler.AsyncEventQueue: Dropped 2391 events from shared since Sat Jul 23 01:17:48 CST 2022.
2022-07-23 01:19:48 WARN scheduler.AsyncEventQueue: Dropped 1501 events from shared since Sat Jul 23 01:18:48 CST 2022.
2022-07-23 01:20:49 WARN scheduler.AsyncEventQueue: Dropped 1733 events from shared since Sat Jul 23 01:19:48 CST 2022.
2022-07-23 01:21:49 WARN scheduler.AsyncEventQueue: Dropped 1867 events from shared since Sat Jul 23 01:20:49 CST 2022.
2022-07-23 01:22:50 WARN scheduler.AsyncEventQueue: Dropped 1561 events from shared since Sat Jul 23 01:21:49 CST 2022.
2022-07-23 01:23:51 WARN scheduler.AsyncEventQueue: Dropped 1364 events from shared since Sat Jul 23 01:22:50 CST 2022.
2022-07-23 01:24:52 WARN scheduler.AsyncEventQueue: Dropped 1579 events from shared since Sat Jul 23 01:23:51 CST 2022.
2022-07-23 01:25:52 WARN scheduler.AsyncEventQueue: Dropped 1847 events from shared since Sat Jul 23 01:24:52 CST 2022.
Exception in thread "streaming-job-executor-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.xbean.asm7.ClassReader.readLabel(ClassReader.java:2447)
at org.apache.xbean.asm7.ClassReader.createDebugLabel(ClassReader.java:2477)
at org.apache.xbean.asm7.ClassReader.readCode(ClassReader.java:1689)
at org.apache.xbean.asm7.ClassReader.readMethod(ClassReader.java:1284)
at org.apache.xbean.asm7.ClassReader.accept(ClassReader.java:688)
at org.apache.xbean.asm7.ClassReader.accept(ClassReader.java:400)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:359)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2362)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$1(RDD.scala:834)
at org.apache.spark.rdd.RDD$$Lambda$2785/604434085.apply(Unknown Source)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:833)
at org.apache.spark.sql.Dataset.rdd$lzycompute(Dataset.scala:3200)
at org.apache.spark.sql.Dataset.rdd(Dataset.scala:3198)
at cn.huorong.utils.PhoenixUtil$.jdbcBatchInsert(PhoenixUtil.scala:216)
at cn.huorong.run.SampleTaskSinkHbaseMapping_OfficialService.storePhoenix(SampleTaskSinkHbaseMapping_OfficialService.scala:94)
at cn.huorong.run.SampleTaskSinkHbaseMapping_OfficialService.$anonfun$sink$1(SampleTaskSinkHbaseMapping_OfficialService.scala:74)
at cn.huorong.run.SampleTaskSinkHbaseMapping_OfficialService.$anonfun$sink$1$adapted(SampleTaskSinkHbaseMapping_OfficialService.scala:37)
at cn.huorong.run.SampleTaskSinkHbaseMapping_OfficialService$$Lambda$1277/1357069303.apply(Unknown Source)
at org.apache.spark.streaming.dstream.DStream.$anonfun$foreachRDD$2(DStream.scala:629)
at org.apache.spark.streaming.dstream.DStream.$anonfun$foreachRDD$2$adapted(DStream.scala:629)
at org.apache.spark.streaming.dstream.DStream$$Lambda$1291/1167476357.apply(Unknown Source)
at org.apache.spark.streaming.dstream.ForEachDStream.$anonfun$generateJob$2(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$Lambda$1576/1966952151.apply$mcV$sp(Unknown Source)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:417)
at org.apache.spark.streaming.dstream.ForEachDStream.$anonfun$generateJob$1(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$Lambda$1563/607343052.apply$mcV$sp(Unknown Source)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
2022-07-23 02:01:34 WARN scheduler.AsyncEventQueue: Dropped 429 events from shared since Sat Jul 23 02:00:29 CST 2022.
Exception in thread "dispatcher-event-loop-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
at scala.runtime.ObjectRef.create(ObjectRef.java:24)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:86)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Cause analysis:
1. Main cause:
2022-07-23 01:03:40 WARN scheduler.AsyncEventQueue: Dropped 1 events from shared since the application started.
- All Spark job, stage, and task events are pushed onto the event queue.
- A backend listener reads the Spark UI events from this queue and renders the Spark UI.
- The event queue (spark.scheduler.listenerbus.eventqueue.capacity) has a default capacity of 10000.
If events are posted to the queue faster than the backend listeners can consume them, the queue fills up; once it is full, further events cannot be enqueued and are dropped, so the listeners never process them. These events are lost and do not appear in the Spark UI.
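In Spark's AsyncEventQueue, events are posted to a bounded java.util.concurrent.LinkedBlockingQueue, and an event counts as dropped when offer() returns false. A minimal Java sketch of that drop behavior (illustrative only, not Spark's actual code; capacity shrunk from 10000 to 3):

```java
import java.util.concurrent.LinkedBlockingQueue;

public class DropDemo {
    public static void main(String[] args) {
        // A bounded queue standing in for the "shared" event queue.
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>(3);
        int dropped = 0;
        for (int i = 0; i < 10; i++) {
            // offer() returns false instead of blocking when the queue is full;
            // the event is then simply counted as dropped, which is what the
            // "Dropped N events from shared" WARN lines report.
            if (!queue.offer("event-" + i)) {
                dropped++;
            }
        }
        System.out.println("queued=" + queue.size() + " dropped=" + dropped);
    }
}
```

With a slow (here: absent) consumer, every event beyond the capacity is silently lost, exactly as the log above shows.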
2. Source code analysis
// Resolves the event queue size.
// LISTENER_BUS_EVENT_QUEUE_PREFIX = "spark.scheduler.listenerbus.eventqueue"
// LISTENER_BUS_EVENT_QUEUE_CAPACITY = .createWithDefault(10000)
private[scheduler] def capacity: Int = {
  val queueSize = conf.getInt(s"$LISTENER_BUS_EVENT_QUEUE_PREFIX.$name.capacity",
    conf.get(LISTENER_BUS_EVENT_QUEUE_CAPACITY))
  assert(queueSize > 0, s"capacity for event queue $name must be greater than 0, " +
    s"but $queueSize is configured.")
  queueSize // defaults to 10000
}
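As the method above shows, the per-queue key spark.scheduler.listenerbus.eventqueue.&lt;name&gt;.capacity takes precedence over the global key, which in turn falls back to the built-in default of 10000. A sketch of that lookup order in Java, using a plain Map in place of SparkConf (illustrative, not Spark's API):

```java
import java.util.Map;

public class CapacityLookup {
    static final String PREFIX = "spark.scheduler.listenerbus.eventqueue";

    // Per-queue setting first, then the global setting, then the default 10000.
    static int capacity(Map<String, String> conf, String name) {
        String v = conf.get(PREFIX + "." + name + ".capacity");
        if (v == null) v = conf.get(PREFIX + ".capacity");
        return v == null ? 10000 : Integer.parseInt(v);
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of(
                PREFIX + ".capacity", "30000",
                PREFIX + ".shared.capacity", "50000");
        System.out.println(capacity(conf, "shared"));     // per-queue key wins
        System.out.println(capacity(conf, "appStatus"));  // falls back to the global key
        System.out.println(capacity(Map.of(), "shared")); // built-in default
    }
}
```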
3. Spark official documentation (see the configuration reference for spark.scheduler.listenerbus.eventqueue.capacity)
Solution:
1. The way to stop losing events is to use the parameter Spark provides and statically enlarge the queue capacity at initialization; this requires giving the driver a little more memory.
2. At the cluster level, set spark.scheduler.listenerbus.eventqueue.capacity in the Spark configuration to a value greater than 10000.
3. This value sets the capacity of the application status event queue, which holds the events for the internal application status listeners. Increasing it lets the queue hold more events, but may cause the driver to use more memory.
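For example, the capacity can be raised per job at submit time (the value 30000 and the driver memory below are illustrative; tune them to your workload):

```shell
spark-submit \
  --conf spark.scheduler.listenerbus.eventqueue.capacity=30000 \
  --driver-memory 4g \
  --class <main-class> <app>.jar
```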
Reference:
Research on the Spark History Server losing Task, Job, and Stage information
New findings
Recently a colleague found a PR that someone had raised on GitHub which gets to the essence of the problem. The original link is below:
https://github.com/apache/spark/pull/31839
Original text, translated:
This PR proposes an alternative fix for the ExecutionListenerBus memory leak, one that cleans the leaked listeners up automatically.
Basically, the idea is to add registerSparkListenerForCleanup to ContextCleaner, so that when a SparkSession is GC'ed, its ExecutionListenerBus can be removed from the LiveListenerBus.
On the other hand, to make the SparkSession GC-able, the SparkSession reference held inside ExecutionListenerBus has to be removed. Therefore a sessionUUID (a unique identifier of the SparkSession) was introduced to replace the SparkSession object.
SPARK-34087
Analysis
From this we can see that this is a bug in Spark 3.0.1: ExecutionListenerBus instances keep accumulating and are never reclaimed by GC. And although the default queue length is only 10,000, so that old entries start being evicted once it reaches 10,000, they are removed more slowly than new ones are added; the queue grows very long, sits in memory the whole time, and eventually causes a Driver OOM.
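The leak pattern can be sketched as follows (the class names here are illustrative stand-ins, not Spark's actual classes): every short-lived session registers a listener on a long-lived shared bus and never deregisters it, so even after the sessions become garbage, the bus keeps the listeners alive and their count only grows.

```java
import java.util.ArrayList;
import java.util.List;

public class ListenerLeakDemo {
    // A long-lived bus (like LiveListenerBus) holding strong references to listeners.
    static class Bus {
        final List<Object> listeners = new ArrayList<>();
        void addListener(Object l) { listeners.add(l); }
    }

    // A short-lived session that registers a listener but never removes it.
    static class Session {
        Session(Bus bus) { bus.addListener(new Object()); }
    }

    public static void main(String[] args) {
        Bus bus = new Bus();
        for (int i = 0; i < 1000; i++) {
            new Session(bus); // the Session becomes garbage, its listener does not
        }
        // The bus still pins all 1000 listeners: this mirrors the ever-growing
        // ExecutionListenerBus instance count observed with arthas below.
        System.out.println(bus.listeners.size());
    }
}
```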
Here is how to inspect the number of ExecutionListenerBus instances:
- 1. Find the driver node
Locate the ApplicationMaster (AM) that hosts the driver:
[node04 userconf]# jps | grep ApplicationMaster
168299 ApplicationMaster
168441 ApplicationMaster
[node04 userconf]# ps -ef | grep application_1658675121201_0408 | grep 168441
hadoop 168441 168429 24 Jul27 ? 07:53:51 /usr/java/jdk1.8.0_202/bin/java -server -Xmx2048m -Djava.io.tmpdir=/tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/application_1658675121201_0408/container_e47_1658675121201_0408_01_000001/tmp -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -Dspark.yarn.app.container.log.dir=/var/log/udp/2.0.0.0/hadoop/userlogs/application_1658675121201_0408/container_e47_1658675121201_0408_01_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class cn.huorong.SampleTaskScanMapping_Official --jar file:/data/udp/2.0.0.0/dolphinscheduler/exec/process/5856696115520/5888199825472_7/70/4192/spark real time /taskmapping_official_stream-3.0.3.jar --arg -maxR --arg 100 --arg -t --arg hr_task_scan_official --arg -i --arg 3 --arg -g --arg mappingOfficialHbaseOfficial --arg -pn --arg OFFICIAL --arg -ptl --arg SAMPLE_TASK_SCAN,SCAN_SHA1_M_TASK,SCAN_TASK_M_SHA1 --arg -hp --arg p --arg -local --arg false --properties-file /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/application_1658675121201_0408/container_e47_1658675121201_0408_01_000001/__spark_conf__/__spark_conf__.properties --dist-cache-conf /tmp/hadoop-hadoop/nm-local-dir/usercache/hadoop/appcache/application_1658675121201_0408/container_e47_1658675121201_0408_01_000001/__spark_conf__/__spark_dist_cache__.properties
- 2. Attach arthas
# switch to a non-root user first
java -jar arthas-boot.jar --telnet-port 9998 --http-port -1
Then select the entry corresponding to PID 168441.
- 3. Use arthas to count the instances
# run this a few times; the instance count keeps growing
vmtool --action getInstances --className *ExecutionListenerBus --limit 10000 --express 'instances.length'
@Integer[2356]
Solution:
The official Spark recommendation is to upgrade; Spark 3.0.3 fixes the problem. We were on 3.0.1, so it was a minor-version upgrade: simply replacing the Spark jars was enough. Monitoring the listenerBus count again afterwards, you will see the number fluctuate instead of growing monotonically.
We turned the task back on and observed it for 2 days; everything was fine. With that, the problem we had spent 5 days investigating was solved.