
Why a Spark job's process cannot be shut down after the job finishes, and the solution

2022-06-13 05:37:00 Flying it people

Recently, our operations colleagues have repeatedly reported that, with Spark running in cluster mode, the process and ports of a job submitted from the command line are sometimes not released automatically after the job finishes, even though normally all of a Spark job's processes and ports are closed once execution completes. This seriously affects the other business teams. The hang does not occur every time and follows no obvious pattern; the job itself runs normally, data cleaning and processing are normal, and the results are written out correctly. Checking the logs showed that the job calls sparksession.stop when it completes, and it is this call that blocks the process from shutting down normally. Since the cause could not be determined from the logs, the next step was to analyze it at the JVM level and see whether memory or CPU was responsible, using the jstack <pid> command to print the JVM thread stacks:
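For reference, a minimal sketch of how such a thread dump can be captured; the output path is an arbitrary choice, and the driver has to be identified from your own job's main class:

# List running JVMs with their main classes to pick out the hung Spark driver.
jps -lm

# Dump all thread stacks of that process to a file for offline analysis;
# -l additionally prints ownership information for java.util.concurrent locks.
jstack -l <pid> > /tmp/spark-driver-jstack.txt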

Only part of the stack is reproduced here (it is shown as a screenshot in the original post):

From the stack we can roughly see that the thread running sparksession.stop in the main function is blocked. I will not go into the details of thread states here; you can look them up on Google or Baidu. But what is the cause? A closer look shows that the thread is waiting for a lock to be released. Why is it locked? The only way to find out is to look at the source code of the stop method:

Seeing the synchronized keyword makes it clear: the method takes a lock, so the next step is to find out who is holding it. Searching the stack dump turned up exactly such a thread.
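A minimal sketch of how to locate the owner of a contested monitor in the saved dump; the hexadecimal address below is a made-up example and must be replaced with the one printed in your own dump:

# The blocked thread's frames contain a line such as
# "- waiting to lock <0x00000000e0a1b2c3>"; note the monitor address.
grep -n "waiting to lock" /tmp/spark-driver-jstack.txt

# Search for the same address prefixed with "locked" to find the owning thread;
# -B 20 prints the preceding lines so the thread name and state are visible.
grep -n -B 20 "locked <0x00000000e0a1b2c3>" /tmp/spark-driver-jstack.txt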

So this thread had already acquired the lock, but why was it in WAITING state? That was the question to solve. Looking at it more carefully, it is related to Spark's ContextCleaner, which is responsible for cleaning up residual data left in memory. But the cleaner is a daemon thread, and it had been cleaning continuously while the job was running. Could it be stuck cleaning memory forever? On reflection, no: if it were still busy clearing space, the daemon thread would not be sitting in WAITING state. Reading the stack again, another line that I could make sense of stood out:

at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81)

An RPC timeout? It turned out that Spark's heartbeat communication had timed out. But according to the operations team there had been no network alerts and no memory-full alerts during the hang, so neither of those two causes seemed likely. After some googling, the answers people gave basically boiled down to two explanations:

1) A Spark node went down (which was certainly not the case here)

2) The problem is caused by data skew, which leads to long stop-the-world (STW) GC pauses; shortening GC time solves it
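If the second explanation applies, it can be checked directly once GC logging is enabled by the conf parameters below; a minimal sketch of scanning the resulting log for long pauses (the log location is an assumption: with only these flags, GC output goes to the driver's and executors' stdout logs):

# Long Full GC pauses approaching spark.network.timeout suggest that
# heartbeats are being starved by stop-the-world collections.
grep "Full GC" /path/to/driver-stdout.log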

So the following dynamic conf parameters were added to the Spark startup command:

--conf "spark.driver.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC

--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC

This switches the driver and executors to the CMS collector and turns on GC logging (only a simple set of JVM options; you can also configure the sizes of the young and old generations and other collector settings). At the same time, spark.network.timeout was changed from 36s back to the default of 120s. The job has now been running for several days without the problem recurring, which is a small relief, haha. But I am not sure this is really the root cause, so if any friend has clear experience handling this issue, please leave me a message. Thanks!
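For completeness, a sketch of what the full submit command looks like with these settings; the master URL, main class, and jar path are placeholders, not the values from the original job:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyEtlJob \
  --conf "spark.driver.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC" \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC" \
  --conf "spark.network.timeout=120s" \
  my-etl-job.jar

With these flags the GC details end up in the container stdout logs, so if the hang ever comes back, the GC log and another jstack dump can be compared to confirm or rule out long pauses as the cause.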


Copyright notice
This article was written by [Flying it people]. If you repost it, please include a link to the original. Thanks.
https://yzsam.com/2022/02/202202280509014532.html