
Spark optimization - Troubleshooting

2022-06-13 03:34:00 TRX1024

Contents

Troubleshooting 1: Controlling the reduce-side buffer size to avoid OOM
Troubleshooting 2: Fixing shuffle file fetch failures caused by JVM GC
Troubleshooting 3: Fixing errors caused by serialization
Troubleshooting 4: Handling problems caused by operator functions returning NULL
Troubleshooting 5: Fixing the network card traffic surge caused by YARN-client mode
Troubleshooting 6: Fixing the JVM memory overflow (PermGen) that prevents execution in YARN-cluster mode
Troubleshooting 7: Fixing the JVM stack memory overflow caused by Spark SQL
Troubleshooting 8: Using persistence and checkpoint


Other posts in the Spark optimization series that may interest you:

1. Spark optimization: performance tuning (general performance, operators, Shuffle, JVM)
2. Spark optimization: data skew solutions
3. Spark optimization: troubleshooting


Troubleshooting 1: Controlling the reduce-side buffer size to avoid OOM

During Shuffle, reduce-side tasks do not wait for the map-side tasks to write all of their data to disk before pulling it. Instead, as soon as the map side has written a bit of data, the reduce-side tasks pull that bit and immediately perform aggregation, apply operator functions, and carry out other follow-up processing.

How much data a reduce-side task can pull at a time is determined by the reduce-side pull buffer: pulled data is first placed in the buffer and then processed from there. The buffer's default size is 48MB. Because reduce-side tasks compute while they pull, the buffer is not necessarily filled to 48MB on every fetch; most of the time a task pulls only part of that amount and processes it.

Although enlarging the reduce-side buffer reduces the number of fetches and improves Shuffle performance, sometimes the map side produces a very large amount of data very quickly. In that case every reduce-side task may hit its own buffer limit of 48MB during the pull, and together with the many objects created by the reduce-side aggregation code, this can easily lead to a memory overflow (OOM).

If a reduce-side memory overflow occurs, consider reducing the size of the reduce-side pull buffer, for example to 12MB.
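A minimal sketch of how this could be set, assuming the job is configured through SparkConf; spark.reducer.maxSizeInFlight is the Spark property that controls this reduce-side pull buffer, and 48m is its default:

val conf = new SparkConf()
  .set("spark.reducer.maxSizeInFlight", "12m") // reduce-side pull buffer, down from the 48m default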

This problem has actually occurred in production, and it is a typical case of trading performance for stable execution based on an understanding of the execution principle: with a smaller reduce-side pull buffer, OOM is less likely, but correspondingly the reduce side performs more fetches, which adds network transfer overhead and degrades performance.

Note: make sure the job runs first, then consider performance optimization.

Troubleshooting 2: Fixing shuffle file fetch failures caused by JVM GC

In Spark jobs, a shuffle file not found error sometimes appears. It is a very common error, and often, after it occurs, simply re-running the job makes it go away.

A likely cause is the following: during Shuffle, tasks of the next stage go to the Executor that ran the tasks of the previous stage to pull data, but that Executor happens to be performing GC. A GC pause stops all work inside the Executor, including the BlockManager and the Netty-based network communication. The downstream task therefore fails to get the data after waiting for a long time and reports a shuffle file not found error; when the job is run a second time the error does not reappear.

This can be addressed by adjusting two parameters: the retry count for reduce-side data fetches and the wait interval between retries. Increasing both values makes the reduce side retry failed fetches more times and wait longer after each failure.

val conf = new SparkConf()
  .set("spark.shuffle.io.maxRetries", "60") // fetch retry count, default 3
  .set("spark.shuffle.io.retryWait", "60s") // wait between retries, default 5s

Troubleshooting 3: Fixing errors caused by serialization

When a Spark job fails at runtime and the error message contains Serializable or similar words, the error is probably caused by a serialization problem.

There are three points to keep in mind about serialization (see the sketch after this list):

  • A custom class used as the element type of an RDD must be serializable;
  • External custom variables used inside operator functions must be serializable;
  • Third-party types that do not support serialization, such as Connection, must not be used as the element type of an RDD or inside operator functions.
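A minimal sketch illustrating the three rules above; the class, variable names, and JDBC URL are made up for illustration and are not from the original article:

import java.sql.DriverManager
import org.apache.spark.{SparkConf, SparkContext}

// Rule 1: a custom class used as an RDD element type must be serializable
class UserInfo(val id: Long, val name: String) extends Serializable

object SerializationExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("serialization-example"))

    // Rule 2: an external variable captured by an operator function must be serializable
    val prefix = "user_" // String is serializable, so capturing it is safe

    val users = sc.parallelize(Seq(new UserInfo(1L, "a"), new UserInfo(2L, "b")))
    val named = users.map(u => prefix + u.name)

    // Rule 3: never capture a non-serializable type such as java.sql.Connection in a closure;
    // create it inside the operator instead, e.g. one connection per partition
    val ids = users.mapPartitions { iter =>
      val conn = DriverManager.getConnection("jdbc:mysql://host:3306/db") // placeholder URL, needs a real database to run
      val out = iter.map(_.id).toList // consume the iterator before closing the connection
      conn.close()
      out.iterator
    }

    println(named.collect().mkString(", "))
    sc.stop()
  }
}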

Troubleshooting 4: Handling problems caused by operator functions returning NULL

Some operator functions require a return value, but in certain cases we have nothing to return for a record. Returning NULL directly causes an error, for example a Scala.Math(NULL) exception. When you encounter a record for which you do not want to return a value, you can handle it as follows (see the sketch after this list):

  • Return a special marker value instead of NULL, for example "-1";
  • After the operator produces the resulting RDD, run a filter on it to remove the records whose value is -1;
  • After the filter operator, call the coalesce operator to compact the partitions.
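A minimal sketch of this pattern, assuming an existing SparkContext named sc; the numbers and the filtering condition are made up for illustration:

val numbers = sc.parallelize(1 to 100)

// Step 1: return a special marker value such as -1 instead of null
val parsed = numbers.map { n =>
  if (n % 2 == 0) n * 10 // records we actually want
  else -1                // marker for "no result" instead of returning null
}

// Step 2: filter out the marker records
val cleaned = parsed.filter(_ != -1)

// Step 3: coalesce to compact the partitions that filtering left sparse
val result = cleaned.coalesce(4)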

Troubleshooting 5: Fixing the network card traffic surge caused by YARN-client mode

(Figure: the operating principle of YARN-client mode.)

In YARN-client mode, the Driver starts on the local machine and is responsible for all task scheduling, so it must communicate frequently with the many Executors on the YARN cluster.

Suppose there are 100 Executors and 1000 tasks, so each Executor is assigned 10 tasks. The Driver then has to communicate constantly with the 1000 tasks running on those Executors; the volume of communication data is large and the communication is extremely frequent. As a result, while the Spark job runs, this heavy and frequent network communication can make the network card traffic of the local machine surge.

Note that YARN-client mode should only be used in a test environment. The reason for using it is that you can see detailed, complete log output, which lets you locate problems in the program and avoid failures in production.

In production, YARN-cluster mode must be used. In YARN-cluster mode the local machine's network card traffic does not surge; if network communication problems do occur in YARN-cluster mode, they need to be solved together with the operations team.
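A minimal spark-submit sketch for the production setup described above; the class name, jar, and executor count are placeholders, not values from the original article:

# use --deploy-mode client only when debugging in a test environment
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MySparkApp \
  --num-executors 100 \
  my-spark-app.jar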

Troubleshooting 6: Fixing the JVM memory overflow (PermGen) that prevents execution in YARN-cluster mode

(Figure: the operating principle of YARN-cluster mode.)

When a Spark job contains Spark SQL, you may find that it runs in YARN-client mode but cannot be submitted and run in YARN-cluster mode (an OOM error is reported).

In YARN-client mode, the Driver runs on the local machine, and the PermGen (permanent generation) setting of the JVM that Spark uses comes from the local spark-class file, which sets the permanent generation to 128MB; that is enough. In YARN-cluster mode, however, the Driver runs on a node of the YARN cluster and uses the unconfigured default, where the permanent generation is only 82MB.

Internally, Spark SQL performs very complex SQL semantic analysis, syntax tree conversion, and so on. If the SQL statement itself is very complex, this is likely to cause performance loss and high memory usage, and the PermGen space in particular can be heavily consumed.

Therefore, if the PermGen usage is more than 82MB but less than 128MB, the job will run in YARN-client mode but fail in YARN-cluster mode.

The solution to this problem is to increase the PermGen capacity by setting the relevant parameter in the spark-submit script, as shown below.

--conf spark.driver.extraJavaOptions="-XX:PermSize=128M -XX:MaxPermSize=256M"

This sets the Driver's permanent generation to an initial size of 128MB and a maximum of 256MB, which avoids the problem described above.

Troubleshooting 7: Fixing the JVM stack memory overflow caused by Spark SQL

When a Spark SQL statement contains hundreds or thousands of or keywords, a JVM stack memory overflow may occur on the Driver side.

A JVM stack memory overflow is basically caused by calling too many levels of methods: a large amount of very deep recursion exceeds the JVM's stack-depth limit. (We speculate that when a Spark SQL statement contains a large number of or clauses, parsing the SQL, for example converting it into a syntax tree or generating the execution plan, recurses over the or conditions; when the or chain is very long, a great deal of recursion occurs.)

In this case, it is recommended to split the single SQL statement into multiple SQL statements, keeping each one within 100 or clauses. Based on tests in real production environments, keeping the or keywords of a single SQL statement under 100 usually does not cause a JVM stack memory overflow.
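A minimal sketch of this splitting approach, assuming a SparkSession named spark and a condition over a long list of ids that would otherwise become one enormous chain of or clauses; the table and column names are made up for illustration:

val ids: Seq[Long] = 1L to 500L // values that would otherwise form one huge OR chain

// run one query per batch of at most 100 conditions, then union the partial results
val result = ids.grouped(100).map { batch =>
  val predicate = batch.map(id => s"id = $id").mkString(" OR ")
  spark.sql(s"SELECT * FROM events WHERE $predicate")
}.reduce(_ union _)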

Troubleshooting 8: Using persistence and checkpoint

Spark's persistence works in most cases, but sometimes cached data is lost. If data is lost, the lost partitions have to be recomputed and then cached again for later use. To guard against data loss, you can checkpoint the RDD, which persists a copy of the data to a fault-tolerant file system (such as HDFS).

After an RDD has been both cached and checkpointed, if the cache is found to be lost, Spark first checks whether the checkpoint data exists; if it does, the checkpoint data is used instead of recomputing. In other words, checkpoint can be regarded as a safety net for the cache: if the cache fails, the checkpoint data is used.

The advantage of using checkpoint is that it improves the reliability of the Spark job: if something goes wrong with the cache, the data does not need to be recomputed. The disadvantage is that checkpoint has to write the data to a file system such as HDFS, which costs considerable performance.
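A minimal sketch of combining cache and checkpoint, assuming an existing SparkContext named sc; the HDFS paths are placeholders:

// a fault-tolerant directory for checkpoint data (placeholder HDFS path)
sc.setCheckpointDir("hdfs://namenode:8020/spark/checkpoints")

val processed = sc.textFile("hdfs://namenode:8020/input/data.txt")
  .map(_.toUpperCase)

processed.cache()      // keep the RDD in memory for reuse
processed.checkpoint() // also persist a copy to the checkpoint directory

// the first action materializes the RDD; it is cached and the checkpoint is written
processed.count()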

Copyright notice: this article was written by TRX1024. Please include a link to the original when reposting:
https://yzsam.com/2022/02/202202280529582361.html