Spark optimization - Troubleshooting
2022-06-13 03:34:00 【TRX1024】
Contents

Troubleshooting 1: Control the reduce-side buffer size to avoid OOM
Troubleshooting 2: Shuffle file pull failures caused by JVM GC
Troubleshooting 3: Fix errors caused by serialization problems
Troubleshooting 4: Fix problems caused by operator functions returning NULL
Troubleshooting 5: Network card traffic surge caused by YARN-client mode
Troubleshooting 6: JVM stack memory overflow in YARN-cluster mode prevents execution
Troubleshooting 7: JVM stack memory overflow caused by SparkSQL
Troubleshooting 8: Combined use of persistence and checkpoint
This post is part of a series on Spark optimization; interested readers may also want to look at:
1. Spark Optimization — Performance tuning (general performance, operators, Shuffle, JVM)
2. Spark Optimization — Data skew solutions
3. Spark Optimization — Troubleshooting
Troubleshooting 1: Control the reduce-side buffer size to avoid OOM
During Shuffle, reduce-side tasks do not wait for map-side tasks to write all of their data to disk before pulling it. Instead, as soon as the map side has written a little data, the reduce-side tasks pull a little data and immediately perform aggregation, apply operator functions, and so on.

How much data a reduce-side task can pull at once is determined by the reduce-side pull buffer: pulled data lands in the buffer first and is processed from there. The buffer's default size is 48MB. Reduce-side tasks pull and compute at the same time, so the buffer is not necessarily filled to 48MB on every pull; most of the time a task pulls only part of the data and processes it.

Increasing the reduce-side buffer reduces the number of pulls and improves Shuffle performance. But sometimes the map side produces a very large amount of data very quickly; all reduce-side tasks may then hit the buffer's 48MB limit while pulling, and together with the many objects created by the reduce-side aggregation code, this can easily lead to a memory overflow (OOM).

If a reduce-side memory overflow occurs, consider reducing the size of the reduce-side pull buffer, for example to 12MB.
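A minimal sketch of shrinking the buffer, assuming the spark.reducer.maxSizeInFlight property (the name used for this buffer in current Spark releases, default 48m):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // shrink the reduce-side pull buffer from the default 48m to 12m
  .set("spark.reducer.maxSizeInFlight", "12m")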
This problem has occurred in real production environments and is a typical case of sacrificing performance to ensure execution. With a smaller reduce-side pull buffer, OOM becomes less likely, but correspondingly the reduce side pulls more times, which incurs more network transfer overhead and degrades performance.

Note: first make sure the job can run at all; only then consider performance optimization.
Troubleshooting 2: Shuffle file pull failures caused by JVM GC
In Spark jobs, a "shuffle file not found" error sometimes appears. It is a very common error, and often the job no longer reports it when simply resubmitted.

A likely cause: during Shuffle, tasks of a later stage go to the Executor where the previous stage's tasks ran to pull data, but that Executor happens to be running GC. A GC pause stops all work inside the Executor, such as the BlockManager and the netty-based network communication, so the downstream tasks fail to get the data for a long time and report a "shuffle file not found" error; on the second run the timing differs and the error does not recur.

Shuffle can be tuned for this case with two parameters: the retry count for reduce-side data pulls and the wait interval between retries. Increasing their values makes the reduce side retry more times and wait longer after each failure:
val conf = new SparkConf()
  // retry a failed shuffle fetch up to 60 times (default: 3)
  .set("spark.shuffle.io.maxRetries", "60")
  // wait 60s between retries (default: 5s)
  .set("spark.shuffle.io.retryWait", "60s")
Troubleshooting 3: Fix errors caused by serialization problems
When a Spark job reports an error during execution and the message contains "Serializable" or similar words, the error is probably caused by a serialization problem.

Three points about serialization deserve attention (see the sketch after this list):

- Custom classes used as RDD element types must be serializable;
- External custom variables used inside operator functions must be serializable;
- Third-party types that do not support serialization, such as Connection, must not be used as RDD element types or inside operator functions.
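A minimal sketch of all three rules; Record, the sample data, and the JDBC URL are illustrative names, not from the original post:

import java.sql.DriverManager
import org.apache.spark.{SparkConf, SparkContext}

// Rule 1: an RDD element type must be serializable (a case class is by
// default; a plain class would need to extend Serializable explicitly).
case class Record(id: Int, name: String)

object SerializationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("serialization-sketch"))

    // Rule 2: an external variable captured by an operator function must be
    // serializable; a String is, so this closure ships to executors safely.
    val suffix = "-checked"
    val records = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))
      .map(r => r.copy(name = r.name + suffix))

    // Rule 3: a JDBC Connection is not serializable; create it inside the
    // operator (once per partition) instead of capturing it from the driver.
    records.foreachPartition { iter =>
      val conn = DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "pwd")
      try iter.foreach(r => println(r)) // in real code, write r using conn
      finally conn.close()
    }
  }
}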
Troubleshooting 4: Fix problems caused by operator functions returning NULL
Some operator functions must produce a return value for every record, but for some records we have nothing to return. Returning NULL directly causes an error, for example a Scala.Math(NULL) exception. When you hit a case where no value should be returned, it can be solved as follows (see the sketch after this list):

- Return a special value such as -1 instead of NULL;
- After the operator produces an RDD, run filter on it to drop the records whose value is -1;
- After the filter, call the coalesce operator to optimize the partition count.
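A minimal sketch of the three steps, assuming an RDD[String] of numeric records; names and sample data are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("sentinel-sketch"))
val lines = sc.parallelize(Seq("1", "", "2", " 3 "))

val parsed = lines
  .map { line =>
    val s = line.trim
    if (s.isEmpty) -1 else s.toInt // step 1: return the sentinel -1 instead of NULL
  }
  .filter(_ != -1) // step 2: filter out the sentinel records
  .coalesce(2)     // step 3: compact the partitions left sparse by the filter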
Troubleshooting 5: Network card traffic surge caused by YARN-client mode
[Figure: how YARN-client mode works]
In YARN-client mode, the Driver starts on the local machine and is responsible for all task scheduling, so it must communicate frequently with the many Executors on the YARN cluster.

Suppose there are 100 Executors and 1000 tasks, so each Executor is assigned 10 tasks. The Driver then communicates frequently with the 1000 tasks running on those Executors; the volume of communication data is large and the frequency especially high. As a result, while the Spark job runs, the constant heavy network communication can cause a traffic surge on the local machine's network card.

Note that YARN-client mode should only be used in a test environment. The reason for using it there is that the detailed, complete log information is visible locally; by reading the logs you can pin down problems in the program and avoid failures in production.

In a production environment, YARN-cluster mode must be used. YARN-cluster mode does not cause the local machine's network card traffic to surge; if network communication problems appear in YARN-cluster mode, the operations team needs to resolve them.
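For reference, a minimal YARN-cluster submission; the class and jar names are placeholders:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my-app.jar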
Troubleshooting 6: JVM stack memory overflow in YARN-cluster mode prevents execution
[Figure: how YARN-cluster mode works]
When a Spark job contains SparkSQL, you may find that it runs in YARN-client mode but cannot be submitted in YARN-cluster mode (an OOM error is reported).

In YARN-client mode, the Driver runs on the local machine, and the JVM PermGen (permanent generation) configuration Spark uses is the one in the local spark-class file, which sizes the permanent generation at 128MB. That is enough. In YARN-cluster mode, however, the Driver runs on a node of the YARN cluster and uses the unconfigured default: a permanent generation of only 82MB.

Internally, SparkSQL performs very complicated SQL semantic analysis, syntax-tree conversion, and so on. If the SQL statement itself is very complicated, this is likely to cause performance loss and heavy memory usage, especially of PermGen.

So if PermGen usage at that point exceeds 82MB but stays below 128MB, the job runs in YARN-client mode but fails in YARN-cluster mode.
The solution is to increase the PermGen capacity by setting the relevant parameter in the spark-submit script, as follows:

--conf spark.driver.extraJavaOptions="-XX:PermSize=128M -XX:MaxPermSize=256M"
This sets the Driver's permanent generation to an initial size of 128MB and a maximum of 256MB, which avoids the problem described above.
Troubleshooting 7: JVM stack memory overflow caused by SparkSQL
When a SparkSQL statement contains hundreds or thousands of "or" keywords, a JVM stack memory overflow may occur on the Driver side.

A JVM stack memory overflow is basically caused by too many levels of method calls: a large amount of very deep recursion that exceeds the JVM's stack-depth limit. (We speculate that when SparkSQL handles a statement with a huge number of "or" keywords, for example while converting it to a syntax tree or generating the execution plan, it processes "or" recursively; with very many "or"s, a great deal of recursion occurs.)

In this case, it is recommended to split the one SQL statement into multiple statements, keeping each statement within 100 clauses. According to tests in a real production environment, keeping the "or" keywords of a single statement within 100 usually avoids the JVM stack memory overflow.
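A minimal sketch of the split, assuming a SparkSession named spark, a table t, and an id list that was originally joined by OR; all names are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("split-or-sketch").getOrCreate()
val ids = 1 to 5000

val result = ids.grouped(100).map { batch => // at most 100 OR predicates per statement
  val predicate = batch.map(id => s"id = $id").mkString(" OR ")
  spark.sql(s"SELECT * FROM t WHERE $predicate")
}.reduce(_ union _) // combine the partial results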
Troubleshooting 8: Combined use of persistence and checkpoint
Spark persistence is fine in most cases, but data can sometimes be lost, and lost data has to be recomputed and then cached again. To guard against data loss entirely, you can checkpoint the RDD, which persists a copy of the data to a fault-tolerant file system (such as HDFS).

After an RDD has been cached and checkpointed, if the cache is found to be lost, Spark first checks whether the checkpoint data exists; if it does, the checkpoint data is used instead of recomputing. In other words, checkpoint can be regarded as the cache's safety mechanism: if the cache fails, the checkpoint data is used.

The advantage of checkpointing is improved reliability of the Spark job: once something goes wrong with the cache, the data need not be recomputed. The disadvantage is that the checkpoint writes the data to a file system such as HDFS, which costs a lot of performance.
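A minimal sketch of combining the two; the HDFS paths are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("checkpoint-sketch"))
sc.setCheckpointDir("hdfs://namenode:8020/spark/checkpoints")

val rdd = sc.textFile("hdfs://namenode:8020/input").map(_.length)
rdd.cache()      // cache first so checkpointing reads the cache instead of recomputing
rdd.checkpoint() // a copy is written to the checkpoint dir when the first action runs
rdd.count()      // action: materializes both the cache and the checkpoint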