
Explanation of common Spark parameters

2022-06-11 02:34:00 hzp666

Spark's default configuration file is located at $SPARK_CONF_DIR/spark-defaults.conf on the fortress machine; users can open it to see the default values.
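To check what the current defaults actually are, the file can be inspected directly on the fortress machine. This is a minimal sketch, assuming $SPARK_CONF_DIR is already set in your shell; it only filters out comments and blank lines.

```bash
# Show the effective defaults, skipping comment and blank lines.
# Assumes $SPARK_CONF_DIR is exported on the fortress machine.
grep -vE '^[[:space:]]*(#|$)' "$SPARK_CONF_DIR/spark-defaults.conf"
```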

Note that these defaults have the lowest priority: if a configuration is specified explicitly when submitting the task or in the code, the user's setting takes precedence. Once you understand what a parameter means, you can tune it for a specific task by changing the submission parameter (a --conf value) rather than editing the spark-defaults.conf file.

The common parameters below can be set with --conf XXX=Y. For other parameters and their descriptions, see Configuration - Spark 3.2.1 Documentation.
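As a concrete illustration of the --conf syntax, the sketch below overrides two of the parameters discussed in the table at submission time. The script name my_job.py and the chosen values are placeholder assumptions for illustration, not recommendations from the article.

```bash
# Override individual defaults at submission time with --conf.
# my_job.py and the values shown are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.speculation=true \
  --conf spark.sql.shuffle.partitions=800 \
  my_job.py
```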

| Parameter name | Recommended value | Explanation |
|---|---|---|
| spark.master | yarn | Which resource scheduler to use; normally yarn. Local debugging can use local. |
| spark.submit.deployMode | cluster | Where the driver program runs. Use client for debugging; cluster is recommended for online tasks. |
| spark.driver.cores | 4 | Maximum number of CPU cores (threads) used by the driver. |
| spark.driver.memory | 4-10g | Memory requested for the driver. |
| spark.executor.memory | See "3. Spark Task tuning techniques" | Heap memory requested per executor. |
| spark.python.worker.memory | spark.executor.memory / 2 | Usually the default value is used. |
| spark.yarn.executor.memoryOverhead | 3072 | Off-heap memory requested per executor; usually the default value is used. |
| spark.executor.cores | See "3. Spark Task tuning techniques" | Maximum number of concurrent tasks per executor. |
| spark.executor.instances | See "3. Spark Task tuning techniques" | Number of executors. |
| spark.speculation | Default: false | Speculative execution, disabled (false) by default. If a job occasionally gets stuck, try enabling it. |
| spark.default.parallelism | See "3. Spark Task tuning techniques" | Controls the default number of RDD partitions. When reading HDFS files, the partition count depends on the block size and whether inputs are merged. |
| spark.sql.shuffle.partitions | See "3. Spark Task tuning techniques" | Number of shuffle partitions for SQL or SQL-like operators; increase this value when the data volume is large. |
| spark.pyspark.python | python2 / python3 / python3.5 | Specifies the Python version used by PySpark. If you use a Docker image, confirm the corresponding version exists in the image; the platform's base image only ships python2. |
| spark.log.level | Default: info | One of ALL, TRACE, DEBUG, INFO, WARN, ERROR, FATAL, OFF; case-insensitive. |
| spark.sql.hive.mergeFiles | Default: false | When enabled, small files generated by spark-sql are merged automatically. |
| spark.hadoop.jd.bdp.streaming.monitor.enable | Default: false | Whether to enable the batch-backlog alarm for streaming jobs. Disabled by default; enable it with --conf spark.hadoop.jd.bdp.streaming.monitor.enable=true. |
| spark.hadoop.jd.bdp.batch.threshold | Default: 10 | Batch-backlog alarm threshold for streaming jobs. The default is 10; adjust it as needed, e.g. --conf spark.hadoop.jd.bdp.batch.threshold=20. |
| spark.hadoop.jd.bdp.user.define.erps | The alarm group configured by the platform is used by default | For metrics such as the streaming batch backlog that only the user needs to watch, a custom alarm group can be set, e.g. --conf spark.hadoop.jd.bdp.user.define.erps="baibing12\|maruilei" (note: multiple users can be configured; separate adjacent ERPs with a vertical bar). |
| spark.isLoadHivercFile, spark.sql.tempudf.ignoreIfExists | Default: false | Whether to load all Hive UDFs (only supported under spark-sql, not spark-submit or pyspark). HiveTask already enables this, so no extra setting is needed. |
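Putting the table together, a submission command might look roughly like the following. All parameter names come from the table above; the specific values and the script name streaming_job.py are illustrative assumptions, not recommendations.

```bash
# Example submission combining resource sizing and the streaming backlog alarm
# parameters from the table above. Values and streaming_job.py are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.driver.cores=4 \
  --conf spark.driver.memory=6g \
  --conf spark.executor.memory=4g \
  --conf spark.executor.cores=2 \
  --conf spark.executor.instances=10 \
  --conf spark.pyspark.python=python3 \
  --conf spark.hadoop.jd.bdp.streaming.monitor.enable=true \
  --conf spark.hadoop.jd.bdp.batch.threshold=20 \
  --conf spark.hadoop.jd.bdp.user.define.erps="baibing12|maruilei" \
  streaming_job.py
```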