
Explanation of common Spark parameters

2022-06-11 02:34:00 hzp666

Spark's default configuration file is located at $SPARK_CONF_DIR/spark-defaults.conf on the fortress machine; users can open it to see the default values.
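To check what the current defaults actually are, the file can be inspected directly on the fortress machine. This is a minimal sketch, assuming $SPARK_CONF_DIR is already set in your shell; it only filters out comments and blank lines.

```bash
# Show the effective defaults, skipping comment and blank lines.
# Assumes $SPARK_CONF_DIR is exported on the fortress machine.
grep -vE '^[[:space:]]*(#|$)' "$SPARK_CONF_DIR/spark-defaults.conf"
```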

Note that these defaults have the lowest priority: if a configuration is specified explicitly when submitting the task or in the code, the user's setting takes precedence. Once you understand what a parameter means, you can tune it for a specific task by changing the submission parameter (a --conf value) rather than editing the spark-defaults.conf file.

The common parameters below can be set with --conf XXX=Y. For other parameters and their descriptions, see Configuration - Spark 3.2.1 Documentation.
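As a concrete illustration of the --conf syntax, the sketch below overrides two of the parameters discussed in the table at submission time. The script name my_job.py and the chosen values are placeholder assumptions for illustration, not recommendations from the article.

```bash
# Override individual defaults at submission time with --conf.
# my_job.py and the values shown are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.speculation=true \
  --conf spark.sql.shuffle.partitions=800 \
  my_job.py
```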

| Parameter name | Recommended value | Explanation |
|---|---|---|
| spark.master | yarn | Which resource scheduler to use; normally yarn. Local debugging can use local. |
| spark.submit.deployMode | cluster | Where the driver program runs. Use client for debugging; cluster is recommended for online tasks. |
| spark.driver.cores | 4 | Maximum number of CPU cores (threads) used by the driver. |
| spark.driver.memory | 4-10g | Memory requested for the driver. |
| spark.executor.memory | See "3. Spark Task tuning techniques" | Heap memory requested per executor. |
| spark.python.worker.memory | spark.executor.memory / 2 | Usually the default value is used. |
| spark.yarn.executor.memoryOverhead | 3072 | Off-heap memory requested per executor; usually the default value is used. |
| spark.executor.cores | See "3. Spark Task tuning techniques" | Maximum number of concurrent tasks per executor. |
| spark.executor.instances | See "3. Spark Task tuning techniques" | Number of executors. |
| spark.speculation | Default: false | Speculative execution, disabled (false) by default. If a job occasionally gets stuck, try enabling it. |
| spark.default.parallelism | See "3. Spark Task tuning techniques" | Controls the default number of RDD partitions. When reading HDFS files, the partition count depends on the block size and whether inputs are merged. |
| spark.sql.shuffle.partitions | See "3. Spark Task tuning techniques" | Number of shuffle partitions for SQL or SQL-like operators; increase this value when the data volume is large. |
| spark.pyspark.python | python2 / python3 / python3.5 | Specifies the Python version used by PySpark. If you use a Docker image, confirm the corresponding version exists in the image; the platform's base image only ships python2. |
| spark.log.level | Default: info | One of ALL, TRACE, DEBUG, INFO, WARN, ERROR, FATAL, OFF; case-insensitive. |
| spark.sql.hive.mergeFiles | Default: false | When enabled, small files generated by spark-sql are merged automatically. |
| spark.hadoop.jd.bdp.streaming.monitor.enable | Default: false | Whether to enable the batch-backlog alarm for streaming jobs. Disabled by default; enable it with --conf spark.hadoop.jd.bdp.streaming.monitor.enable=true. |
| spark.hadoop.jd.bdp.batch.threshold | Default: 10 | Batch-backlog alarm threshold for streaming jobs. The default is 10; adjust it as needed, e.g. --conf spark.hadoop.jd.bdp.batch.threshold=20. |
| spark.hadoop.jd.bdp.user.define.erps | The alarm group configured by the platform is used by default | For metrics such as the streaming batch backlog that only the user needs to watch, a custom alarm group can be set, e.g. --conf spark.hadoop.jd.bdp.user.define.erps="baibing12\|maruilei" (note: multiple users can be configured; separate adjacent ERPs with a vertical bar). |
| spark.isLoadHivercFile, spark.sql.tempudf.ignoreIfExists | Default: false | Whether to load all Hive UDFs (only supported under spark-sql, not spark-submit or pyspark). HiveTask already enables this, so no extra setting is needed. |
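Putting the table together, a submission command might look roughly like the following. All parameter names come from the table above; the specific values and the script name streaming_job.py are illustrative assumptions, not recommendations.

```bash
# Example submission combining resource sizing and the streaming backlog alarm
# parameters from the table above. Values and streaming_job.py are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.driver.cores=4 \
  --conf spark.driver.memory=6g \
  --conf spark.executor.memory=4g \
  --conf spark.executor.cores=2 \
  --conf spark.executor.instances=10 \
  --conf spark.pyspark.python=python3 \
  --conf spark.hadoop.jd.bdp.streaming.monitor.enable=true \
  --conf spark.hadoop.jd.bdp.batch.threshold=20 \
  --conf spark.hadoop.jd.bdp.user.define.erps="baibing12|maruilei" \
  streaming_job.py
```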