
Improve reduce parallelism in shuffle operation

2022-07-26 05:07:00 Shangsilicon Valley iron powder

When schemes one and two fail to handle the data skew well, consider raising the reduce-side parallelism of the shuffle. Increasing the reduce-side parallelism increases the number of reduce tasks, so each task is assigned correspondingly less data, which alleviates the data skew problem.


  1. Setting the reduce-side parallelism

Most shuffle operators accept a parallelism parameter, for example reduceByKey(500). This parameter determines the reduce-side parallelism of the shuffle: when the shuffle runs, the specified number of reduce tasks is created. For shuffle-type statements in Spark SQL, such as group by and join, you instead set the parameter spark.sql.shuffle.partitions, which controls the parallelism of the shuffle read tasks. Its default value is 200, which is too small for many scenarios.
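As a sketch of where these two knobs are set, assuming a PySpark environment (the app name and the lambda are illustrative, not from the original article):

```python
# Illustrative PySpark sketch: requires a working Spark installation.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("skew-demo")
         # Raise reduce-side parallelism for Spark SQL shuffles
         # (group by / join); the default of 200 is often too small.
         .config("spark.sql.shuffle.partitions", "500")
         .getOrCreate())

# For RDD shuffle operators, pass the parallelism directly as an argument:
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
counts = rdd.reduceByKey(lambda a, b: a + b, 500)  # 500 reduce tasks
```

Note that spark.sql.shuffle.partitions only affects DataFrame/SQL shuffles; the numPartitions argument to reduceByKey governs the RDD API.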

Increasing the number of shuffle read tasks lets the multiple keys that were previously assigned to one task be spread over several tasks, so each task processes less data than before. For example, suppose there are 5 keys, each with 10 records, and all 5 keys are assigned to the same task: that task has to process 50 records. After adding shuffle read tasks, each task may be assigned only one key, so each task processes just 10 records, and the per-task execution time naturally drops.
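The 5-keys-by-10-records example above can be simulated in plain Python (this is a toy model of a hash partitioner, not Spark itself; `key % num_tasks` stands in for Spark's hash partitioning):

```python
# Toy simulation of reduce-side partitioning: 5 keys, 10 records each.
from collections import Counter

records = [(key, None) for key in range(5) for _ in range(10)]  # 50 records

def partition_sizes(num_tasks):
    """Count how many records each reduce task receives
    under a hash-style partitioner (key % num_tasks)."""
    return Counter(key % num_tasks for key, _ in records)

# With one reduce task, it must process all 50 records.
print(max(partition_sizes(1).values()))      # 50

# With 5 tasks, each key lands on its own task: 10 records apiece.
print(sorted(partition_sizes(5).values()))   # [10, 10, 10, 10, 10]
```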

  2. Drawbacks of raising the reduce-side parallelism

Raising the reduce-side parallelism does not fundamentally change the nature of the data skew (schemes 1 and 2 fundamentally avoid it); it only tries to alleviate the data pressure on each shuffle reduce task as much as possible. It applies to skew where many different keys each carry a relatively large amount of data.

This scheme usually cannot completely solve data skew, because of extreme cases: if a single key corresponds to 1,000,000 records, then no matter how much you increase the number of tasks, that key's 1,000,000 records will still all land on one task, so skew is bound to occur. This scheme is therefore only the first thing to try when data skew is discovered, a simple way to ease the skew, or something to use in combination with the other schemes.

Ideally, after the reduce-side parallelism is raised, the data skew is alleviated to some extent, perhaps even essentially eliminated. In some cases, however, it only makes the originally skewed, slow tasks run slightly faster, or avoids OOM in some tasks, while the job as a whole still runs slowly. In that case, give up on scheme 3 promptly and start trying the later solutions.

 

Copyright notice
This article was created by [Shangsilicon Valley iron powder]; when reposting, please include a link to the original. Thank you.
https://yzsam.com/2022/207/202207260500289573.html