Increasing reduce-side parallelism in shuffle operations
2022-07-26 05:07:00 【Shangsilicon Valley iron powder】
When solutions 1 and 2 do not handle data skew well, consider increasing the reduce-side parallelism of the shuffle. Higher reduce-side parallelism means more reduce tasks, so each task is assigned less data, which alleviates the data skew.

- Setting reduce-side parallelism
Most shuffle operators accept a parallelism parameter, for example the 500 in reduceByKey(func, 500). This parameter sets the reduce-side parallelism of the shuffle: when the shuffle runs, Spark creates the specified number of reduce tasks. For shuffle-type statements in Spark SQL, such as group by and join, you set the parameter spark.sql.shuffle.partitions instead. It controls the parallelism of the shuffle read tasks, and its default value of 200 is too small for many scenarios.
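The two settings described above can be sketched in PySpark as follows. This is a minimal sketch, not the author's code: the helper names `sum_by_key` and `set_sql_shuffle_partitions` are hypothetical, and the value 500 is only the example figure from the text. The helpers take an existing pair RDD and SparkSession as arguments, so the block itself does not import pyspark.

```python
from operator import add


def sum_by_key(pairs_rdd, num_reduce_tasks=500):
    """Sum the values of each key using `num_reduce_tasks` reduce-side partitions.

    `pairs_rdd` is assumed to be a PySpark RDD of (key, number) pairs.
    """
    # reduceByKey takes the reduce function first; numPartitions sets the
    # number of reduce tasks created for this shuffle.
    return pairs_rdd.reduceByKey(add, numPartitions=num_reduce_tasks)


def set_sql_shuffle_partitions(spark, num_reduce_tasks=500):
    """Set the shuffle read parallelism used by Spark SQL group by / join.

    `spark` is assumed to be an active SparkSession.
    """
    spark.conf.set("spark.sql.shuffle.partitions", str(num_reduce_tasks))
```

With a running session, one would call `set_sql_shuffle_partitions(spark, 500)` before issuing the skewed query, and `sum_by_key(rdd, 500)` in place of a plain `reduceByKey`. The right number depends on the data volume and cluster size, so treat 500 as a starting point to tune.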
Increasing the number of shuffle read tasks lets multiple keys that were originally assigned to one task be spread over several tasks, so each task processes less data than before. For example, suppose there are 5 keys with 10 records each, and all 5 keys are assigned to a single task; that task has to process 50 records. After adding shuffle read tasks, each task is assigned just one key and processes 10 records, so each task naturally finishes sooner.
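The 5-key example can be reproduced with a small pure-Python simulation. It is only a sketch: the function name `records_per_task` is hypothetical, and `key_id % num_tasks` is a simplified stand-in for Spark's hash partitioning, chosen so the result is easy to follow.

```python
from collections import Counter


def records_per_task(records_per_key, num_tasks):
    """Return {task_id: record_count} for a simplified modulo partitioner.

    `records_per_key` maps an integer key id to its number of records.
    """
    load = Counter()
    for key_id, n_records in records_per_key.items():
        # Stand-in for hash partitioning: a key always maps to one task.
        load[key_id % num_tasks] += n_records
    return dict(load)


keys = {k: 10 for k in range(5)}    # 5 keys, 10 records each
print(records_per_task(keys, 1))    # {0: 50}
print(records_per_task(keys, 5))    # {0: 10, 1: 10, 2: 10, 3: 10, 4: 10}
```

With one task, all 50 records pile up on it; with five tasks, each one handles a single key and 10 records. Real keys rarely spread this evenly, so the improvement in practice is usually smaller.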
- Limitations of setting reduce-side parallelism
Increasing reduce-side parallelism does not fundamentally change the nature of data skew (solutions 1 and 2 avoid data skew at the root). It only eases, as far as possible, the data pressure on the shuffle reduce tasks and the skew that results. It is suited to cases where many keys each have a relatively large amount of data.
This approach usually cannot eliminate data skew completely. In extreme cases, for example when a single key has 1 million records, that key is still handled by a single task however many tasks you add, so skew is bound to occur. It is therefore best treated as the first thing to try when you find data skew, a simple way to ease it, or as something to combine with other solutions.
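The extreme case can be illustrated with the same kind of simulation. Again this is a sketch under assumptions: `max_task_load` is a hypothetical helper, the modulo rule stands in for hash partitioning, and the data set (one key with 1,000,000 records plus 999 small keys) is made up for illustration.

```python
def max_task_load(records_per_key, num_tasks):
    """Return the record count of the most heavily loaded task."""
    load = [0] * num_tasks
    for key_id, n_records in records_per_key.items():
        # All records of one key go to the same task, whatever num_tasks is.
        load[key_id % num_tasks] += n_records
    return max(load)


records = {0: 1_000_000}                          # one hot key
records.update({k: 10 for k in range(1, 1000)})   # 999 small keys

for num_tasks in (10, 200, 1000):
    # The heaviest task never drops below the hot key's 1,000,000 records.
    print(num_tasks, max_task_load(records, num_tasks))
```

Adding tasks only moves the small keys away from the hot key's task. The hot key itself cannot be split by raising parallelism, which is why the heaviest task stays at roughly 1 million records.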
In the ideal case, raising reduce-side parallelism alleviates data skew to some extent, or even basically eliminates it. In other cases it only makes the slow, skewed tasks a little faster, or avoids OOM in some tasks, while the job still runs slowly. When that happens, give up on solution 3 promptly and move on to the later solutions.