Spark shuffle process and MR shuffle process
2022-07-06 21:44:00 【Big data Xiaochen】
SparkShuffle
Before Spark 1.2: 【HashShuffle】
Spark 1.2 and later: 【SortShuffle】
MR shuffle
Spark shuffle

HashShuffleManager (deprecated)
Unoptimized HashShuffle: number of intermediate small files = 【number of upstream tasks】 * 【number of downstream tasks】, which quickly becomes a very large number.
groupBy and join can both trigger a shuffle. The figure below was originally a join case, simplified here to a groupBy aggregation.
Optimized HashShuffle: number of intermediate small files = 【number of Executors】 * 【number of downstream tasks】 (assuming one core per Executor). Tasks running on the same Executor reuse the same files, so the file count drops substantially.
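The two file-count formulas above can be checked with a little arithmetic. This is only a sketch: the function names and the job size (1000 upstream tasks, 1000 downstream tasks, 100 Executors) are hypothetical, and the optimized formula assumes one core per Executor.

```python
def hash_shuffle_files(upstream_tasks: int, downstream_tasks: int) -> int:
    """Unoptimized HashShuffle: each upstream task writes one file per downstream task."""
    return upstream_tasks * downstream_tasks


def consolidated_hash_shuffle_files(executors: int, downstream_tasks: int) -> int:
    """Optimized HashShuffle: one file per (Executor, downstream task) pair.

    Assumes one core per Executor, as in the formula in the text.
    """
    return executors * downstream_tasks


# Hypothetical job: 1000 upstream tasks, 1000 downstream tasks, 100 Executors.
unoptimized = hash_shuffle_files(1000, 1000)
optimized = consolidated_hash_shuffle_files(100, 1000)

print(unoptimized)               # 1000000
print(optimized)                 # 100000
print(unoptimized // optimized)  # 10
```

In this example the reduction factor is the number of upstream tasks per Executor (10), which is a constant factor rather than an exponential one.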

SortShuffleManager (preferred)
Normal mechanism (sorting required)

1- Choose the data structure: for an aggregation-type shuffle operator such as reduceByKey, a 【Map】 data structure is used; for a shuffle operator such as join, an 【Array】 data structure is used.
2- Memory requested = current in-memory data size * 2 - memory obtained last time.
3- Sort: before spilling to a disk file, the data already held in the in-memory data structure is 【sorted】 by key.
4- Spill to disk: after sorting, the data is written to the disk file in batches. The default batch size is 10000 records; that is, the sorted data is written to the disk file 10,000 records at a time.
5- Merge: the spilled temporary files are merged into a single data file, together with an index file that records the start offset and end offset of each downstream task's data in that file.
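Steps 3 to 5 above can be sketched as a small in-memory simulation. This is not Spark's implementation: the function names, the `spill_threshold` parameter, and the use of plain Python lists in place of disk files are all assumptions for illustration. Records are sorted by (partition id, key), and the index stores record offsets rather than byte offsets.

```python
import heapq
from typing import Dict, Iterable, List, Tuple

Record = Tuple[int, str, int]  # (partition id, key, value)


def sort_key(record: Record) -> Tuple[int, str]:
    return (record[0], record[1])


def spill(buffer: List[Record], batch_size: int) -> List[Record]:
    """Sort the buffered records, then 'write' them to a spill file batch by batch."""
    ordered = sorted(buffer, key=sort_key)
    spill_file: List[Record] = []  # stands in for a temporary disk file
    for start in range(0, len(ordered), batch_size):
        spill_file.extend(ordered[start:start + batch_size])  # one batch write
    return spill_file


def sort_shuffle_write(
    records: Iterable[Record],
    spill_threshold: int,       # hypothetical: records held in memory before a spill
    batch_size: int = 10000,    # matches the default batch size mentioned in the text
) -> Tuple[List[Record], Dict[int, Tuple[int, int]]]:
    spills: List[List[Record]] = []
    buffer: List[Record] = []
    for record in records:
        buffer.append(record)
        if len(buffer) >= spill_threshold:
            spills.append(spill(buffer, batch_size))
            buffer = []
    if buffer:
        spills.append(spill(buffer, batch_size))

    # Merge: each spill file is already sorted, so a k-way merge preserves order.
    data_file = list(heapq.merge(*spills, key=sort_key))

    # Index: partition id -> (start offset, end offset), end exclusive.
    index: Dict[int, Tuple[int, int]] = {}
    for offset, (partition, _key, _value) in enumerate(data_file):
        start, _ = index.get(partition, (offset, offset))
        index[partition] = (start, offset + 1)
    return data_file, index


records = [(1, "b", 1), (0, "a", 1), (1, "a", 1), (0, "c", 1), (0, "b", 1)]
data_file, index = sort_shuffle_write(records, spill_threshold=2)
print(data_file)
print(index)  # {0: (0, 3), 1: (3, 5)}
```

A downstream task for partition 1 would then read only `data_file[3:5]`, which is the purpose of the index file.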
Bypass mechanism (no sorting required)
The number of shuffle output (reduce-side) partitions must be less than or equal to the value of the 【spark.shuffle.sort.bypassMergeThreshold】 parameter (default 【200】).
The operator must not be a 【shuffle operator with map-side aggregation】.
reduceByKey is a shuffle operator with map-side aggregation.
groupByKey is not a shuffle operator with map-side aggregation.
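The two bypass conditions can be written as a small predicate. This is a sketch, not Spark's API: the function and parameter names are hypothetical, and the count compared against the threshold is taken here to be the number of shuffle output partitions, which is an assumption of this sketch.

```python
def should_bypass_merge_sort(
    num_partitions: int,
    map_side_combine: bool,
    bypass_merge_threshold: int = 200,  # default of spark.shuffle.sort.bypassMergeThreshold
) -> bool:
    """Return True when both bypass conditions described above hold."""
    if map_side_combine:
        # Operators with map-side aggregation (e.g. reduceByKey) never bypass.
        return False
    return num_partitions <= bypass_merge_threshold


print(should_bypass_merge_sort(100, map_side_combine=False))  # True  (e.g. groupByKey)
print(should_bypass_merge_sort(100, map_side_combine=True))   # False (e.g. reduceByKey)
print(should_bypass_merge_sort(500, map_side_combine=False))  # False (above the threshold)
```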