Spark shuffle process and MR shuffle process
2022-07-06 21:44:00 【Big data Xiaochen】
SparkShuffle
Before Spark 1.2: 【HashShuffle】 is the default
Spark 1.2 and later: 【SortShuffle】 is the default
MR shuffle
Spark shuffle

HashShuffleManager (deprecated)
Unoptimized HashShuffle: number of intermediate small files = 【number of upstream tasks】 * 【number of downstream tasks】, which becomes very large.
groupBy and join can both trigger a shuffle. (The original figure showed the join case, simplified to a groupBy aggregation.)
Optimized HashShuffle: number of intermediate small files = 【number of Executors】 * 【number of downstream tasks】. The count drops substantially because it no longer scales with the number of upstream tasks.
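The two file-count formulas above can be compared with a small calculation. This is only a sketch of the arithmetic; the function names and the job sizes (1000 upstream tasks, 1000 downstream tasks, 10 Executors) are hypothetical values chosen for illustration.

```python
def hash_shuffle_file_count(upstream_tasks: int, downstream_tasks: int) -> int:
    """Unoptimized HashShuffle: each upstream task writes one file per downstream task."""
    return upstream_tasks * downstream_tasks


def optimized_hash_shuffle_file_count(executors: int, downstream_tasks: int) -> int:
    """Optimized HashShuffle: files are shared per Executor, as in the formula above."""
    return executors * downstream_tasks


# Hypothetical job sizes, for illustration only.
unoptimized = hash_shuffle_file_count(upstream_tasks=1000, downstream_tasks=1000)
optimized = optimized_hash_shuffle_file_count(executors=10, downstream_tasks=1000)
print(unoptimized, optimized, unoptimized // optimized)
```

With these numbers the file count falls from one million to ten thousand, a factor equal to the ratio of upstream tasks to Executors.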

SortShuffleManager (preferred)
Normal mechanism (sorting required)

1- Choose the data structure: for an aggregating shuffle operator such as reduceByKey, a 【Map】 data structure is used; for a shuffle operator such as join, an 【Array】 data structure is used.
2- Request memory: requested memory = current data size * 2 - memory granted last time.
3- Sort: before spilling to a disk file, the data already held in the in-memory data structure is 【sorted】 by key.
4- Spill to disk: after sorting, the data is written to disk files in batches. The default batch size is 10,000 records; that is, the sorted data is written to the disk file 10,000 records at a time.
5- Merge: the spilled temporary files are merged into a single data file, and an index file records the start offset and end offset of each downstream task's data within that file.
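The five steps above can be sketched in plain Python. This is a simplified model under stated assumptions, not Spark's actual implementation: records are hypothetical `(partition id, key, value)` tuples, values are summed as a stand-in for the aggregation function, the sort order is assumed to be partition id then key, and the "files" are in-memory byte strings. All function names are made up for illustration.

```python
import heapq
from typing import Dict, List, Tuple

Record = Tuple[int, str, int]  # (partition id, key, value), a hypothetical layout
BATCH_SIZE = 10_000            # records per write batch, as stated in step 4


def memory_to_request(current_bytes: int, granted_bytes: int) -> int:
    """Step 2: requested memory = current data size * 2 - memory granted last time."""
    return current_bytes * 2 - granted_bytes


def buffer_with_map(records: List[Record]) -> List[Record]:
    """Step 1, aggregating operators: a map combines values per (partition, key)."""
    combined: Dict[Tuple[int, str], int] = {}
    for partition, key, value in records:
        combined[(partition, key)] = combined.get((partition, key), 0) + value
    return [(p, k, v) for (p, k), v in combined.items()]


def buffer_with_array(records: List[Record]) -> List[Record]:
    """Step 1, non-aggregating operators: records are appended to an array as-is."""
    return list(records)


def sort_into_batches(buffer: List[Record], batch_size: int = BATCH_SIZE) -> List[List[Record]]:
    """Steps 3-4: sort the buffer, then cut it into batches for writing."""
    ordered = sorted(buffer, key=lambda r: (r[0], r[1]))
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def merge_spills(spills: List[List[Record]], num_partitions: int) -> Tuple[bytes, List[Tuple[int, int]]]:
    """Step 5: merge sorted spills into one data blob plus (start, end) offsets per partition."""
    data = bytearray()
    index: List[Tuple[int, int]] = []
    merged = heapq.merge(*spills, key=lambda r: (r[0], r[1]))
    current = next(merged, None)
    for partition in range(num_partitions):
        start = len(data)
        while current is not None and current[0] == partition:
            _, key, value = current
            data += f"{key}\t{value}\n".encode("utf-8")
            current = next(merged, None)
        index.append((start, len(data)))
    return bytes(data), index


# Two small spills: the first aggregated with a map, the second kept as an array.
first = [r for batch in sort_into_batches(buffer_with_map([(1, "b", 1), (0, "a", 1), (1, "b", 2)])) for r in batch]
second = [r for batch in sort_into_batches(buffer_with_array([(0, "c", 5)])) for r in batch]
data_file, index_file = merge_spills([first, second], num_partitions=2)
print(data_file, index_file)
```

A downstream task for partition `p` would then read only the byte range `index_file[p]` of the merged data, which is the purpose of recording the start and end offsets.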
Bypass mechanism (no sorting required)
It is triggered when both of the following conditions hold:
The number of downstream (shuffle read) tasks is less than or equal to the value of the 【spark.shuffle.sort.bypassMergeThreshold】 parameter (default 【200】).
The operator is not a 【shuffle operator with map-side aggregation】.
reduceByKey is a shuffle operator with map-side aggregation.
groupByKey is not a shuffle operator with map-side aggregation.
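The bypass decision can be written as a small predicate. This is a sketch of the rule as described here, not Spark's own code: the function name and the operator lookup table are hypothetical, and the count compared against the threshold is taken to be the number of reduce partitions, following the description of `spark.shuffle.sort.bypassMergeThreshold` in Spark's configuration documentation.

```python
BYPASS_MERGE_THRESHOLD_DEFAULT = 200  # default of spark.shuffle.sort.bypassMergeThreshold

# Hypothetical lookup table for illustration; Spark derives this from the
# shuffle dependency, not from operator names.
MAP_SIDE_AGGREGATION = {"reduceByKey": True, "groupByKey": False}


def can_use_bypass(num_reduce_partitions: int,
                   map_side_aggregation: bool,
                   threshold: int = BYPASS_MERGE_THRESHOLD_DEFAULT) -> bool:
    """Bypass applies only without map-side aggregation and at or below the threshold."""
    return (not map_side_aggregation) and num_reduce_partitions <= threshold


for operator, aggregates in MAP_SIDE_AGGREGATION.items():
    print(operator, can_use_bypass(num_reduce_partitions=100, map_side_aggregation=aggregates))
```

Under this rule a groupByKey with 100 reduce partitions qualifies for the bypass path, while a reduceByKey never does, whatever the partition count.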