Spark shuffle process and MR shuffle process
2022-07-06 21:44:00 【Big data Xiaochen】
SparkShuffle
Before Spark 1.2: 【HashShuffle】 is the default
Spark 1.2 and later: 【SortShuffle】 is the default
MR shuffle
Spark shuffle
HashShuffleManager (deprecated)
Unoptimized HashShuffle: number of intermediate small files = 【number of upstream tasks】 * 【number of downstream tasks】, which becomes very large.
Both groupBy and join can trigger a shuffle. The original figure simplifies the join case to a groupBy aggregation.
Optimized HashShuffle: number of intermediate small files = 【number of Executors】 * 【number of downstream tasks】. This is far fewer files, because the count no longer grows with the number of upstream tasks.
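The two file-count formulas above can be compared with a small calculation. This is only an arithmetic sketch in plain Python, not Spark code. The function names and the job sizes (1000 upstream tasks, 100 downstream tasks, 10 Executors) are made up for illustration.

```python
def hash_shuffle_files(upstream_tasks: int, downstream_tasks: int) -> int:
    """Unoptimized HashShuffle: each upstream task writes one file per downstream task."""
    return upstream_tasks * downstream_tasks


def optimized_hash_shuffle_files(executors: int, downstream_tasks: int) -> int:
    """Optimized HashShuffle, using the formula quoted in the text above."""
    return executors * downstream_tasks


# Hypothetical job sizes, chosen only to show the difference in scale.
unoptimized = hash_shuffle_files(upstream_tasks=1000, downstream_tasks=100)
optimized = optimized_hash_shuffle_files(executors=10, downstream_tasks=100)
print(unoptimized, optimized)  # 100000 1000
```

With these assumed numbers the file count falls by a factor equal to upstream tasks divided by Executors. That is a linear reduction, not an exponential one.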
SortShuffleManager (preferred)
Normal mechanism (sorting required)
1- Choose the data structure: for an aggregating shuffle operator such as reduceByKey, a 【Map】 data structure is used; for a shuffle operator such as join, an 【Array】 data structure is used.
2- Request memory: requested memory = current data memory * 2 - memory granted at the previous request.
3- Sort: before spilling to a disk file, the data already held in the in-memory data structure is 【sorted】 by key.
4- Spill to disk: after sorting, the data is written to disk files in batches. The default batch size is 10000 records, so the sorted data is written to the disk file 10000 records at a time.
5- Merge: the spilled files are merged into one data file, and an index file records the start offset and end offset of each downstream task's data within it.
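The five steps above can be sketched in plain Python. This is a toy model under stated assumptions, not Spark's implementation. A `dict` stands in for the 【Map】 structure, and all function names are hypothetical. Byte sizes and partition sizes are passed in directly instead of being measured.

```python
from typing import Dict, List, Tuple

BATCH_SIZE = 10000  # default batch size quoted in step 4


def memory_to_request(current_bytes: int, previously_granted_bytes: int) -> int:
    """Step 2: requested memory = current data memory * 2 - memory granted last time."""
    return current_bytes * 2 - previously_granted_bytes


def sort_and_batch(buffer: Dict[str, int],
                   batch_size: int = BATCH_SIZE) -> List[List[Tuple[str, int]]]:
    """Steps 3 and 4: sort buffered records by key, then cut them into write batches."""
    records = sorted(buffer.items())
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def build_index(partition_sizes: List[int]) -> List[Tuple[int, int]]:
    """Step 5: (start offset, end offset) of each downstream task's data in the merged file."""
    index, start = [], 0
    for size in partition_sizes:
        index.append((start, start + size))
        start += size
    return index


# Step 1: a Map-like buffer that aggregates values per key, as reduceByKey would.
buffer: Dict[str, int] = {}
for key, value in [("b", 1), ("a", 2), ("b", 3), ("c", 4)]:
    buffer[key] = buffer.get(key, 0) + value

print(memory_to_request(10, 5))              # 15
print(sort_and_batch(buffer, batch_size=2))  # [[('a', 2), ('b', 4)], [('c', 4)]]
print(build_index([100, 0, 250]))            # [(0, 100), (100, 100), (100, 350)]
```

The index is what lets each downstream task read only its own byte range from the single merged file. An empty partition appears as a range whose start and end offsets are equal.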
Bypass mechanism (no sorting needed)
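The bypass mechanism is used when both of the following conditions hold:
The number of shuffle write tasks is less than or equal to the value of the 【spark.shuffle.sort.bypassMergeThreshold】 parameter (default 【200】). In Spark's source code, the count compared against this threshold is the number of partitions the shuffle produces for the downstream stage.
The operator is not a 【shuffle operator with map-side aggregation】.
reduceByKey is a shuffle operator with map-side aggregation.
groupByKey is not a shuffle operator with map-side aggregation.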
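The two bypass conditions can be written as one small predicate. This is a plain-Python sketch of the rule as the text states it, not Spark's own code. The function name `uses_bypass` and its parameters are hypothetical. Only the parameter name `spark.shuffle.sort.bypassMergeThreshold` and its default of 200 come from the text.

```python
DEFAULT_BYPASS_MERGE_THRESHOLD = 200  # default of spark.shuffle.sort.bypassMergeThreshold


def uses_bypass(task_count: int,
                map_side_aggregation: bool,
                threshold: int = DEFAULT_BYPASS_MERGE_THRESHOLD) -> bool:
    """Return True when the bypass (no-sort) path applies.

    task_count is the count that the text compares against the threshold.
    map_side_aggregation is True for operators such as reduceByKey.
    """
    if map_side_aggregation:
        return False  # operators that aggregate on the map side never bypass
    return task_count <= threshold


print(uses_bypass(100, map_side_aggregation=False))  # True  (e.g. groupByKey)
print(uses_bypass(100, map_side_aggregation=True))   # False (e.g. reduceByKey)
print(uses_bypass(201, map_side_aggregation=False))  # False (above the threshold)
```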