Spark data skew solution
2022-07-27 00:59:00 【A photographer who can't play is not a good programmer】
A Spark program is divided into multiple jobs according to its action operations; each job is divided into multiple stages according to its shuffle operations; each stage is computed in parallel by multiple tasks, and each task computes the data of only one partition.
Spark data skew means that a large number of identical keys end up in the same partition.
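As a minimal sketch of what this looks like (the data and the partition count of 4 are hypothetical), the per-partition record counts of a skewed pair RDD can be inspected directly:

```scala
import org.apache.spark.sql.SparkSession

object InspectSkew {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("InspectSkew").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical data: the key "hot" appears far more often than the others.
    val pairs = sc.parallelize(Seq.fill(10000)(("hot", 1)) ++ Seq(("a", 1), ("b", 1)))

    // After a hash-partitioned shuffle, every "hot" record lands in the same partition.
    val grouped = pairs.groupByKey(4)

    // Count records per partition; one partition will dominate.
    grouped.mapPartitionsWithIndex { (idx, it) =>
      Iterator((idx, it.map(_._2.size).sum))
    }.collect().foreach { case (idx, n) => println(s"partition $idx: $n records") }

    spark.stop()
  }
}
```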
Data skew
Preface
The idea behind solving data skew is to make the data computed by each task as uniform as possible. It is of course best if the data is already evenly distributed before it reaches Spark; if that is not possible, the data should be processed to make it as uniform as possible before the computation, which avoids the skew.
I. Solutions
1. Data preprocessing
Assuming Spark's data source is Hive, the data can be preprocessed in Hive to keep it as uniform as possible, or aggregated in Hive ahead of time. After the data is loaded into Spark, there is then no need to run reduceByKey() or similar aggregation operators again, so the shuffle stage is skipped and the data skew is avoided.
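A minimal sketch of this idea, assuming a Hive-enabled Spark session (the table `logs` and its columns are hypothetical placeholders): the aggregation is pushed down to Hive SQL so Spark only reads pre-aggregated rows.

```scala
import org.apache.spark.sql.SparkSession

object HivePreAggregate {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HivePreAggregate")
      .enableHiveSupport()
      .getOrCreate()

    // Aggregate inside Hive first; `logs` and its columns are
    // hypothetical names used only for illustration.
    val preAggregated = spark.sql(
      """SELECT user_id, COUNT(*) AS cnt
        |FROM logs
        |GROUP BY user_id""".stripMargin)

    // No further reduceByKey()-style shuffle is needed on the Spark side.
    preAggregated.show()
    spark.stop()
  }
}
```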
2. Filter out the keys that cause the data skew
If the keys that cause the skew have no practical meaning (for example, many of the keys are "-") and dropping them has no impact on the business, they can be filtered out with the filter() operator when Spark reads the data. The filtered-out keys no longer participate in subsequent computation, which eliminates the data skew.
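A minimal sketch, assuming hypothetical (key, value) records where "-" is the meaningless skewed key:

```scala
import org.apache.spark.sql.SparkSession

object FilterSkewedKeys {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FilterSkewedKeys").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical data; in practice this would come from the actual source.
    val data = sc.parallelize(Seq(("-", 1), ("a", 2), ("-", 3), ("b", 4)))

    // Drop the meaningless key "-" before any shuffle happens.
    val cleaned = data.filter { case (key, _) => key != "-" }

    cleaned.reduceByKey(_ + _).collect().foreach(println)
    spark.stop()
  }
}
```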
3. Increase the parallelism of shuffle operations
The shuffle process of a Spark RDD is similar to that of MapReduce: data is redistributed across partitions. If the number of partitions (the parallelism) is set improperly, many different keys are likely to be assigned to the same partition, so one task ends up processing far more data than the other tasks, which causes data skew. Aggregation operators such as groupByKey() and reduceByKey() accept an extra parameter that specifies the number of partitions (the parallelism). Keys that were originally assigned to a single task are then distributed across multiple tasks, each task processes less data than before, and the impact of the skew is effectively mitigated. For example, when calling reduceByKey() on an RDD, you can pass the partition count as the second argument, e.g. reduceByKey(_ + _, 20); this sets the parallelism of the shuffle, that is, the number of partitions after the data is reorganized.
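A minimal sketch with hypothetical data, showing the extra numPartitions argument:

```scala
import org.apache.spark.sql.SparkSession

object IncreaseShuffleParallelism {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("IncreaseShuffleParallelism").getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

    // The second argument sets the number of partitions after the shuffle,
    // spreading keys that previously collided in one partition across 20.
    val summed = pairs.reduceByKey(_ + _, 20)

    println(summed.getNumPartitions) // 20
    spark.stop()
  }
}
```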
4. Double aggregation with random keys
Prefix identical keys with random numbers, turning one key into many different keys, so that keys originally assigned to a single partition are spread across multiple partitions and processed by multiple tasks, preventing any single task from handling too much data. This first step is only a local aggregation; then strip the random prefix from each key and perform a global aggregation to obtain the final result, avoiding the data skew. This double-aggregation approach is suitable for Spark's aggregation-class operators such as groupByKey() and reduceByKey().
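A minimal sketch of the two-stage aggregation, assuming hypothetical skewed data where the key "hot" dominates and a random prefix in the range 0-9:

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Random

object DoubleAggregation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DoubleAggregation").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical skewed data: the key "hot" dominates.
    val skewed = sc.parallelize(Seq.fill(1000)(("hot", 1)) ++ Seq(("cold", 1)))

    // Stage 1: prefix each key with a random number (0-9), then aggregate locally,
    // so "hot" is split into up to 10 distinct keys across partitions.
    val locallyAggregated = skewed
      .map { case (key, value) => (s"${Random.nextInt(10)}_$key", value) }
      .reduceByKey(_ + _)

    // Stage 2: strip the random prefix and aggregate globally for the final result.
    val result = locallyAggregated
      .map { case (prefixedKey, value) => (prefixedKey.split("_", 2)(1), value) }
      .reduceByKey(_ + _)

    result.collect().foreach(println)
    spark.stop()
  }
}
```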