Spark data skew solution
2022-07-27 00:59:00 【A photographer who can't play is not a good programmer】
A Spark program is divided into multiple jobs by its action operations. Each job is divided into multiple stages at shuffle boundaries, and each stage consists of multiple tasks computed in parallel, with each task processing the data of exactly one partition.
Data skew in Spark occurs when a large number of identical keys end up in the same partition.
Preface
The idea behind solving data skew is to make the amount of data computed by each task as uniform as possible. It is of course best if the data is evenly distributed before it reaches Spark; if that is not possible, the data must be processed to make it as uniform as possible before computation, which avoids data skew.
I. Solutions
1. Data preprocessing
Suppose the Spark data source is Hive. The data can then be preprocessed in Hive to make it as uniform as possible, or aggregated in Hive in advance. Once the data reaches Spark, there is no need to run reduceByKey() or similar shuffle-stage operations again, thus avoiding data skew.
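The effect of aggregating at the source can be sketched in plain Python (a stand-in for the Hive pre-aggregation step, not real Spark or Hive code; the table contents are hypothetical):

```python
from collections import Counter

# Raw skewed rows as they might sit in the Hive table: "user_a" dominates.
raw_rows = [("user_a", 1)] * 6 + [("user_b", 1)] * 2 + [("user_c", 1)]

# "Hive-side" pre-aggregation: emit one row per key, so the Spark job that
# reads this table no longer needs a shuffle-heavy reduceByKey() of its own.
pre_aggregated = Counter()
for key, value in raw_rows:
    pre_aggregated[key] += value

print(dict(pre_aggregated))  # {'user_a': 6, 'user_b': 2, 'user_c': 1}
```

With one row per key arriving in Spark, downstream transformations can stay map-only, and the skew-prone shuffle never happens.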
2. Filtering causes data skew key
If data skew occurs key No practical significance ( For example, there are many key yes “-”), There will be no impact on the business , Then you can use Spark When reading data, use filter() Operator filters it out , Filtered out key Will not participate in the following calculation , Thus eliminating data skew .
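The filtering step can be sketched in plain Python (the records and the "-" placeholder key are hypothetical; in Spark this would be `rdd.filter(lambda kv: kv[0] != "-")`):

```python
# Hypothetical (key, value) records; "-" is a meaningless placeholder key
# that dominates the data and would skew one partition.
records = [("-", 1)] * 5 + [("order_1", 10), ("order_2", 7)]

# Drop the skew-causing key before any shuffle happens.
filtered = [kv for kv in records if kv[0] != "-"]

print(filtered)  # [('order_1', 10), ('order_2', 7)]
```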
3. Increase the parallelism of shuffle operations
The shuffle process of a Spark RDD is similar to that of MapReduce: the data is repartitioned. If the number of partitions (the parallelism) is set improperly, many different keys may be assigned to the same partition, making some tasks process far more data than others and causing data skew. Shuffle operators such as groupByKey() and reduceByKey() accept a parameter specifying the number of partitions (the parallelism), so keys originally assigned to one task are spread over several tasks, and each task processes less data than before, which effectively mitigates the impact of the skew. For example, when calling reduceByKey() on an RDD you can pass in the partition count, e.g. reduceByKey(func, 20); this parameter specifies the parallelism of the shuffle, i.e., the number of partitions after the data is reorganized.
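How more partitions spread keys across tasks can be sketched in plain Python by mimicking hash partitioning (Spark's HashPartitioner computes `key.hashCode % numPartitions`; the integer keys here are hypothetical, chosen because Python hashes small ints deterministically):

```python
def partition_sizes(keys, num_partitions):
    # Mimic hash partitioning: each key goes to partition hash(key) % num_partitions.
    sizes = [0] * num_partitions
    for key in keys:
        sizes[hash(key) % num_partitions] += 1
    return sizes

keys = list(range(100))            # 100 distinct keys
few = partition_sizes(keys, 2)     # low parallelism: 50 keys per task
many = partition_sizes(keys, 20)   # like reduceByKey(func, 20): 5 keys per task
print(few, many)  # [50, 50] [5, 5, ..., 5]
```

Raising the partition count shrinks the per-task load, though it cannot help when the skew comes from a single hot key, since all of that key's records still hash to one partition.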
4. Double aggregation with random key prefixes
Adding a random number as a prefix to identical keys turns one key into many different keys, so records that would all land in one partition are spread across multiple partitions and processed by multiple tasks, preventing any single task from handling too much data. This first pass is only a local aggregation; the random prefix is then stripped from each key and a global aggregation is performed to obtain the final result, thereby avoiding data skew. This double-aggregation approach suits Spark aggregation-class operators such as groupByKey() and reduceByKey().
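The two passes can be sketched in plain Python (a simulation of the salting technique, not real Spark code; the key names, salt range of 3, and "_"-separated prefix format are all hypothetical choices):

```python
import random

pairs = [("hot", 1)] * 8 + [("cold", 1)] * 2  # "hot" is the skewed key

# Pass 1 (local aggregation): salt each key with a random prefix so "hot"
# splits into up to three salted keys (0_hot, 1_hot, 2_hot), which Spark
# would then spread across multiple partitions/tasks.
local = {}
for key, value in pairs:
    salted = f"{random.randrange(3)}_{key}"
    local[salted] = local.get(salted, 0) + value

# Pass 2 (global aggregation): strip the salt and combine the partial sums.
final = {}
for salted, value in local.items():
    key = salted.split("_", 1)[1]
    final[key] = final.get(key, 0) + value

print(final)  # -> {'hot': 8, 'cold': 2}
```

Whatever salts the random generator picks, the two passes always reproduce the same totals as a single direct aggregation; only the intermediate load distribution changes.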