Spark data skew solution
2022-07-27 00:59:00 【A photographer who can't play is not a good programmer】
A Spark program is divided into multiple jobs by its action operations. Each job is divided into multiple stages at shuffle boundaries, and each stage consists of multiple tasks computed in parallel, with each task processing the data of exactly one partition.
Data skew in Spark occurs when a large number of identical keys end up in the same partition.
Preface
The idea behind solving data skew is to make the amount of data computed by each task as uniform as possible. It is of course best if the data is evenly distributed before it reaches Spark; if that is not possible, the data must be processed to make it as uniform as possible before computation, which avoids data skew.
I. Solutions
1. Data preprocessing
Suppose the Spark data source is Hive. The data can then be preprocessed in Hive to make it as uniform as possible, or aggregated in Hive in advance. Once the data reaches Spark, there is no need to run reduceByKey() or similar shuffle-stage operations again, thus avoiding data skew.
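The effect of aggregating at the source can be sketched in plain Python (a stand-in for the Hive pre-aggregation step, not real Spark or Hive code; the table contents are hypothetical):

```python
from collections import Counter

# Raw skewed rows as they might sit in the Hive table: "user_a" dominates.
raw_rows = [("user_a", 1)] * 6 + [("user_b", 1)] * 2 + [("user_c", 1)]

# "Hive-side" pre-aggregation: emit one row per key, so the Spark job that
# reads this table no longer needs a shuffle-heavy reduceByKey() of its own.
pre_aggregated = Counter()
for key, value in raw_rows:
    pre_aggregated[key] += value

print(dict(pre_aggregated))  # {'user_a': 6, 'user_b': 2, 'user_c': 1}
```

With one row per key arriving in Spark, downstream transformations can stay map-only, and the skew-prone shuffle never happens.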
2. Filtering causes data skew key
If data skew occurs key No practical significance ( For example, there are many key yes “-”), There will be no impact on the business , Then you can use Spark When reading data, use filter() Operator filters it out , Filtered out key Will not participate in the following calculation , Thus eliminating data skew .
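The filtering step can be sketched in plain Python (the records and the "-" placeholder key are hypothetical; in Spark this would be `rdd.filter(lambda kv: kv[0] != "-")`):

```python
# Hypothetical (key, value) records; "-" is a meaningless placeholder key
# that dominates the data and would skew one partition.
records = [("-", 1)] * 5 + [("order_1", 10), ("order_2", 7)]

# Drop the skew-causing key before any shuffle happens.
filtered = [kv for kv in records if kv[0] != "-"]

print(filtered)  # [('order_1', 10), ('order_2', 7)]
```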
3. Increase the parallelism of shuffle operations
The shuffle process of a Spark RDD is similar to that of MapReduce: the data is repartitioned. If the number of partitions (the parallelism) is set improperly, many different keys may be assigned to the same partition, making some tasks process far more data than others and causing data skew. Shuffle operators such as groupByKey() and reduceByKey() accept a parameter specifying the number of partitions (the parallelism), so keys originally assigned to one task are spread over several tasks, and each task processes less data than before, which effectively mitigates the impact of the skew. For example, when calling reduceByKey() on an RDD you can pass in the partition count, e.g. reduceByKey(func, 20); this parameter specifies the parallelism of the shuffle, i.e., the number of partitions after the data is reorganized.
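How more partitions spread keys across tasks can be sketched in plain Python by mimicking hash partitioning (Spark's HashPartitioner computes `key.hashCode % numPartitions`; the integer keys here are hypothetical, chosen because Python hashes small ints deterministically):

```python
def partition_sizes(keys, num_partitions):
    # Mimic hash partitioning: each key goes to partition hash(key) % num_partitions.
    sizes = [0] * num_partitions
    for key in keys:
        sizes[hash(key) % num_partitions] += 1
    return sizes

keys = list(range(100))            # 100 distinct keys
few = partition_sizes(keys, 2)     # low parallelism: 50 keys per task
many = partition_sizes(keys, 20)   # like reduceByKey(func, 20): 5 keys per task
print(few, many)  # [50, 50] [5, 5, ..., 5]
```

Raising the partition count shrinks the per-task load, though it cannot help when the skew comes from a single hot key, since all of that key's records still hash to one partition.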
4. Double aggregation with random key prefixes
Adding a random number as a prefix to identical keys turns one key into many different keys, so records that would all land in one partition are spread across multiple partitions and processed by multiple tasks, preventing any single task from handling too much data. This first pass is only a local aggregation; the random prefix is then stripped from each key and a global aggregation is performed to obtain the final result, thereby avoiding data skew. This double-aggregation approach suits Spark aggregation-class operators such as groupByKey() and reduceByKey().
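The two passes can be sketched in plain Python (a simulation of the salting technique, not real Spark code; the key names, salt range of 3, and "_"-separated prefix format are all hypothetical choices):

```python
import random

pairs = [("hot", 1)] * 8 + [("cold", 1)] * 2  # "hot" is the skewed key

# Pass 1 (local aggregation): salt each key with a random prefix so "hot"
# splits into up to three salted keys (0_hot, 1_hot, 2_hot), which Spark
# would then spread across multiple partitions/tasks.
local = {}
for key, value in pairs:
    salted = f"{random.randrange(3)}_{key}"
    local[salted] = local.get(salted, 0) + value

# Pass 2 (global aggregation): strip the salt and combine the partial sums.
final = {}
for salted, value in local.items():
    key = salted.split("_", 1)[1]
    final[key] = final.get(key, 0) + value

print(final)  # -> {'hot': 8, 'cold': 2}
```

Whatever salts the random generator picks, the two passes always reproduce the same totals as a single direct aggregation; only the intermediate load distribution changes.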