
Common ideas for handling data skew in SparkSQL

2022-06-09 06:24:00 lixia0417mul2

Suppose there is a Spark table user_fan that stores users' fans, with two fields: the user id (userId) and the fan id (fanId). To get the number of fans per user, we have the following SQL:

select userId,count(1) as cnt from user_fan group by userId
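To experiment with the queries below, here is a minimal sketch of the table (only the userId and fanId columns come from the article; the storage format and sample rows are hypothetical):

-- Hypothetical DDL and sample data for trying the queries in this article.
CREATE TABLE user_fan (userId BIGINT, fanId BIGINT) USING parquet;
-- User 1 stands in for a heavy user, user 2 for a light one.
INSERT INTO user_fan VALUES (1, 101), (1, 102), (1, 103), (2, 104);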

We know the number of fans is uneven: some users have tens of millions of fans while others have only dozens. With this SQL, the task for a partition holding a large amount of data takes a long time, while tasks for partitions with little data finish almost instantly. Of course, if the data happened to be evenly distributed across tasks there would be no problem. For example, if users A and B each have 1,000,000 fans and users C and D each have 10, and A and C land in one partition while B and D land in the other, the two partitions are balanced. But that is pure coincidence: partitioning follows the data layout and knows nothing about business meaning, so A, B, and C may just as well end up in one partition and D in another, making the data very uneven. What we can do is try to make each partition more uniform. Since grouping by the user dimension alone clearly cannot achieve this, how should we do it?
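You can see the single hash shuffle on userId that produces these skewed partitions (EXPLAIN is standard Spark SQL; the exact plan text varies by Spark version):

-- Look for the Exchange (shuffle) keyed on userId in the physical plan;
-- that is the stage whose partitions become skewed.
EXPLAIN select userId, count(1) as cnt from user_fan group by userId;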

What we have to do is divide and conquer.
Approach 1:

select userId, sum(ufcnt) as cnt from
  (select userId, fanMod, count(1) as ufcnt from
    (select userId, fanId, fanId % 10 as fanMod from user_fan) as a
   group by userId, fanMod) as b
group by userId

First we group by userId + fanMod, where fanMod is fanId % 10 (the modulus can just as well be 100, etc.). Because the data are grouped by userId + fanMod, each group is roughly the same size, so in the first stage, which produces the count for each userId + fanMod pair, every task runs for about the same time and can take full advantage of CPU parallelism. In the second stage we sum those partial counts (hence sum(ufcnt) rather than count(1) in the outer query) to get the count per userId. In this stage a user with many fans contributes at most 10 or 100 fanMod rows, while a user with few fans may contribute only 1; since that difference is tiny (nothing like the gap of millions between users in the original SQL), the second-stage tasks also run at similar speeds. The goal of divide and conquer is thus achieved.
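The same two-stage idea works even when there is no natural column like fanId to take a modulus of: attach a random salt instead (a sketch; rand() and floor() are built-in Spark SQL functions, and the salt width of 10 is arbitrary):

-- Stage 1 aggregates by (userId, salt), splitting each hot key into up to
-- 10 pieces; stage 2 sums the partial counts back to one row per userId.
select userId, sum(ufcnt) as cnt from
  (select userId, salt, count(1) as ufcnt from
    (select userId, floor(rand() * 10) as salt from user_fan) as a
   group by userId, salt) as b
group by userId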

Approach 2:
The idea behind the second approach is to put the records of users with many fans into one group and the records of users with few fans into another. The premise is that you know in advance which users have many fans and which have few; the split does not need to be precise. Then compute each user's fan count within each group. Because the users within a group have fan counts of the same order of magnitude, neither aggregation is skewed. Finally, the per-group results are combined to get the final result.

-- Big fan group
select userId, count(1) as cnt from user_fan where userId in ( Big fan group ) group by userId
-- union all: the two groups are disjoint, so the dedup a plain union would do is wasted work
union all
-- Small fan group
select userId, count(1) as cnt from user_fan where userId in ( Small fan group ) group by userId
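If the big-fan list is not known in advance, one way to obtain a rough version of it cheaply is to estimate hot keys from a sample (a sketch; TABLESAMPLE is Spark SQL syntax, while the big_fan_users table name, the 1 percent rate, and the 10000 threshold are illustrative, not from the article):

-- Hypothetical helper table of hot userIds, built from a 1% sample so the
-- full skewed aggregation never has to run; scale the threshold to the rate.
CREATE TABLE big_fan_users AS
select userId from user_fan TABLESAMPLE (1 PERCENT)
group by userId having count(1) > 10000;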

Copyright notice
This article was created by [lixia0417mul2]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/160/202206090623128715.html