Common approaches to handling data skew in Spark SQL
2022-06-09 06:24:00 【lixia0417mul2】
Suppose Spark has a table user_fan that stores each user's fans. It has two fields: userId (the user's id) and fanId (the fan's id). To get the number of fans per user, we would write the following SQL:
select userId,count(1) as cnt from user_fan group by userId
The number of fans is very uneven: some users have tens of millions of fans, while others have only a few dozen. With this SQL, a task that handles a partition with a large amount of data runs for a long time, while a task that handles a partition with little data finishes almost immediately.
The partitions could happen to be balanced. For example, say users A and B each have 1 million fans, and users C and D each have 10 fans. If A and C land in one partition and B and D in another, there is no problem. That would be a coincidence, though. In most cases partitioning is driven by file size and ignores the business meaning of the data, so A, B and C may end up in one partition and D alone in another, which makes the data volumes very uneven.
What we can do is try to make the partitions more uniform. Grouping by the user dimension alone clearly cannot achieve this, so how do we do it?
The answer is divide and conquer.
Approach 1:
select userId,sum(ufcnt) as cnt from
(select userId,fanMod,count(1) as ufcnt from
(select userId,fanId,fanId % 10 as fanMod from user_fan) as a group by userId,fanMod) as b group by userId
First, group by userId + fanMod. The modulus used for fanMod can be 10, 100, and so on. The groups keyed by userId + fanMod are roughly uniform in size, so in the first stage, which produces the count for each userId + fanMod pair, every task runs for about the same time and the work can make full use of parallel CPUs.
The second stage then aggregates these partial results into the count per userId. In this stage a user with many fans contributes as many rows as there are fanMod values, for example 10 or 100, while a user with few fans may contribute only 1. That difference is very small compared with the original SQL, where the gap between users was in the millions of fans, so the running time of the second-stage tasks does not vary much. This is how divide and conquer is achieved.
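The two-stage aggregation can be checked on a toy data set. The sketch below is not Spark: it uses Python's built-in sqlite3 module as a stand-in SQL engine, and the data (FANS_PER_USER) is made up for illustration. It only shows that the two-stage query returns the same totals as the direct query, and that the largest group in the first stage is much smaller than the largest group in the direct query. It says nothing about actual Spark run times. Note that the outer query sums the per-bucket counts (SUM(ufcnt)), which is what makes the totals match.

```python
import sqlite3

# Hypothetical toy data: user 1 is the "hot" user, users 2 and 3 are small.
FANS_PER_USER = {1: 1000, 2: 5, 3: 3}

# sqlite3 is only a stand-in for Spark SQL here (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_fan (userId INTEGER, fanId INTEGER)")
rows = [(user, fan) for user, n in FANS_PER_USER.items() for fan in range(n)]
conn.executemany("INSERT INTO user_fan VALUES (?, ?)", rows)

# Direct aggregation: one group per user, so group sizes are very uneven.
direct = dict(conn.execute(
    "SELECT userId, COUNT(1) AS cnt FROM user_fan GROUP BY userId"))

# Stage 1 alone: one group per (userId, fanMod) pair.
stage1 = conn.execute(
    "SELECT userId, fanId % 10 AS fanMod, COUNT(1) AS ufcnt "
    "FROM user_fan GROUP BY userId, fanId % 10").fetchall()

# Stage 1 + stage 2: sum the partial counts back up per user.
two_stage = dict(conn.execute(
    "SELECT userId, SUM(ufcnt) AS cnt FROM ("
    "  SELECT userId, fanId % 10 AS fanMod, COUNT(1) AS ufcnt"
    "  FROM user_fan GROUP BY userId, fanId % 10"
    ") AS b GROUP BY userId"))

largest_direct_group = max(direct.values())
largest_stage1_group = max(ufcnt for _, _, ufcnt in stage1)

print("direct:   ", direct)
print("two-stage:", two_stage)
print("largest group, direct: ", largest_direct_group)
print("largest group, stage 1:", largest_stage1_group)
```

With a modulus of 10, the hot user's rows are spread over 10 groups in the first stage, so the largest unit of work shrinks by roughly that factor. A larger modulus spreads the load further, at the cost of more rows to merge in the second stage.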
Approach 2:
The idea of the second approach is to put the records of users with many fans into one group and the records of users with few fans into another. The precondition is that you know in advance which users have many fans and which have few, although the numbers do not need to be precise. Then compute the number of fans per user within each group. Because the fan counts of the users within a group are of the same order of magnitude, the data volumes are not skewed. Once each group has produced its result, the results are combined to get the final result.
-- Big fan group
select userId,count(1) as cnt from user_fan where userId in ( Big fan group ) group by userId
union all
-- Small fan group
select userId,count(1) as cnt from user_fan where userId in ( Small fan group ) group by userId
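The split-and-combine idea can also be tried on toy data. As before, this is only a sketch: Python's built-in sqlite3 module stands in for Spark SQL, and the data and the two id lists (BIG_FAN_USERS, SMALL_FAN_USERS) are hypothetical values that would have to be known in advance in a real job. It checks that aggregating each group separately and combining the results with UNION ALL gives the same counts as the direct query. UNION ALL is enough here because the two groups contain disjoint users, so there are no duplicate rows to remove.

```python
import sqlite3

# Hypothetical toy data: users 1 and 2 have many fans, users 3 and 4 have few.
FANS_PER_USER = {1: 1000, 2: 800, 3: 5, 4: 3}
BIG_FAN_USERS = (1, 2)    # assumed to be known in advance
SMALL_FAN_USERS = (3, 4)  # assumed to be known in advance

# sqlite3 is only a stand-in for Spark SQL here (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_fan (userId INTEGER, fanId INTEGER)")
rows = [(user, fan) for user, n in FANS_PER_USER.items() for fan in range(n)]
conn.executemany("INSERT INTO user_fan VALUES (?, ?)", rows)


def placeholders(ids):
    """Return a '?, ?, ...' string with one placeholder per id."""
    return ", ".join("?" for _ in ids)


# Aggregate each group separately, then combine the two result sets.
query = (
    "SELECT userId, COUNT(1) AS cnt FROM user_fan "
    f"WHERE userId IN ({placeholders(BIG_FAN_USERS)}) GROUP BY userId "
    "UNION ALL "
    "SELECT userId, COUNT(1) AS cnt FROM user_fan "
    f"WHERE userId IN ({placeholders(SMALL_FAN_USERS)}) GROUP BY userId"
)
combined = dict(conn.execute(query, BIG_FAN_USERS + SMALL_FAN_USERS))

# Reference result from the direct query.
direct = dict(conn.execute(
    "SELECT userId, COUNT(1) AS cnt FROM user_fan GROUP BY userId"))

print("combined:", combined)
print("direct:  ", direct)
```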