当前位置:网站首页>Understand the difference between reducebykey and groupbykey in spark
Understand the difference between reducebykey and groupbykey in spark
2022-06-13 03:34:00 【TRX1024】
Catalog
One 、 Look at the conclusion first.
Two 、 give an example 、 Drawing description
1. What are the functions implemented ?
1).groupByKey Realization WordCount
2).reduceByKey Realization WordCount
2. Draw and analyze the difference between the two implementation methods
1) groupByKey Realization WordCount
2).reduceByKey Realization WordCount( Simple process )
3).reduceByKey Realization WordCount( The ultimate process )
One 、 Look at the conclusion first.
1. from Shuffle The angle of
reduceByKey and groupByKey All exist shuffle operation , however reduceByKey Can be in shuffle For the same partition key The data set of Prepolymerization (combine) function , This will reduce the amount of data falling on the disk , and groupByKey Just grouping , There is no problem of data reduction ,reduceByKey High performance .
2. From a functional point of view
reduceByKey In fact, it contains the function of grouping and aggregation ;groupByKey You can only group , Can't aggregate , So in the case of group aggregation , Recommended reduceByKey, If it's just grouping without aggregation , Then you can only use groupByKey.
Two 、 give an example 、 Drawing description
1. What are the functions implemented ?
For the sake of understanding , Two operators are used to realize WordCount Program . Suppose the word has been processed into (word,1) In the form of , I use List(("a", 1), ("a", 1), ("a", 1), ("b", 1)) As a data source .
1).groupByKey Realization WordCount
function :groupByKey The data of the data source can be divided into key Yes value Grouping
So first of all , Use alone groupByKey, What is its return value
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
val sc = new SparkContext(sparkConf)
// obtain RDD
val rdd = sc.makeRDD(List(("a", 1), ("a", 1), ("a", 1), ("b", 1)))
val reduceRDD = rdd.groupByKey()
reduceRDD.collect().foreach(println)
sc.stop()
/**
* Running results :
* (a,CompactBuffer(1, 1, 1))
* (b,CompactBuffer(1))
*/
}You can see , The result is RDD[(String, Iterable[Int])], That is to say (a,(1,1,1)),(b,(1,1,1)).
To achieve WordCount, One more step is needed Map operation :
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
val sc = new SparkContext(sparkConf)
// obtain RDD
val rdd = sc.makeRDD(List(("a", 1), ("a", 1), ("a", 1), ("b", 1)))
val reduceRDD = rdd.groupByKey().map {
case (word, iter) => {
(word, iter.size)
}
}
reduceRDD.collect().foreach(println)
sc.stop()
/**
* Running results :
* (a,3)
* (b,1)
*/
}2).reduceByKey Realization WordCount
function :reduceByKey The data can be in the same Key Yes Value Make a pairwise polymerization , This aggregation method needs to be specified .
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
val sc = new SparkContext(sparkConf)
// obtain RDD
val rdd = sc.makeRDD(List(("a", 1), ("a", 1), ("a", 1), ("b", 1)))
// Specify the calculation formula as x+y
val reduceRDD = rdd.reduceByKey((x,y) => x + y)
reduceRDD.collect().foreach(println)
sc.stop()
/**
* Running results :
* (a,3)
* (b,1)
*/
}2. Draw and analyze the difference between the two implementation methods
For the convenience of demonstration Shuffle The process , Now suppose there are two partitions of data .
1) groupByKey Realization WordCount

Reading :
1. Red RDD Data source , Contains two partitions (word,1) data
2.Shuffle The process ( We all know Shuffle The process requires disks IO Of )
3.groupByKey After RDD, according to key Group pair Value Aggregate
4.Map Operation calculation WordCount
summary :groupbykey It will cause data disruption and reorganization , There is shuffle operation .
2).reduceByKey Realization WordCount( Simple process )

Reading :
1. Red RDD Data source , Contains two partitions (word,1) data
2.Shuffle The process
3. According to the specified aggregation formula , Yes Value The result of pairwise aggregation RDD
Come here to see , Feeling groupbykey and reduceByKey Realization WordCount The calculation method is similar , In terms of performance , There are Shuffle operation , So it doesn't make much difference in terms of computing performance ; functionally , All have groups , It's just reduceByKey There are aggregation operations , and groupbykey There is no aggregation operation , It is aggregated by increasing map Operation to achieve , So it doesn't seem to make much difference .
So what is the core difference between them ?
3).reduceByKey Realization WordCount( The ultimate process )
Once again reduceByKey Function introduction of : The data can be in the same Key Yes Value Make a pairwise polymerization .
Think about a problem : from 2) Is there a phenomenon in the figure of , In red RDD There is the same in a partition of Key, and value It can be aggregated . stay groupbykey In the process of implementation , because groupbykey No aggregation , The implementation of aggregation calculation is to aggregate all data after grouping . and reduceByKey It has aggregation function , In the process of implementation , The aggregation condition is also satisfied before grouping ( Have the same key,value Can polymerize ), that reduceByKey Are the data aggregated before grouping ?( The answer is yes , We call it a pre aggregation operation )
therefore , Its flow chart becomes like this :

Reading :
1. Red RDD Data source , Contains two partitions (word,1) data , Pre aggregate the data in the partition before grouping
2.Shuffle operation
3. According to the specified aggregation formula , Yes Value The result of pairwise aggregation RDD
What are the changes ?
1. Data is pre aggregated before grouping , The amount of data participating in the grouping becomes smaller , That is, participate in Shuffle The amount of data becomes smaller
2. Because of participation Shuffle The amount of data becomes smaller , therefore Shuffle Disk at IO The number of times will be reduced
3. The aggregation calculation time and quantity calculation times become less
From this we can draw a conclusion :
reduceByKey Support the pre aggregation function within the partition , Can effectively reduce Shuffle The amount of data falling on the disk , promote Shuffle Performance of .
边栏推荐
- Neil eifrem, CEO of neo4j, interprets the chart data platform and leads the development of database in the next decade
- Brew tool - "fatal: could not resolve head to a revision" error resolution
- Understanding the ongdb open source map data foundation from the development of MariaDB
- 视频播放屡破1000W+,在快手如何利用二次元打造爆款
- 2000-2019 enterprise registration data of all provinces, cities and counties in China (including longitude and latitude, registration number and other multi indicator information)
- 2021-08-30 distributed cluster
- MySQL and PostgreSQL installation subtotal
- Common command records of redis client
- English语法_方式副词-位置
- This article takes you to learn DDD, basic introduction
猜你喜欢
![[azure data platform] ETL tool (8) - ADF dataset and link service](/img/bf/d6d3a8c1139bb8d38ab9ee1ab9754e.jpg)
[azure data platform] ETL tool (8) - ADF dataset and link service

MASA Auth - SSO与Identity设计

Video playback has repeatedly broken 1000w+, how to use the second dimension to create a popular model in Kwai
![[azure data platform] ETL tool (1) -- Introduction to azure data factory](/img/0c/cd054c65aee6db5ae690f104db58a3.jpg)
[azure data platform] ETL tool (1) -- Introduction to azure data factory

Quickly obtain the attributes of the sub graph root node

Time processing class in PHP

Microservice practice based on rustlang

Complex network analysis capability based on graph database
![Data Governance Series 1: data governance framework [interpretation and analysis]](/img/d9/1476d0ee2c82f5cdd70b4cffaba423.jpg)
Data Governance Series 1: data governance framework [interpretation and analysis]
![[JVM Series 5] JVM tuning instance](/img/29/271fa25a338ee1268f7bce58673e11.jpg)
[JVM Series 5] JVM tuning instance
随机推荐
Doris data import broker load
The use of curl in PHP
Pollution discharge fees of listed companies 2010-2020 & environmental disclosure level of heavy pollution industry - original data and calculation results
Coal industry database - coal price, consumption, power generation & Provincial Civil and industrial power consumption data
How to Draw Useful Technical Architecture Diagrams
Parallel one degree relation query
ONNX+TensorRT+YoloV5:基于trt+onnx得yolov5部署1
China Civil Aviation Statistical Yearbook (1996-2020)
Graph data modeling tool
Complex network analysis capability based on graph database
Data of all bank outlets in 356 cities nationwide (as of February 13, 2022)
C simple understanding - arrays and sets
Transaction processing in PDO
Use of compact, extract and list functions in PHP
Technical documentbookmark
Yolov5 face+tensorrt: deployment based on win10+tensorrt8.2+vs2019
Azure SQL db/dw series (10) -- re understanding the query store (3) -- configuring the query store
Isolation level, unreal read, gap lock, next key lock
Time processing class in PHP
Redis memory optimization and distributed locking