Understanding the difference between reduceByKey and groupByKey in Spark

2022-06-13 03:34:00 【TRX1024】

Contents

I. The conclusion first

II. Examples and diagrams

1. What do the two operators do?

1) WordCount with groupByKey

2) WordCount with reduceByKey

2. Diagrams: how the two implementations differ

1) WordCount with groupByKey

2) WordCount with reduceByKey (simplified process)

3) WordCount with reduceByKey (complete process)

I. The conclusion first

1. From the shuffle perspective

Both reduceByKey and groupByKey involve a shuffle. However, reduceByKey can pre-aggregate (combine) the values that share a key within each partition before the shuffle, which reduces the amount of data written to disk. groupByKey only groups and does not reduce the data volume, so reduceByKey performs better.

2. From the functional perspective

reduceByKey combines grouping and aggregation; groupByKey only groups and does not aggregate. For grouped aggregation, reduceByKey is therefore recommended. If you only need grouping without aggregation, use groupByKey.

II. Examples and diagrams

1. What do the two operators do?

To make this easier to understand, WordCount is implemented with each of the two operators. Assume the words have already been mapped to (word, 1) pairs; I use List(("a", 1), ("a", 1), ("a", 1), ("b", 1)) as the data source.

1) WordCount with groupByKey

Function: groupByKey groups the values of the data source by key.

First, what does groupByKey return when used on its own?

  import org.apache.spark.{SparkConf, SparkContext}

  object GroupByKeyDemo {
    def main(args: Array[String]): Unit = {
      val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
      val sc = new SparkContext(sparkConf)
      // Create the RDD of (word, 1) pairs
      val rdd = sc.makeRDD(List(("a", 1), ("a", 1), ("a", 1), ("b", 1)))
      // Group the values by key; no aggregation happens here
      val groupRDD = rdd.groupByKey()
      groupRDD.collect().foreach(println)
      sc.stop()
      /**
       * Output:
       * (a,CompactBuffer(1, 1, 1))
       * (b,CompactBuffer(1))
       */
    }
  }

As you can see, the result is an RDD[(String, Iterable[Int])], that is, (a,(1,1,1)), (b,(1)).

To get WordCount, one more map step is needed:

  import org.apache.spark.{SparkConf, SparkContext}

  object GroupByKeyWordCount {
    def main(args: Array[String]): Unit = {
      val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
      val sc = new SparkContext(sparkConf)
      // Create the RDD of (word, 1) pairs
      val rdd = sc.makeRDD(List(("a", 1), ("a", 1), ("a", 1), ("b", 1)))
      // Group by key, then count the values in each group
      val wordCountRDD = rdd.groupByKey().map {
        case (word, iter) => (word, iter.size)
      }
      wordCountRDD.collect().foreach(println)
      sc.stop()

      /**
       * Output:
       * (a,3)
       * (b,1)
       */
    }
  }

2) WordCount with reduceByKey

Function: reduceByKey aggregates the values of each key pairwise. The aggregation function must be supplied.

  import org.apache.spark.{SparkConf, SparkContext}

  object ReduceByKeyWordCount {
    def main(args: Array[String]): Unit = {
      val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
      val sc = new SparkContext(sparkConf)
      // Create the RDD of (word, 1) pairs
      val rdd = sc.makeRDD(List(("a", 1), ("a", 1), ("a", 1), ("b", 1)))
      // Aggregate the values of each key with x + y
      val reduceRDD = rdd.reduceByKey((x, y) => x + y)
      reduceRDD.collect().foreach(println)
      sc.stop()
      /**
       * Output:
       * (a,3)
       * (b,1)
       */
    }
  }
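
The pairwise aggregation can be illustrated without Spark. The sketch below uses plain Scala collections only; the object name, the helper reduceInTwoSteps, and the way the values are split into two partitions are assumptions made for illustration. It shows that x + y gives the same result whether the values are reduced in one pass or first inside each partition and then across partitions, while a non-associative function such as x - y does not. This is why the function passed to reduceByKey should be associative and commutative.

```scala
object PairwiseReduceSketch {
  // Values of key "a" in the sample data.
  val values: Seq[Int] = Seq(1, 1, 1)

  // Hypothetical split of the same values across two partitions.
  val split: Seq[Seq[Int]] = Seq(Seq(1), Seq(1, 1))

  // Reduce inside each partition first, then reduce the partial results.
  def reduceInTwoSteps(f: (Int, Int) => Int): Int =
    split.map(_.reduce(f)).reduce(f)

  // Addition: both routes agree.
  val sumOnePass: Int = values.reduce((x, y) => x + y) // (1 + 1) + 1
  val sumTwoSteps: Int = reduceInTwoSteps(_ + _)       // 1 + (1 + 1)

  // Subtraction is not associative: the two routes disagree.
  val diffOnePass: Int = values.reduce(_ - _)          // (1 - 1) - 1
  val diffTwoSteps: Int = reduceInTwoSteps(_ - _)      // 1 - (1 - 1)

  def main(args: Array[String]): Unit = {
    println(s"sum:  one pass = $sumOnePass, two steps = $sumTwoSteps")
    println(s"diff: one pass = $diffOnePass, two steps = $diffTwoSteps")
  }
}
```

With addition, the result does not depend on how the data happens to be partitioned, which is what makes it safe to aggregate in stages.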

2. Diagrams: how the two implementations differ

To make the shuffle easier to show, assume the data is spread across two partitions.

1) WordCount with groupByKey

Reading the diagram:

1. The red RDD is the data source; it holds (word, 1) records in two partitions.

2. The shuffle (which, as we know, requires disk I/O).

3. The RDD after groupByKey: the values are gathered into groups by key.

4. A map operation computes the word counts.

Summary: groupByKey scatters and regroups the data, so it involves a shuffle.
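
The four steps can be mimicked with plain Scala collections. This is only a rough model, not how Spark implements the shuffle: the object name and the split of the sample data into two partitions are assumptions, and "shuffled" simply means the records that would have to cross the shuffle boundary.

```scala
object GroupByKeyModel {
  type Record = (String, Int)

  // Step 1: the data source, assumed to be split across two partitions.
  val partitions: Seq[Seq[Record]] = Seq(
    Seq(("a", 1), ("a", 1)),
    Seq(("a", 1), ("b", 1))
  )

  // Step 2: nothing is combined beforehand, so every record is shuffled.
  val shuffled: Seq[Record] = partitions.flatten

  // Step 3: group the values by key.
  val grouped: Map[String, Seq[Int]] =
    shuffled.groupBy(_._1).map { case (word, recs) => (word, recs.map(_._2)) }

  // Step 4: a map step turns each group into a count.
  val counts: Map[String, Int] =
    grouped.map { case (word, ones) => (word, ones.size) }

  def main(args: Array[String]): Unit = {
    println(s"records shuffled: ${shuffled.size}")
    println(s"counts: $counts")
  }
}
```

In this model all four input records take part in the shuffle, and the counting happens only afterwards.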

2) WordCount with reduceByKey (simplified process)

Reading the diagram:

1. The red RDD is the data source; it holds (word, 1) records in two partitions.

2. The shuffle.

3. The result RDD, obtained by aggregating the values pairwise with the supplied function.

At this point, groupByKey and reduceByKey seem to implement WordCount in much the same way. In terms of performance, both involve a shuffle, so there appears to be little difference. In terms of functionality, both group the data; reduceByKey also aggregates, while groupByKey does not and relies on an extra map operation for the aggregation. So the two do not look very different.

So what is the core difference between them?

3) WordCount with reduceByKey (complete process)

To restate what reduceByKey does: it aggregates the values of each key pairwise.

Consider this: in the diagram for 2), a single partition of the red RDD contains records with the same key, and their values could already be aggregated. groupByKey does no aggregation, so with it all the data is grouped first and aggregated only afterwards. reduceByKey does aggregate, and the condition for aggregation (same key, values that can be combined) is already met before grouping. So does reduceByKey aggregate the data before grouping? (Yes. This is called pre-aggregation.)

The flow chart therefore becomes:

Reading the diagram:

1. The red RDD is the data source; it holds (word, 1) records in two partitions. The data in each partition is pre-aggregated before grouping.

2. The shuffle.

3. The result RDD, obtained by aggregating the values pairwise with the supplied function.

What has changed?

1. The data is pre-aggregated before grouping, so fewer records take part in the grouping, that is, in the shuffle.

2. Because less data is shuffled, the shuffle performs less disk I/O.

3. The aggregation after the shuffle takes less time and fewer computation steps.
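
The effect of pre-aggregation on the number of shuffled records can be modeled with plain Scala collections. As before, this is a sketch under assumptions rather than Spark's actual implementation: the object name, the helper combine, and the two-partition layout of the sample data are all hypothetical.

```scala
object ReduceByKeyModel {
  type Record = (String, Int)

  // The data source, assumed to be split across two partitions.
  val partitions: Seq[Seq[Record]] = Seq(
    Seq(("a", 1), ("a", 1)),
    Seq(("a", 1), ("b", 1))
  )

  // Merge the values of each key with the given function.
  def combine(records: Seq[Record], f: (Int, Int) => Int): Seq[Record] =
    records
      .groupBy(_._1)
      .map { case (word, recs) => (word, recs.map(_._2).reduce(f)) }
      .toSeq

  // Step 1: pre-aggregate inside each partition, before the shuffle.
  val preAggregated: Seq[Seq[Record]] = partitions.map(p => combine(p, _ + _))

  // Step 2: only the pre-aggregated records are shuffled.
  val shuffled: Seq[Record] = preAggregated.flatten

  // Step 3: merge the partial results of each key after the shuffle.
  val counts: Map[String, Int] = combine(shuffled, _ + _).toMap

  def main(args: Array[String]): Unit = {
    println(s"records before pre-aggregation: ${partitions.flatten.size}")
    println(s"records shuffled: ${shuffled.size}")
    println(s"counts: $counts")
  }
}
```

In this small model, three records are shuffled instead of four, because the two ("a", 1) records in the first partition are merged into one ("a", 2) record beforehand. The saving grows with the number of repeated keys per partition.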

From this we can conclude:

reduceByKey supports pre-aggregation within each partition, which effectively reduces the amount of data the shuffle writes to disk and improves shuffle performance.

Copyright notice
This article was written by [TRX1024]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/02/202202280529582666.html