当前位置:网站首页>About the transmission pipeline of stage in spark

About the transmission pipeline of stage in spark

2022-07-01 04:31:00 Daily Xiaoxin

About Spark in Stage The transmission of Pipeline


First pipeline Pipeline computing mode ,pipeline It's just a computational idea , A pattern , Follow MR differ ,pipeline It is only when the logic is completely completed that the result will be landing ,MR Is to calculate the persistent disk , recompute , This is also MR And Spark The root cause of the speed gap ( Code implementation Stage Medium Pipeline)


object Pipeline {
    
  def main(args: Array[String]): Unit = {
    
    // Create connection 
    val conf = new SparkConf()
    conf.setMaster("local").setAppName("PPLine");
    val sc = new SparkContext(conf)
    // Set an array of two concurrent numbers (2 Partition number )
    val rdd = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8), 2)
  
    val rdd1: RDD[Int] = rdd.map(x => {
    
      println("rdd1----[pid" + TaskContext.get.partitionId + "]----" + x)
      x
    })
    val rdd2: RDD[Int] = rdd1.filter(x => {
    
      println("rdd2----[pid" + TaskContext.get.partitionId + "]----" + x)
      true
    })
    val rdd3: RDD[(String, Int)] = rdd2.map(x => {
    
      println("rdd3----[pid" + TaskContext.get.partitionId + "]----" + x)
      Tuple2("yjx" + x % 3, x)
    })

    // stay RDD4 To perform a partition operation in , stay RDD4 The operator is then divided into a single stage Used to calculate (2——>4 Meet wide dependence )
    val rdd4: RDD[(String, Int)] = rdd3.reduceByKey((sum: Int, value: Int) => {
    
      println("rdd4----[pid" + TaskContext.get.partitionId + "]----" + sum + "---" + value)
      sum + value
    }, 3)

    val rdd5: RDD[(String, Int)] = rdd4.map(x => {
    
      println("rdd5----[pid" + TaskContext.get.partitionId + "]----" + x)
      x
    })
    // Start operator 
    rdd5.count
    sc.stop()
  }
}

The above code shows that a set of data is in pipeline Execution order of , Divided into two stage, First, RDD1RDD4 A simple print filter between (`stage1`), Then the RDD4RDD5 Between a calculation print (stage2)


// The meaning is as follows :
  /* stage1( for(i=0;i<2;i++){ new Thread( textFile-->rdd1() rdd2() rdd3()-->ShuffleWrite ).start() } ) stage2( for(i=0;i<1;i++){ new Thread( shuffleRead--->rdd4() rdd5() ).start() } ) */


 Insert picture description here


 Insert picture description here

原网站

版权声明
本文为[Daily Xiaoxin]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/02/202202160253215447.html