
RDD partition rules in Spark

2022-07-06 02:04:00 Diligent ls

1. Creating an RDD from a collection

a. Without specifying the number of partitions

         When an RDD is created from a collection and the number of partitions is not given explicitly, the default number of partitions depends on the CPU cores available to the local-mode master:

         local : 1 partition    local[*] : one per core on the machine    local[K] : K partitions
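
A quick way to see this is to check getNumPartitions; a minimal sketch (the object name is illustrative, and the printed value depends on the machine it runs on):

import org.apache.spark.{SparkConf, SparkContext}

object DefaultPartitions {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("DefaultPartitions")
    val sc = new SparkContext(conf)

    // No numSlices argument: the partition count falls back to
    // sc.defaultParallelism, which in local[*] mode equals the core count.
    val rdd = sc.makeRDD(Seq(1, 2, 3, 4))
    println(s"default partitions = ${rdd.getNumPartitions}")

    sc.stop()
  }
}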

b. Specifying the number of partitions

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object fenqu {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkCoreTest")
    val sc: SparkContext = new SparkContext(conf)

    // 1) 4 elements, 4 partitions. Output: partition 0 -> 1, partition 1 -> 2, partition 2 -> 3, partition 3 -> 4
    val rdd: RDD[Int] = sc.makeRDD(Array(1, 2, 3, 4), 4)

    // 2) 4 elements, 3 partitions. Output: partition 0 -> 1, partition 1 -> 2, partition 2 -> 3, 4
    //val rdd: RDD[Int] = sc.makeRDD(Array(1, 2, 3, 4), 3)

    // 3) 5 elements, 3 partitions. Output: partition 0 -> 1, partition 1 -> 2, 3, partition 2 -> 4, 5
    //val rdd: RDD[Int] = sc.makeRDD(Array(1, 2, 3, 4, 5), 3)

    rdd.saveAsTextFile("output")

    sc.stop()
  }
}
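
Because saveAsTextFile writes one part file per partition (part-00000, part-00001, ...), the distributions in the comments can be verified by opening the files in the output directory.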

The rules

         start of partition i = (i * total data length) / number of partitions

         end of partition i (exclusive) = ((i + 1) * total data length) / number of partitions

         Both divisions are integer divisions, and partition i holds the elements whose indices fall in [start, end).
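
This mirrors the slicing arithmetic Spark applies internally in ParallelCollectionRDD.slice; a minimal standalone sketch (object and method names here are illustrative):

object SliceRule {
  // Computes the [start, end) index range for each partition,
  // using the rule above (integer division throughout).
  def positions(length: Int, numSlices: Int): Seq[(Int, Int)] =
    (0 until numSlices).map { i =>
      val start = (i * length) / numSlices
      val end = ((i + 1) * length) / numSlices
      (start, end)
    }

  def main(args: Array[String]): Unit = {
    // 5 elements into 3 partitions: (0,1), (1,3), (3,5),
    // i.e. partition 0 -> 1, partition 1 -> 2, 3, partition 2 -> 4, 5
    println(positions(5, 3))
  }
}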

2. Creating an RDD by reading a file

a. Default

         The default minimum number of partitions is the smaller of the current number of cores and 2, so it is usually 2.
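
In Spark's source this is SparkContext.defaultMinPartitions = math.min(defaultParallelism, 2), which textFile uses when no minPartitions argument is passed. A minimal check (the object name and file path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object FileDefaultPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("FileDefaultPartitions"))

    // min(cores, 2), usually 2
    println(sc.defaultMinPartitions)

    // textFile falls back to defaultMinPartitions when minPartitions
    // is omitted; "input/data.txt" is a placeholder path.
    val lines = sc.textFile("input/data.txt")
    println(lines.getNumPartitions)

    sc.stop()
  }
}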

b. Specifying the number of partitions

1). How the number of partitions is calculated (example: a 10-byte file read with minPartitions = 3):

totalSize = 10

goalSize = totalSize / minPartitions = 10 / 3 = 3 (bytes), meaning each split should hold about 3 bytes of data

Number of splits = totalSize / goalSize = 10 / 3 = 3 splits of 3 bytes, with 1 byte left over

Folding the leftover byte into the last split would make it 4 bytes, and 4 is more than 1.1 times the 3-byte goalSize. Under Hadoop's 1.1x split strategy an extra split is therefore created, giving 4 splits of 3, 3, 3 and 1 bytes (see the sketch below).
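
A minimal sketch of Hadoop FileInputFormat's split loop with its 1.1x slop factor (simplified: block size and minSize are ignored here, so the split size is just goalSize):

object SplitCalc {
  // Simplified version of Hadoop FileInputFormat.getSplits:
  // keep cutting goalSize-byte splits while the remainder is more
  // than 1.1 times the split size, then emit the remainder as-is.
  val SplitSlop = 1.1

  def splitSizes(totalSize: Long, numSplits: Int): Seq[Long] = {
    val goalSize = math.max(totalSize / numSplits, 1L)
    val splits = scala.collection.mutable.ArrayBuffer.empty[Long]
    var remaining = totalSize
    while (remaining.toDouble / goalSize > SplitSlop) {
      splits += goalSize
      remaining -= goalSize
    }
    if (remaining > 0) splits += remaining
    splits.toSeq
  }

  def main(args: Array[String]): Unit = {
    // 10 bytes requested as 3 splits -> List(3, 3, 3, 1)
    println(splitSizes(10, 3))
  }
}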

2). Spark reads files through Hadoop's input format, so data is read line by line; how many lines a split actually yields is therefore not tied strictly to its byte count.

3). The position where each split starts reading is calculated in byte offsets.

4). How each split's offset range maps to the data it reads

The example file stores the four lines "1", "2", "3", "4" in 10 bytes; each of the first three lines occupies 3 bytes including its \r\n terminator:

Split   Offset range   Line   Byte offsets   Data read
0       [0, 3]         1      0 1 2          1, 2
1       [3, 6]         2      3 4 5          3
2       [6, 9]         3      6 7 8          4
3       [9, 9]         4      9              (nothing)

A split reads whole lines until it passes its end offset, and skips any line already consumed by the previous split. Split 0's range [0, 3] ends inside line 2, so it reads lines 1 and 2 completely; split 1 starts inside the already-consumed line 2 and reads only line 3; split 2 likewise reads only line 4; split 3 finds nothing left.
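
The mapping above can be reproduced with a small simulation of this whole-line rule (a behavioural sketch, not Hadoop's actual LineRecordReader; the file contents and split ranges are the ones from the table):

object LineSplitSim {
  // Simulates how a Hadoop-style reader assigns whole lines to splits:
  // a split owns every line whose first byte falls in (start, end],
  // except that the split starting at 0 also owns the line at offset 0.
  def linesForSplits(bytes: String, splits: Seq[(Int, Int)]): Seq[Seq[String]] = {
    // Pair each line with the byte offset of its first character,
    // counting 2 bytes for each \r\n terminator.
    val lineStarts: Seq[(String, Int)] = {
      var offset = 0
      bytes.split("\r\n", -1).map { line =>
        val entry = (line, offset); offset += line.length + 2; entry
      }.toSeq
    }
    splits.map { case (start, end) =>
      lineStarts.collect {
        case (line, off) if (off > start || (start == 0 && off == 0)) && off <= end => line
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val file = "1\r\n2\r\n3\r\n4"                    // the 10-byte example file
    val splits = Seq((0, 3), (3, 6), (6, 9), (9, 9))
    // Prints: List(List(1, 2), List(3), List(4), List())
    println(linesForSplits(file, splits))
  }
}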


Copyright notice
This article was written by [Diligent ls]. When reposting, please include a link to the original. Thank you.
https://yzsam.com/2022/02/202202140042490956.html