RDD Partitioning Rules in Spark

2022-07-06 02:04:00 【Diligent ls】

1. Creating an RDD from a collection

a. Without specifying the number of partitions

When an RDD is created from a collection and the number of partitions is not set explicitly, the default number of partitions depends on the number of CPU cores configured for local mode:

local: 1 partition
local[*]: as many partitions as the machine has cores
local[K]: K partitions

b. Specifying the number of partitions

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object fenqu {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkCoreTest")
    val sc: SparkContext = new SparkContext(conf)

    // 1) 4 elements, 4 partitions. Output: partition 0 -> 1, partition 1 -> 2, partition 2 -> 3, partition 3 -> 4
    val rdd: RDD[Int] = sc.makeRDD(Array(1, 2, 3, 4), 4)

    // 2) 4 elements, 3 partitions. Output: partition 0 -> 1, partition 1 -> 2, partition 2 -> 3, 4
    //val rdd: RDD[Int] = sc.makeRDD(Array(1, 2, 3, 4), 3)

    // 3) 5 elements, 3 partitions. Output: partition 0 -> 1, partition 1 -> 2, 3, partition 2 -> 4, 5
    //val rdd: RDD[Int] = sc.makeRDD(Array(1, 2, 3, 4, 5), 3)

    // Writes one part file per partition into the "output" directory
    rdd.saveAsTextFile("output")

    sc.stop()
  }
}

The rule

Each partition takes the elements in the half-open index range [start, end), computed with integer division:

Start of partition = (partition index * total data length) / number of partitions

End of partition = ((partition index + 1) * total data length) / number of partitions
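The rule above can be sketched in plain Scala without a Spark dependency. This is a minimal illustration, not Spark's own source; the object name `SlicePositions` and the helper names `positions` and `slice` are made up for this example.

```scala
object SlicePositions {
  // Half-open index range [start, end) of every partition, using integer division.
  def positions(length: Long, numSlices: Int): Seq[(Int, Int)] =
    (0 until numSlices).map { i =>
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      (start, end)
    }

  // Cuts a collection into partitions according to those ranges.
  def slice[T](data: Seq[T], numSlices: Int): Seq[Seq[T]] =
    positions(data.length.toLong, numSlices).map { case (s, e) => data.slice(s, e) }

  def main(args: Array[String]): Unit = {
    // Case 2 from the listing: 4 elements, 3 partitions
    println(slice(Seq(1, 2, 3, 4), 3))
    // Case 3 from the listing: 5 elements, 3 partitions
    println(slice(Seq(1, 2, 3, 4, 5), 3))
  }
}
```

For 5 elements and 3 partitions the ranges come out as [0,1), [1,3), [3,5), which matches the output noted in case 3 of the listing: 1 / 2, 3 / 4, 5.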

2. Creating an RDD by reading a file

a. Default

The default minimum number of partitions is the smaller of the current number of cores and 2, so it is usually 2.

b. Specifying the number of partitions

1). How the number of partitions is calculated:

Example: a 10-byte file read with 3 requested partitions.

totalSize = 10 (bytes)

goalSize = 10 / 3 = 3 (bytes, integer division), meaning each partition should hold 3 bytes of data

Number of partitions = totalSize / goalSize = 10 / 3 => 3 partitions, which would hold 3, 3 and 4 bytes

The last 4 bytes are more than 1.1 times the 3-byte goal size. Under Hadoop's 1.1x split rule they cannot stay in one partition, so an additional partition is created. The result is 4 partitions of 3, 3, 3 and 1 bytes.
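The 10-byte example can be reproduced with a short plain-Scala sketch of the split loop. It is a simplification under stated assumptions: block size and minimum split size, which Hadoop also takes into account, are ignored, and the names `SplitSizes`, `splitSizes` and `SplitSlop` are chosen for this illustration only.

```scala
import scala.collection.mutable.ListBuffer

object SplitSizes {
  // The 1.1x tolerance described above.
  val SplitSlop = 1.1

  // Byte size of each split for a file of totalSize bytes and numSplits requested splits.
  // Assumption: block size and minimum split size are ignored.
  def splitSizes(totalSize: Long, numSplits: Int): List[Long] = {
    val goalSize = totalSize / math.max(numSplits, 1)
    require(goalSize > 0, "file must contain at least one byte per requested split")
    val sizes = ListBuffer.empty[Long]
    var remaining = totalSize
    // Keep cutting goalSize-byte splits while the remainder exceeds 1.1x goalSize.
    while (remaining.toDouble / goalSize > SplitSlop) {
      sizes += goalSize
      remaining -= goalSize
    }
    // Whatever is left becomes the final split.
    if (remaining != 0) sizes += remaining
    sizes.toList
  }

  def main(args: Array[String]): Unit =
    println(splitSizes(10L, 3))
}
```

With `totalSize = 10` and 3 requested splits the loop cuts three 3-byte splits and leaves a 1-byte remainder, giving the 3, 3, 3, 1 layout from the example.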

2). Spark reads files through Hadoop's file reader, so data is read one whole line at a time. A line is never cut in the middle by a byte boundary.

3). Read positions are calculated as byte offsets into the file.

4). Calculating the offset range of each partition

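File contents and byte offsets (each of the first three lines occupies 3 bytes: the digit plus its line ending):

Line    Byte offsets
1       0 1 2
2       3 4 5
3       6 7 8
4       9

Offset range and resulting data of each partition:

Partition    Offset range    Data
0            [0, 3]          1, 2
1            [3, 6]          3
2            [6, 9]          4
3            [9, 9]          (empty)

Partition 0 covers offset 3, which is the start of line 2, so it reads all of line 2. A line that has already been read by an earlier partition is not read again, which is why partition 3 ends up empty.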
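The assignment of lines to partitions can be simulated in plain Scala. This is a simplified model of the behavior described above, not Hadoop's actual reader: it assumes a line belongs to the partition whose offset range contains the line's first byte, and that a line starting exactly at a partition's lower bound has already been taken by the previous partition (except at offset 0). The names `LineAssignment`, `lineStarts` and `assign` are hypothetical.

```scala
object LineAssignment {
  // Start offset of each line, given each line's length in bytes (line ending included).
  def lineStarts(lineLengths: Seq[Int]): Seq[Int] =
    lineLengths.scanLeft(0)(_ + _).init

  // Lines read by each partition, given inclusive offset ranges (lo, hi).
  def assign(lines: Seq[String], lineLengths: Seq[Int], ranges: Seq[(Int, Int)]): Seq[Seq[String]] = {
    val starts = lineStarts(lineLengths)
    ranges.map { case (lo, hi) =>
      lines.zip(starts).collect {
        // A line starting at lo was already read by the previous partition, unless lo is 0.
        case (line, p) if (p > lo || (lo == 0 && p == 0)) && p <= hi => line
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq("1", "2", "3", "4")
    val lengths = Seq(3, 3, 3, 1) // bytes per line, as in the offset table
    val ranges = Seq((0, 3), (3, 6), (6, 9), (9, 9))
    println(assign(lines, lengths, ranges))
  }
}
```

Under these assumptions the four ranges yield 1, 2 / 3 / 4 / nothing, the same distribution as in the table.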

Copyright notice
This article was written by [Diligent ls]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202140042490956.html