Spark's action operator
2022-07-01 09:26:00 【Diligent ls】
Transformation operators are lazy: they are not executed immediately, but only when an action operator is encountered.
1.reduce()
Aggregation: the function f aggregates all elements of the RDD, first aggregating the data within each partition, then aggregating the partition results across partitions.

val listRDD: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4),2)
// Action operator
// reduce
// There is aggregation logic both within and between partitions; within a partition the elements are combined in order, with the first element as the starting value
// Across partitions the order is not fixed: whichever partition finishes first supplies the starting value
val i: Int = listRDD.reduce(_ - _)
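// Worked example (a sketch, assuming the 2-partition listRDD above):
// partition 0 holds (1, 2) -> 1 - 2 = -1; partition 1 holds (3, 4) -> 3 - 4 = -1
// merging the partition results with _ - _ gives -1 - (-1) = 0, whichever partition finishes first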
println(i)
2.collect()
Returns all elements of the dataset to the driver program as an Array.
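A minimal sketch, assuming the same listRDD defined in the reduce example above:

// collect pulls the data of all partitions back to the driver as an Array
val collected: Array[Int] = listRDD.collect()
println(collected.toList)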

3.count()
Returns the number of elements in the RDD.

// count returns the number of elements in the RDD
val l: Long = listRDD.count()
println(l)
4.first()
Returns the first element of the RDD.

// first always returns the first element, which comes from partition 0
val i: Int = listRDD.first()
println(i)
5.take()
Returns an array of the first n elements of the RDD.

// take fetches data starting from partition 0 and only reads as many partitions as needed
val ints: Array[Int] = listRDD.take(2)
println(ints.toList)
6.takeOrdered()
Returns an array of the first n elements of the RDD after sorting.

// takeOrdered
// First sorts the data in the rdd, then takes the first n elements
// To sort in descending order, pass the implicit Ordering parameter explicitly
val array: Array[Int] = listRDD.takeOrdered(3)
val array1: Array[Int] = listRDD.takeOrdered(3)(Ordering[Int].reverse)
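// Expected result (assuming listRDD holds 1, 2, 3, 4): array = (1, 2, 3), array1 = (4, 3, 2)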
println(array.toList)
// Take the data first, then sort the fetched elements locally (unlike takeOrdered, which sorts the whole RDD before taking)
val sorted: Array[Int] = listRDD.take(3).sorted
println(sorted.toList)
7.aggregate()
Aggregates the elements within each partition using the intra-partition function and the initial value, then aggregates the partition results using the inter-partition function, applying the initial value again.

val listRDD: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4),4)
val i: Int = listRDD.aggregate(10)(_ - _, _ - _)
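// Worked example (a sketch, assuming the 4-partition listRDD above, one element per partition):
// within partitions: 10 - 1 = 9, 10 - 2 = 8, 10 - 3 = 7, 10 - 4 = 6
// across partitions: 10 - 9 - 8 - 7 - 6 = -20 (the order of the partition results does not change the outcome here,
// because every result is subtracted from the running accumulator)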
println(i)
8.fold()
A simplified version of aggregate: the intra-partition and inter-partition logic use the same function.

// fold follows the same computation pattern as aggregate; the initial value is applied within each partition and again when merging the partition results
// The intra-partition and inter-partition logic are the same function
val j: Int = listRDD.fold(10)(_ + _)
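// Worked example (a sketch, assuming the 4-partition listRDD above):
// within partitions: 10 + 1, 10 + 2, 10 + 3, 10 + 4; merging: 10 + 11 + 12 + 13 + 14 = 60, so j = 60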
val f: Int = listRDD.fold(10)(_ - _)
println(j)
println(f)
9.countByKey()
Counts the number of occurrences of each key.

val value: RDD[(String, Int)] = sc.makeRDD(
List(("a", 10), ("b", 7), ("a", 11), ("b", 21)), 4)
val stringToLong: collection.Map[String, Long] = value.countByKey()
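// Expected result (for the pairs above): Map(a -> 2, b -> 2) -- countByKey counts key occurrences and ignores the values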
println(stringToLong)
10.save
1.saveAsTextFile(path): save as a text file
Saves the elements of the dataset as a text file to HDFS or another supported file system. For each element, Spark calls its toString method and writes the result as a line of text in the file.
2.saveAsSequenceFile(path): save as a SequenceFile
Saves the elements of the dataset in Hadoop SequenceFile format to the specified directory on HDFS or another Hadoop-supported file system. Note: only key-value (pair) RDDs support this operation.
3.saveAsObjectFile(path): save serialized objects to a file
Serializes the elements of the RDD and stores them in a file.
// Save as a plain text file; it can be read back directly
// Each save requires its own output directory that does not already exist
value.saveAsTextFile("output1")
// saveAsSequenceFile can only be used on key-value (pair) RDDs
value.saveAsSequenceFile("output2")
value.saveAsObjectFile("output3")
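A minimal sketch of reading the saved data back, assuming the three output directories used above:

// text files are read back as an RDD of lines
val textBack: RDD[String] = sc.textFile("output1")
// SequenceFiles are read back as a pair RDD; the key and value types must be supplied
val seqBack: RDD[(String, Int)] = sc.sequenceFile[String, Int]("output2")
// object files are deserialized back into the original element type
val objBack: RDD[(String, Int)] = sc.objectFile[(String, Int)]("output3")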
11.foreach()
Iterates over each element of the RDD.

val intRDD: RDD[Int] = sc.makeRDD(1 to 20, 5)
// collect first, then print on the driver side
val ints: Array[Int] = intRDD.collect()
ints.foreach(println)
// Use foreach to print directly
// Printing happens on the executor side; multiple tasks print concurrently, so the overall order is not guaranteed, but the order within each partition is preserved
intRDD.foreach(println)