
Overview of Spark RDD

2022-07-06 02:04:00 Diligent ls

One. What is an RDD

        RDD (Resilient Distributed Dataset), the resilient distributed dataset, is the most basic data abstraction in Spark.

         In code it is an abstract class that represents an elastic, immutable, partitionable collection of elements that can be computed in parallel.
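         For concreteness, here is a minimal Scala sketch of building RDDs; the application name, the local[*] master, and the file path are illustrative and not from the original post:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative local setup (assumed for all sketches in this article)
val conf = new SparkConf().setAppName("RDDOverview").setMaster("local[*]")
val sc = new SparkContext(conf)

// An RDD built from an in-memory collection; nothing is computed yet
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// An RDD built from a text file (the path is a placeholder)
val lines = sc.textFile("data/input.txt")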

1. Elastic

         Storage elasticity: automatic switching between memory and disk

         Fault-tolerance elasticity: lost data can be recovered automatically

         Computation elasticity: failed computations are retried

        Partitioning elasticity: the data can be re-partitioned as needed (see the sketch after this list)

2. Distributed

        Data is stored on different nodes of the big data cluster

3. A dataset that does not store data

        An RDD encapsulates computation logic; it does not store the data itself

4. Data abstraction

        RDD is an abstract class; the concrete implementation is provided by its subclasses

5. Immutable

        An RDD encapsulates computation logic and cannot be changed; to change the computation you can only produce a new RDD that encapsulates the new logic, as illustrated in the sketch after this list

6. Partitionable, computed in parallel
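A sketch of the elasticity and immutability points above, reusing the illustrative numbers RDD from the earlier example (the storage level and partition count are just examples):

import org.apache.spark.storage.StorageLevel

// Storage elasticity: keep the RDD in memory and spill to disk when memory is short
val cached = numbers.persist(StorageLevel.MEMORY_AND_DISK)

// Immutability: map does not modify `numbers`; it returns a new RDD whose lineage
// records the transformation, which is also what allows lost partitions to be
// recomputed (fault-tolerance elasticity)
val doubled = cached.map(_ * 2)

// Partitioning elasticity: re-partition as needed, again producing a new RDD
val reshaped = doubled.repartition(8)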

Note: the logic inside RDD operators (the function bodies passed to them) is executed on the Executor side, while code outside RDD operators is executed on the Driver side.

        In Spark, RDD computation is only triggered when an action operator (such as collect or count) is encountered; transformations are evaluated lazily.
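        A small sketch of lazy evaluation and of the Driver/Executor split described in the note (variable names are illustrative):

// Runs on the Driver: this only builds the lineage, no data is touched yet
val evensRDD = numbers.filter { n =>
  // the body of this closure is shipped to and executed on the Executors
  n % 2 == 0
}

// Still nothing has executed; only an action triggers the computation
val evens = evensRDD.collect()   // action: tasks are scheduled and run on Executors
println(evens.mkString(", "))    // runs back on the Driver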

Two. The five main properties of an RDD

1) A list of partitions
        An RDD is made up of many partitions; in Spark, each partition is computed by one task, so the number of partitions determines the number of tasks (see the sketch after this list)
2) A function for computing each split
        Computing an RDD amounts to applying a compute function to each of its splits (partitions)
3) A list of dependencies on other RDDs
        An RDD records its dependencies on other RDDs, so its lineage can be traced back
4) Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
        If the RDD holds key-value data, it can be repartitioned with a custom Partitioner, for example by the hash value of the key
5) Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
        The preferred locations for computation, i.e. data locality
        When computing a split, it is best to run the task on the machine where that split is stored, to avoid moving data; since a split has multiple replicas, there can be more than one preferred location
        Wherever the data is, jobs should preferentially be scheduled onto the machines holding that data, reducing data IO and network transfer; this shortens the job's running time (the bucket principle: a job's running time is determined by its slowest task) and improves performance
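A hedged sketch of how some of these five properties surface in the public RDD API, reusing the earlier illustrative RDDs (the key-value data and partition counts are made up):

import org.apache.spark.HashPartitioner

// Property 1: the list of partitions; each partition is computed by one task
println(numbers.getNumPartitions)

// Property 3: the dependencies on other RDDs, visible as a lineage debug string
println(doubled.toDebugString)

// Property 4: an optional Partitioner for key-value RDDs, e.g. hashing by key
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val byKey = pairs.partitionBy(new HashPartitioner(4))

// Property 5: preferred locations for a partition (empty for an in-memory
// collection, HDFS block locations for a file-based RDD)
println(numbers.preferredLocations(numbers.partitions(0)))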

The introduction to the five properties above is reproduced from https://www.jianshu.com/p/650d6e33914b

 


Copyright notice
This article was created by [Diligent ls]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202140042491058.html