
Overview of Spark RDD

2022-07-06 02:04:00 【Diligent ls】

One. What is an RDD

RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark.

In code it is an abstract class. It represents a resilient, immutable, partitionable collection of elements that can be computed in parallel.

1. Resilient:

Storage resilience: data switches automatically between memory and disk

Fault-tolerance resilience: lost data can be recovered automatically

Computation resilience: failed computations are retried

Partitioning resilience: data can be repartitioned as needed

2. Distributed

Data is stored on different nodes of the big data cluster

3. A dataset that does not store data

An RDD encapsulates computation logic; it does not store the data itself

4. Data abstraction

RDD is an abstract class; concrete subclasses provide the implementation

5. Immutable

An RDD encapsulates computation logic and cannot be changed. To change anything, you must produce a new RDD, and the new RDD encapsulates the new computation logic

6. Partitionable, computed in parallel

Note: the logic inside RDD operators executes on the Executor side; code outside RDD operators executes on the Driver side.

In Spark, the computation defined on an RDD runs only when an action operator is encountered. This is lazy evaluation.
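Immutability and lazy evaluation can be illustrated with a small sketch in plain Python. This is a toy model, not Spark's API: the class `ToyRDD`, its methods, and the helper `double` are hypothetical names made up for illustration. It only shows the idea that a transformation returns a new object wrapping the parent's logic, and that nothing is computed until an action is called.

```python
from typing import Callable, Iterable, List


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical): stores logic, not results."""

    def __init__(self, compute: Callable[[], Iterable[int]]):
        self._compute = compute  # the encapsulated logic; never mutated

    def map(self, f: Callable[[int], int]) -> "ToyRDD":
        # Transformation: returns a NEW ToyRDD built on the parent's logic.
        return ToyRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred: Callable[[int], bool]) -> "ToyRDD":
        # Transformation: again a new object, the parent is left untouched.
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    def collect(self) -> List[int]:
        # Action: only here is the chain of logic actually executed.
        return list(self._compute())


calls: List[int] = []  # records each time the mapping function really runs


def double(x: int) -> int:
    calls.append(x)
    return x * 2


source = ToyRDD(lambda: [1, 2, 3, 4])
doubled = source.map(double)            # nothing is computed yet
big = doubled.filter(lambda x: x > 4)   # still nothing is computed
calls_before_action = len(calls)        # 0: double() has not run

result = big.collect()                  # the action triggers the computation
print(calls_before_action, result)
```

In this sketch `source` is never modified: `map` and `filter` each hand back a new `ToyRDD`, which mirrors the "can only produce a new RDD" point above.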

Two. The five characteristics of an RDD

1）A list of partitions
        An RDD is made up of many partitions. When Spark runs a computation, each partition is processed by one task, so there are as many tasks as partitions
2）A function for computing each split
        Computing on an RDD amounts to computing on each of its splits (partitions)
3）A list of dependencies on other RDDs
        RDDs depend on one another, so the chain of dependencies can be traced back
4）Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
        If the data in the RDD is in key-value form, a custom Partitioner can be passed in to repartition it, for example by the hash value of the key
5）Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
        The best location for the computation, that is, data locality
        When computing a split, it is best to run the task on the machine where the split is stored, which avoids moving data. A split has multiple replicas, so there can be more than one preferred location
        Jobs should preferentially be scheduled on the machines where the data resides, which reduces data I/O and network transfer. This shortens the running time of the job (barrel principle: the running time of a job depends on the time required by its slowest task) and improves performance
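
Three of the points above (one task per partition, partitioning by the hash of the key, and the barrel principle) can be sketched together in plain Python. This is a toy model rather than Spark code: `hash_partition`, `task_time`, and the cost of one second per record are hypothetical choices made for illustration. Integer keys are used because Python hashes small integers to themselves, which keeps the result deterministic.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[int, str]


def hash_partition(pairs: List[Pair], num_partitions: int) -> Dict[int, List[Pair]]:
    """Place each (key, value) pair in partition hash(key) % num_partitions."""
    partitions: Dict[int, List[Pair]] = defaultdict(list)
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return dict(partitions)


def task_time(records: List[Pair], seconds_per_record: float = 1.0) -> float:
    """Hypothetical cost model: a task's time grows with its record count."""
    return len(records) * seconds_per_record


pairs = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"), (7, "f")]
partitions = hash_partition(pairs, num_partitions=2)

# One task per partition: the number of tasks equals the number of partitions.
task_times = {pid: task_time(records) for pid, records in partitions.items()}

# Barrel principle: tasks run in parallel, so the job finishes only when
# the slowest task finishes.
job_time = max(task_times.values())
print(partitions, task_times, job_time)
```

Here the odd keys all land in one partition, so that partition's task takes twice as long as the other and determines the job time. Skewed partitions like this are the case the barrel principle warns about.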

The description of the five characteristics is reproduced from (https://www.jianshu.com/p/650d6e33914b)
