Overview of Spark RDD
2022-07-06 02:04:00 【Diligent ls】
I. What is an RDD?
An RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark.
In code it is an abstract class. It represents a resilient, immutable, partitionable collection of elements that can be computed in parallel.
1. Resilient
Storage resilience: data switches automatically between memory and disk
Fault-tolerance resilience: lost data can be recovered automatically
Computation resilience: failed computations are retried
Partitioning resilience: data can be repartitioned as needed
2. Distributed
Data is stored on different nodes of the cluster
3. Dataset that does not store data
An RDD encapsulates computation logic; it does not hold the data itself
4. Data abstraction
RDD is an abstract class, so concrete subclasses are needed to implement it
5. Immutable
An RDD encapsulates computation logic and cannot be changed. To change it, you can only produce a new RDD that encapsulates the new computation logic
6. Partitionable, computed in parallel
Note: the logic inside RDD operators is executed on the Executor side; code outside the RDD operators is executed on the Driver side.
In Spark, the RDD computation runs only when an action operator is encountered. In other words, evaluation is lazy.
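The immutability and lazy-evaluation points above can be illustrated with a small toy model in plain Python. This is only a sketch, not Spark's implementation or API: `ToyRDD`, `from_list`, and the `calls` list are hypothetical names made up for illustration, and the `map`, `filter`, and `collect` methods merely mimic the shape of the Spark operators with the same names.

```python
from typing import Callable, Iterable, List


class ToyRDD:
    """Toy stand-in for an RDD: it holds a recipe for producing data, not the data."""

    def __init__(self, compute: Callable[[], Iterable]):
        self._compute = compute  # the encapsulated computation logic

    @staticmethod
    def from_list(items: List) -> "ToyRDD":
        snapshot = list(items)  # copy so later changes to `items` have no effect
        return ToyRDD(lambda: iter(snapshot))

    def map(self, f: Callable) -> "ToyRDD":
        # Transformation: returns a NEW ToyRDD; `self` is left unchanged.
        return ToyRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred: Callable) -> "ToyRDD":
        # Transformation: also lazy, also returns a new ToyRDD.
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    def collect(self) -> List:
        # Action: only here is the chain of computations actually run.
        return list(self._compute())


calls = []  # records every element that `double` really processes


def double(x):
    calls.append(x)
    return x * 2


base = ToyRDD.from_list([1, 2, 3, 4])
doubled = base.map(double)             # nothing is computed yet
big = doubled.filter(lambda x: x > 4)  # still nothing computed
calls_before_action = list(calls)      # snapshot taken before any action

result = big.collect()                 # the action triggers the whole chain
```

Before `collect()` is called, `double` has not processed a single element. Each transformation only wraps the previous recipe in a new object, which is why the original `base` stays usable and unchanged.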
II. The five properties of an RDD

1)A list of partitions
An RDD is made up of many partitions. When Spark runs a computation, there are as many tasks as there are partitions
2)A function for computing each split
Computing on an RDD amounts to computing on each of its splits (partitions)
3)A list of dependencies on other RDDs
RDDs depend on one another, so the lineage of an RDD can be traced back
4)Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
If the data in an RDD is in key-value form, you can pass in a custom Partitioner to repartition it, for example by the hash value of the key
5)Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
The best location for the computation, that is, data locality
When computing a split, it is best to run the task on the machine where the split resides, which avoids moving data. A split has multiple replicas, so there can be more than one preferred location
Jobs should preferably be scheduled on the machines where the data resides, which reduces data I/O and network transfer. This shortens the job's running time (bucket principle: the running time of a job is determined by the slowest task) and improves performance
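The five properties can be gathered into one toy structure to make them concrete. The sketch below is plain Python under stated assumptions, not Spark's classes: `ToyRDD`, `hash_partition`, `partition_by`, `run`, the host names, and the task durations are all hypothetical and exist only for illustration. Integer keys are used so that the hash values are predictable.

```python
from typing import Callable, Dict, List, Optional


def hash_partition(key: int, num_partitions: int) -> int:
    """Choose a partition index from the hash of the key."""
    return hash(key) % num_partitions


class ToyRDD:
    """Toy record holding the five properties (not Spark's RDD class)."""

    def __init__(
        self,
        partitions: List[list],                  # 1) a list of partitions
        compute: Callable[[list], list],         # 2) a function for computing each split
        dependencies: Optional[list] = None,     # 3) parent RDDs (lineage)
        partitioner: Optional[Callable] = None,  # 4) optional partitioner for key-value data
        preferred_locations: Optional[Dict[int, List[str]]] = None,  # 5) split index -> hosts
    ):
        self.partitions = partitions
        self.compute = compute
        self.dependencies = dependencies or []
        self.partitioner = partitioner
        self.preferred_locations = preferred_locations or {}

    def run(self) -> List[list]:
        # One "task" per partition: the compute function is applied to each split.
        return [self.compute(split) for split in self.partitions]

    def partition_by(self, num_partitions: int, partitioner: Callable) -> "ToyRDD":
        # Redistribute key-value pairs; the result records `self` as its parent.
        buckets: List[list] = [[] for _ in range(num_partitions)]
        for split in self.partitions:
            for key, value in split:
                buckets[partitioner(key, num_partitions)].append((key, value))
        return ToyRDD(buckets, self.compute, dependencies=[self], partitioner=partitioner)


pairs = ToyRDD(
    partitions=[[(0, "a"), (1, "b"), (2, "c")], [(3, "d"), (4, "e"), (5, "f")]],
    compute=lambda split: [(k, v.upper()) for k, v in split],
    # Hypothetical hosts: each split has replicas, hence several preferred locations.
    preferred_locations={0: ["host-a", "host-b"], 1: ["host-b", "host-c"]},
)

by_key = pairs.partition_by(3, hash_partition)
task_results = by_key.run()  # three partitions, so three "tasks"

# Bucket principle with made-up task durations: the slowest task sets the job time.
task_seconds = [1.2, 0.8, 3.5]
job_seconds = max(task_seconds)
```

After repartitioning, keys 0 and 3 share a partition, as do 1 and 4, and 2 and 5, because each pair has the same hash modulo 3. The new object keeps a reference to its parent, which is the toy equivalent of the dependency list that makes lineage traceable.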
This description of the five properties is reproduced from https://www.jianshu.com/p/650d6e33914b