Overview of Spark RDD
2022-07-06 02:04:00 【Diligent ls】
I. What is an RDD?
An RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark.
In code it is an abstract class that represents a resilient, immutable, partitionable collection of elements that can be computed in parallel.
1. Resilient
Resilient storage: data is switched automatically between memory and disk
Resilient fault tolerance: lost data can be recovered automatically
Resilient computation: failed computations are retried
Resilient partitioning: the data can be repartitioned as needed
2. Distributed
The data is stored on different nodes of the big data cluster
3. A dataset that does not store data
An RDD encapsulates computation logic; it does not hold the data itself
4. Data abstraction
RDD is an abstract class, so it has to be implemented by concrete subclasses
5. Immutable
An RDD encapsulates computation logic and cannot be changed. To change it, you must produce a new RDD that encapsulates the new computation logic
6. Partitionable, computed in parallel
Note: the code inside RDD operators is executed on the Executor side; code outside RDD operators is executed on the Driver side.
In Spark, an RDD computation is executed only when an action operator is encountered. In other words, evaluation is lazy.
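The immutability and lazy evaluation described above can be sketched with a small pure-Python toy model. This is not Spark's API: `ToyRDD`, its methods, and the `double` helper are hypothetical names made up for illustration. The sketch only shows that a transformation returns a new object wrapping the previous logic, and that nothing runs until an action is called.

```python
from typing import Callable, Iterable, List


class ToyRDD:
    """Hypothetical stand-in for an RDD: it holds computation logic, not results."""

    def __init__(self, partitions: List[list], compute: Callable[[list], Iterable]):
        self._partitions = partitions  # source data, already split into partitions
        self._compute = compute        # logic applied to a single partition

    def map(self, f: Callable) -> "ToyRDD":
        # Transformation: build a NEW ToyRDD around the old logic; nothing runs yet.
        parent = self._compute
        return ToyRDD(self._partitions, lambda part: (f(x) for x in parent(part)))

    def filter(self, pred: Callable) -> "ToyRDD":
        # Transformation: same idea, the original object is left untouched.
        parent = self._compute
        return ToyRDD(self._partitions, lambda part: (x for x in parent(part) if pred(x)))

    def collect(self) -> list:
        # Action: only now is the chained logic executed, one partition at a time.
        out = []
        for part in self._partitions:
            out.extend(self._compute(part))
        return out


calls = []  # records every element that `double` actually processes


def double(x):
    calls.append(x)
    return x * 2


base = ToyRDD([[1, 2], [3, 4]], lambda part: iter(part))
doubled = base.map(double)               # transformation only
big = doubled.filter(lambda x: x > 4)    # transformation only
calls_before_action = len(calls)         # still 0: nothing has been computed
result = big.collect()                   # action triggers the whole chain
```

After `collect()` the `calls` list is no longer empty, while `base` still yields its original elements, which mirrors the point that changing an RDD means producing a new one.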
II. The five characteristics of an RDD
1) A list of partitions
An RDD is made up of many partitions. When Spark runs a computation, each partition is processed by one task, so the number of partitions determines the number of tasks
2) A function for computing each split
Computing an RDD amounts to computing each of its splits (partitions)
3) A list of dependencies on other RDDs
RDDs depend on one another, so an RDD can be traced back to the RDDs it was derived from
4) Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
If the data in an RDD is in key-value form, a custom Partitioner can be passed in to repartition it, for example by the hash value of the key
5) Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
The preferred location for the computation, that is, data locality
When computing a split, it is best to run the task on the machine where that split is stored, which avoids moving data. A split may have multiple replicas, so it may have more than one preferred location
Tasks should preferentially be scheduled on the machine where the data resides. This reduces data I/O and network transfer, which shortens the running time of the job (bucket principle: the running time of a job is determined by its slowest task) and improves performance
The description of the five characteristics is reproduced from https://www.jianshu.com/p/650d6e33914b
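As a rough illustration of the five characteristics, the pure-Python sketch below puts them side by side as the fields of one record. It is a toy model rather than Spark code: `ToyRDDDescriptor`, `hash_partition`, the `node-*` host names and the task durations are all assumptions invented for the example, and CRC32 merely stands in for whatever hash function a real partitioner would use.

```python
import zlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


def hash_partition(key: str, num_partitions: int) -> int:
    """Toy hash partitioner: a given key always maps to the same partition index."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


@dataclass
class ToyRDDDescriptor:
    partitions: List[list]                                    # 1) list of partitions
    compute: Callable[[list], list]                           # 2) function for computing each split
    dependencies: List["ToyRDDDescriptor"] = field(default_factory=list)  # 3) parent RDDs
    partitioner: Optional[Callable[[str, int], int]] = None   # 4) optional, key-value data only
    preferred_locations: Dict[int, List[str]] = field(default_factory=dict)  # 5) split -> hosts

    def lineage_depth(self) -> int:
        # Walk the dependencies back to the source to show that lineage is traceable.
        return 1 + max((d.lineage_depth() for d in self.dependencies), default=0)


# Spread key-value pairs over three partitions by the hash of the key.
pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
num_partitions = 3
buckets: List[list] = [[] for _ in range(num_partitions)]
for key, value in pairs:
    buckets[hash_partition(key, num_partitions)].append((key, value))

source = ToyRDDDescriptor(
    partitions=buckets,
    compute=lambda part: part,
    partitioner=hash_partition,
    # A split with two replicas has two preferred hosts (made-up host names).
    preferred_locations={0: ["node-1", "node-2"], 1: ["node-2"], 2: ["node-3"]},
)
scaled = ToyRDDDescriptor(
    partitions=buckets,
    compute=lambda part: [(k, v * 10) for k, v in part],
    dependencies=[source],
    partitioner=hash_partition,
)

# One "task" per partition: the compute function is applied to each split independently.
results = [scaled.compute(part) for part in scaled.partitions]

# Bucket principle: with made-up task durations, the slowest task sets the job time.
task_seconds = [3.0, 5.5, 4.0]
job_seconds = max(task_seconds)
```

Because the partition index depends only on the key, both `"a"` records end up in the same bucket, which is the property a hash partitioner is meant to provide for key-based operations.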