Overview of Spark RDD
2022-07-06 02:04:00 【Diligent ls】
I. What is an RDD
An RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark.
In code it is an abstract class. It represents a resilient, immutable, partitionable collection of elements that can be computed in parallel.
1. Resilient
Storage resilience: data switches automatically between memory and disk
Fault-tolerance resilience: lost data can be recovered automatically
Computation resilience: a failed computation is retried
Partitioning resilience: data can be repartitioned as needed
2. Distributed
Data is stored on different nodes of the big data cluster
3. A dataset that does not store data
An RDD encapsulates computation logic; it does not hold the data itself
4. Data abstraction
RDD is an abstract class, so concrete subclasses are needed to implement it
5. Immutable
An RDD encapsulates computation logic and cannot be changed. To change it, you can only create a new RDD that encapsulates the new computation logic
6. Partitionable, computed in parallel
Note: the logic inside RDD operators is executed on the Executor side; code outside the operators is executed on the Driver side.
In Spark, an RDD computation is executed only when an action operator is encountered. This is lazy evaluation.
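Immutability and lazy evaluation can be illustrated with a small pure-Python sketch. It does not use Spark: MiniRDD and its map, filter and collect methods are hypothetical names that only imitate the behaviour described above, under the assumption that a transformation merely wraps the previous computation and an action runs the whole chain.

```python
from typing import Callable, Iterable, List


class MiniRDD:
    """Toy stand-in for an RDD (hypothetical class, not Spark's API)."""

    def __init__(self, source: Callable[[], Iterable], log: List[str]):
        self._source = source  # computation logic only, no stored data
        self._log = log        # records when work actually happens

    def map(self, f: Callable) -> "MiniRDD":
        # Transformation: returns a NEW MiniRDD and leaves self unchanged.
        def compute():
            for x in self._source():
                self._log.append(f"map({x})")
                yield f(x)
        return MiniRDD(compute, self._log)

    def filter(self, pred: Callable) -> "MiniRDD":
        # Transformation: also lazy, nothing runs here.
        def compute():
            for x in self._source():
                if pred(x):
                    yield x
        return MiniRDD(compute, self._log)

    def collect(self) -> list:
        # Action: only here is the chain of computations actually run.
        return list(self._source())


log: List[str] = []
numbers = MiniRDD(lambda: iter([1, 2, 3, 4]), log)
doubled = numbers.map(lambda x: x * 2)   # no work yet
big = doubled.filter(lambda x: x > 4)    # still no work
work_before_action = list(log)           # snapshot: nothing has been computed
result = big.collect()                   # the action triggers the computation
```

In this sketch the log stays empty until collect is called, and numbers still yields its original elements afterwards, because every transformation produced a new object instead of modifying the old one.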
II. The five properties of an RDD
1) A list of partitions
An RDD consists of many partitions. In Spark, a computation runs with one task per partition, so the number of partitions determines the number of tasks.
2) A function for computing each split
Computing an RDD means computing each of its splits, that is, each partition.
3) A list of dependencies on other RDDs
RDDs depend on one another, so the lineage of an RDD can be traced back.
4) Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
If the data in an RDD is in key-value form, a custom Partitioner can be supplied to repartition it, for example by the hash value of the key.
5) Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
These are the best locations for the computation, that is, data locality.
When a split is computed, it is best to run the task on the machine where the split is stored, which avoids moving data. A split has multiple replicas, so there can be more than one preferred location.
Jobs should preferentially be scheduled on the machines where the data resides, which reduces data I/O and network transfer. This shortens the run time of the job (the barrel principle: the run time of a job depends on the time needed by its slowest task) and improves performance.
The description of the five properties is reproduced from (https://www.jianshu.com/p/650d6e33914b)
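Some of the properties above can be sketched in plain Python without Spark. The functions hash_partition and pick_host, the host names and the task times are all hypothetical and chosen only for illustration. The sketch assumes integer keys, whose Python hash is stable across runs, and a scheduler that simply prefers a free host holding a replica of the split.

```python
from typing import Hashable, List, Tuple

Pair = Tuple[Hashable, str]


def hash_partition(pairs: List[Pair], num_partitions: int) -> List[List[Pair]]:
    """Place each (key, value) pair in partition hash(key) % num_partitions."""
    partitions: List[List[Pair]] = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def pick_host(preferred: List[str], free_hosts: List[str]) -> str:
    """Prefer a free host that stores a replica of the split (data locality)."""
    for host in preferred:
        if host in free_hosts:
            return host
    return free_hosts[0]  # no local host is free, so the data must be moved


# Properties 1 and 4: key-value data is split by the hash of the key.
pairs = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
partitions = hash_partition(pairs, num_partitions=2)

# Property 2: one task per partition; here each task counts its records.
num_tasks = len(partitions)
task_results = [len(p) for p in partitions]

# Property 5: a split with replicas on node2 and node3 runs on node3,
# because node3 is the only free host that holds the data.
chosen = pick_host(preferred=["node2", "node3"], free_hosts=["node1", "node3"])

# Barrel principle: the job finishes when its slowest task finishes.
task_times = [1.0, 1.2, 4.5]  # hypothetical task durations in seconds
job_time = max(task_times)
```

With two partitions the even keys land in partition 0 and the odd keys in partition 1, so two tasks are run. The job time equals the 4.5 seconds of the slowest task, which is why scheduling tasks next to their data matters.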