Overview of Spark RDD
2022-07-06 02:04:00 【Diligent ls】
I. What is an RDD?
An RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark.
In code it is an abstract class that represents a resilient, immutable, partitionable collection of elements that can be computed in parallel.
1. Resilient
Storage resilience: data switches automatically between memory and disk
Fault-tolerance resilience: lost data can be recovered automatically
Computation resilience: failed computations are retried
Partitioning resilience: data can be repartitioned as needed
2. Distributed
Data is stored on different nodes of the big data cluster
3. Dataset: does not store data
An RDD encapsulates computation logic; it does not store the data itself
4. Data abstraction
RDD is an abstract class; concrete subclasses provide the implementation
5. Immutable
An RDD encapsulates computation logic and cannot be changed. To change it, you can only create a new RDD that encapsulates the new computation logic
6. Partitionable, computed in parallel
Note: the logic inside RDD operators runs on the Executors; everything outside the RDD operators runs on the Driver.
In Spark, an RDD's computation runs only when an action operator is encountered; that is, evaluation is lazy.
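Immutability and lazy evaluation can be illustrated with a small pure-Python toy. This is only a sketch of the idea, not Spark's API: the class `ToyRDD` and the helper `traced_double` are hypothetical names invented for this example. Each transformation returns a new object that wraps computation logic, and nothing runs until the action `collect` is called.

```python
class ToyRDD:
    """Toy stand-in for an RDD: holds computation logic, not results."""

    def __init__(self, compute):
        # compute is a zero-argument function that yields the elements
        self._compute = compute

    @staticmethod
    def from_list(items):
        data = list(items)
        return ToyRDD(lambda: iter(data))

    def map(self, f):
        # Transformation: returns a NEW ToyRDD, nothing runs yet
        return ToyRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred):
        # Transformation: also lazy, also returns a new object
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    def collect(self):
        # Action: only here is the chain actually evaluated
        return list(self._compute())


calls = []  # records every element the mapped function has seen


def traced_double(x):
    calls.append(x)
    return x * 2


base = ToyRDD.from_list([1, 2, 3, 4])
doubled = base.map(traced_double)
big = doubled.filter(lambda x: x > 4)

print(len(calls))      # 0: no transformation has run yet
print(big.collect())   # [6, 8]
print(base.collect())  # [1, 2, 3, 4]: the original is unchanged
```

The call counter stays at zero until `collect` runs, which is the toy equivalent of Spark deferring work until an action operator appears.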
II. The five characteristics of an RDD
1) A list of partitions
An RDD is made up of many partitions. When Spark runs a computation, each partition is processed by one task, so there are as many tasks as there are partitions
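The one-task-per-partition relationship can be sketched in plain Python. The function `slice_into_partitions` and the task names are hypothetical, and the even-slicing rule used here is just one simple choice; Spark's own partitioning of an input may differ in detail.

```python
def slice_into_partitions(data, num_partitions):
    """Split data into num_partitions contiguous, nearly equal slices."""
    n = len(data)
    return [
        data[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]


partitions = slice_into_partitions(list(range(10)), 3)
# One task per partition
tasks = [f"task-{i}" for i, _ in enumerate(partitions)]

print(partitions)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
print(tasks)       # ['task-0', 'task-1', 'task-2']
```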
2) A function for computing each split
Computing an RDD amounts to running the computation on each of its splits (partitions)
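As a rough illustration, the same compute function can be applied to every partition independently. The sketch below uses a local thread pool in place of cluster executors; `compute_partition` and `run_job` are hypothetical names, and the squaring logic is an arbitrary stand-in for the real computation.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_partition(partition):
    """The per-partition compute function: here, square every element."""
    return [x * x for x in partition]


def run_job(partitions):
    # One task per partition; tasks run independently of each other
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        # pool.map returns results in the same order as the input partitions
        return list(pool.map(compute_partition, partitions))


partitions = [[1, 2], [3, 4], [5]]
results = run_job(partitions)
print(results)  # [[1, 4], [9, 16], [25]]
```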
3) A list of dependencies on other RDDs
RDDs depend on one another, so any RDD can be traced back through its lineage
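A minimal sketch of why dependencies matter: if each node remembers its parent and the function that derives it, a single lost partition can be rebuilt by walking back to the source. The class `LineageNode` is hypothetical, and it models only the simple case in which each partition depends on exactly one parent partition.

```python
class LineageNode:
    """Toy RDD node that remembers its parent and how to derive itself."""

    def __init__(self, name, parent=None, transform=None, source=None):
        self.name = name
        self.parent = parent        # dependency on another node
        self.transform = transform  # function applied to the parent's partition
        self.source = source        # partition data, only for the root node

    def lineage(self):
        # Walk the dependency chain from this node back to the root
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain

    def compute(self, index):
        # Rebuild one partition by walking back to the source
        if self.parent is None:
            return list(self.source[index])
        return self.transform(self.parent.compute(index))


root = LineageNode("source", source=[[1, 2], [3, 4]])
plus_one = LineageNode("plus_one", parent=root,
                       transform=lambda part: [x + 1 for x in part])
evens = LineageNode("evens", parent=plus_one,
                    transform=lambda part: [x for x in part if x % 2 == 0])

print(evens.lineage())   # ['evens', 'plus_one', 'source']
print(evens.compute(1))  # [4]: only partition 1 is recomputed
```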
4) Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
If an RDD holds data in key-value form, you can pass a custom Partitioner to repartition it, for example by the hash value of the key
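Hash partitioning can be sketched as "partition index = hash(key) mod number of partitions". The helpers `stable_hash` and `hash_partition` are hypothetical, and CRC32 is used only to keep the demo deterministic across runs; it is an assumption of this example, not the hash function Spark uses.

```python
import zlib


def stable_hash(key):
    # Deterministic hash for the demo; Python's built-in hash() of a str
    # changes from one process to the next
    return zlib.crc32(repr(key).encode("utf-8"))


def hash_partition(pairs, num_partitions):
    """Place each (key, value) pair in partition hash(key) mod num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[stable_hash(key) % num_partitions].append((key, value))
    return partitions


pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
for i, part in enumerate(hash_partition(pairs, 2)):
    print(i, part)
```

Whatever the hash values turn out to be, all pairs that share a key land in the same partition, which is the property key-based operations rely on.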
5) Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
The preferred location is the best place to run the computation, i.e. data locality.
When computing a split, it is best to run the task on the machine where the split is stored, which avoids moving data. A split may have several replicas, so there may be more than one preferred location.
Jobs should be scheduled preferentially onto the machines that hold the data. This reduces data I/O and network transfer and therefore shortens the job's running time (barrel principle: a job's running time is determined by the slowest task), which improves performance.
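The locality preference can be sketched as a tiny placement function: try the nodes that hold a replica first, and fall back to any free node otherwise. The function `pick_node` and the node names are hypothetical, and Spark's real scheduler is considerably more elaborate than this.

```python
def pick_node(preferred, free_slots):
    """Return (node, is_local) for one task.

    preferred:  nodes holding a replica of the task's split, in order
    free_slots: dict of node name -> number of free task slots
    """
    for node in preferred:
        if free_slots.get(node, 0) > 0:
            return node, True
    # No replica holder is free: fall back to any node with capacity
    for node, slots in free_slots.items():
        if slots > 0:
            return node, False
    return None, False


free_slots = {"node1": 0, "node2": 1, "node3": 2}
print(pick_node(["node1", "node2"], free_slots))  # ('node2', True)
print(pick_node(["node1"], free_slots))           # ('node2', False)
```

In the second call the only replica holder is busy, so the task runs elsewhere and its data has to be moved over the network.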
The description of the five characteristics is reproduced from https://www.jianshu.com/p/650d6e33914b