Overview of Spark RDD
2022-07-06 02:04:00 【Diligent ls】
I. What is an RDD
An RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark.
In code it is an abstract class that represents an elastic, immutable, partitionable collection of elements that can be computed in parallel.
1. Elastic
Storage elasticity: data switches automatically between memory and disk
Fault-tolerance elasticity: lost data can be recovered automatically
Computation elasticity: failed computations are retried
Partitioning elasticity: data can be repartitioned as needed
2. Distributed
The data is stored on different nodes of the cluster
3. A dataset that does not store data
An RDD encapsulates computation logic; it does not hold the data itself
4. Data abstraction
RDD is an abstract class and must be implemented by concrete subclasses
5. Immutable
The computation logic an RDD encapsulates cannot be changed. To change it, you must create a new RDD that encapsulates the new logic
6. Partitionable, computed in parallel
Note: the code inside RDD operators is executed on the Executor side; code outside the operators is executed on the Driver side.
In Spark, an RDD's computation is executed only when an action operator is encountered; in other words, evaluation is lazy.
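The immutability and lazy evaluation described above can be illustrated with a small toy model. This is a plain-Python sketch, not Spark's API: the `ToyRDD` class and the `double` helper are hypothetical names invented for illustration. It only shows that a transformation records logic and returns a new object, and that nothing runs until an action is called.

```python
from typing import Callable, Iterable, List, Optional


class ToyRDD:
    """Toy stand-in for an RDD: it holds computation logic, never results."""

    def __init__(self,
                 source: Optional[List[int]] = None,
                 parent: Optional["ToyRDD"] = None,
                 func: Optional[Callable[[Iterable], Iterable]] = None):
        self._source = source   # only set on the root of the chain
        self._parent = parent   # the dependency on the previous ToyRDD
        self._func = func       # the logic applied to the parent's output

    def map(self, f):
        # A transformation records logic and returns a NEW ToyRDD.
        return ToyRDD(parent=self, func=lambda it: (f(x) for x in it))

    def filter(self, pred):
        return ToyRDD(parent=self, func=lambda it: (x for x in it if pred(x)))

    def _iterator(self):
        if self._parent is None:
            return iter(self._source)
        return self._func(self._parent._iterator())

    def collect(self):
        # The action: only here is the recorded chain actually evaluated.
        return list(self._iterator())


calls = []


def double(x):
    calls.append(x)  # side effect that reveals when the function really runs
    return x * 2


base = ToyRDD(source=[1, 2, 3, 4])
doubled = base.map(double)              # nothing is computed yet
big = doubled.filter(lambda x: x > 4)   # still nothing is computed

calls_before_action = len(calls)        # 0: no transformation has run
result = big.collect()                  # the action triggers the whole chain
```

After `collect()`, `result` is `[6, 8]` and `base` is unchanged: each transformation produced a new object that wraps the previous one instead of modifying it.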
II. Five characteristics of an RDD
1) A list of partitions
An RDD is made up of many partitions. When Spark runs a computation, each partition is processed by one task, so the number of partitions determines the number of tasks
2) A function for computing each split
Computing an RDD means applying the compute function to each of its splits (partitions)
3) A list of dependencies on other RDDs
RDDs depend on one another, so the chain of dependencies can be traced back to the source
4) Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
If the data in an RDD is in key-value form, you can pass a custom Partitioner to repartition it, for example by the hash of the key
5) Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
These are the best locations for the computation, that is, data locality
When computing a split, it is best to run the task on the machine where that split is stored, which avoids moving data. A split may have several replicas, so there can be more than one preferred location
Jobs should be scheduled preferentially onto the machines where the data resides, which reduces data I/O and network transfer. This shortens the job's running time (barrel principle: the running time of a job is determined by its slowest task) and improves performance
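The locality preference in characteristic 5 and the barrel principle can be sketched as follows. This is a deliberately simplified toy in plain Python, not Spark's scheduler: the node names, split names, run times and the `pick_node` helper are all hypothetical, and the sketch ignores how many tasks a node can run at once.

```python
from typing import Dict, List


def pick_node(preferred: List[str], free_nodes: List[str]) -> str:
    """Prefer a free node that already holds a replica of the split."""
    for node in preferred:
        if node in free_nodes:
            return node
    # No replica holder is free: fall back to any free node (data must move).
    return free_nodes[0]


# Hypothetical replica placement: split name -> nodes holding a copy.
preferred_locations: Dict[str, List[str]] = {
    "split-0": ["node-a", "node-b"],
    "split-1": ["node-b", "node-c"],
    "split-2": ["node-d"],
}
free_nodes = ["node-c", "node-b"]

assignment = {split: pick_node(nodes, free_nodes)
              for split, nodes in preferred_locations.items()}

# Hypothetical per-task run times in seconds. The job finishes only when
# its slowest task does, so the job time is the maximum, not the average.
task_seconds = {"split-0": 4.0, "split-1": 5.5, "split-2": 9.0}
job_seconds = max(task_seconds.values())
```

Here `split-0` and `split-1` run on `node-b`, which holds a replica of each, while `split-2` has no free replica holder and falls back to `node-c`. The job time is 9.0 seconds, set by the slowest task alone.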
The description of these characteristics is reproduced from https://www.jianshu.com/p/650d6e33914b
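Characteristics 1, 2 and 4 (partitions, a per-split compute function, and a partitioner for key-value data) can be illustrated together with a toy model. This is plain Python rather than Spark code: `hash_partition`, `partition_by` and `sum_by_key` are hypothetical helpers, and integer keys stand in for a key's hash code so the placement is easy to follow by hand.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[int, int]


def hash_partition(key: int, num_partitions: int) -> int:
    # Assumption: the integer key is used directly as its hash code.
    # Python's % gives a non-negative result for a positive modulus.
    return key % num_partitions


def partition_by(records: List[Record], num_partitions: int) -> List[List[Record]]:
    """Distribute (key, value) records into partitions by key hash."""
    partitions: List[List[Record]] = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash_partition(key, num_partitions)].append((key, value))
    return partitions


def sum_by_key(partition: List[Record]) -> Dict[int, int]:
    """The per-partition compute function: one call models one task."""
    totals: Dict[int, int] = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


records = [(1, 10), (2, 20), (3, 30), (4, 40), (1, 5), (3, 7)]
partitions = partition_by(records, num_partitions=2)

# One "task" per partition; each task sees only its own partition.
task_results = [sum_by_key(p) for p in partitions]
```

Because every record with the same key lands in the same partition, each task can finish its per-key sums without looking at any other partition. Two partitions mean two tasks, which yield `{2: 20, 4: 40}` and `{1: 15, 3: 37}`.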