Introduction to Spark Core Components
2022-07-04 06:51:00 【A program ape with wet writing】
1、Introduction to RDDs
The core of Spark is built on a unified abstraction, the Resilient Distributed Dataset (RDD). This lets Spark's components integrate seamlessly, so that big data processing can be completed within a single application.
An RDD is an abstraction over a distributed dataset. In terms of physical storage, a dataset may be divided into multiple partitions, and each partition may be stored on a different storage/compute node. The RDD is an abstraction over that dataset: it represents the entire dataset, but it does not physically gather the data in one place.
With the RDD abstraction, users can conveniently operate on a distributed dataset from a single entry point.
An RDD has a fault-tolerance mechanism and is read-only: it cannot be modified, but transformation operations can be applied to it to create new RDDs.
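As a minimal sketch of these ideas (the app name, master URL, and variable names below are illustrative, not from the original), an RDD can be built from an in-memory collection; it is split across partitions, and transforming it yields a new RDD rather than modifying the old one:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal local setup; "rdd-intro" and local[*] are illustrative choices.
val conf = new SparkConf().setAppName("rdd-intro").setMaster("local[*]")
val sc   = new SparkContext(conf)

// One logical dataset, physically split into 4 partitions.
val nums = sc.parallelize(1 to 100, numSlices = 4)
println(nums.getNumPartitions)        // 4

// RDDs are read-only: map() does not modify nums, it creates a new RDD.
val squares = nums.map(n => n * n)
```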
2、Basic RDD Operations
RDDs provide a large number of APIs for users, falling into two main categories: transformations and actions.
(1) Transformations are used to transform an RDD (RDDs are immutable); each transformation produces a new RDD;

(2) Actions are typically used to summarize or output results; see the sketch after this list.
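A minimal sketch of the two categories, assuming the `sc` context from the previous example (the sample data is illustrative):

```scala
// Transformations: only describe new RDDs, nothing runs yet.
val words    = sc.parallelize(Seq("spark", "rdd", "stage", "task"))
val lengths  = words.map(_.length)      // transformation -> new RDD
val longOnes = lengths.filter(_ > 4)    // transformation -> new RDD

// Actions: trigger the actual computation and return a result.
println(longOnes.count())               // action -> 2 ("spark", "stage")
longOnes.collect().foreach(println)     // action -> prints 5 and 5
```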
3、Introduction to Dependencies
In the DAG, the link between one RDD and another is called a dependency.
Dependencies are divided into narrow dependencies (Narrow Dependency) and wide dependencies (Shuffle Dependency).
(1) In a narrow dependency, each partition of the parent RDD is depended on by only one partition of the child RDD;
(2) In a wide dependency, one partition of the parent RDD is depended on by multiple partitions of the child RDD, as illustrated below.
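A hedged illustration of the two dependency types, again assuming `sc` from above: `mapValues` keeps each child partition tied to one parent partition (narrow), while `reduceByKey` must group equal keys together and so generally creates a wide (shuffle) dependency:

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 2)

// Narrow dependency: each child partition reads exactly one parent partition.
val doubled = pairs.mapValues(_ * 2)

// Wide (shuffle) dependency: rows with the same key must be grouped together,
// so one parent partition can feed several child partitions.
val sums = doubled.reduceByKey(_ + _)
println(sums.collect().mkString(", "))  // (a,8), (b,4) in some order
```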
4、Introduction to Jobs
In Spark, each action triggers the actual computation, while transformations merely record computation steps without computing anything. The computation triggered by each action is called a Job.

Each transformation generates a new RDD, and an action can be viewed as the output of a Job.

Each Spark Job can thus be seen as a series of RDD transformations followed by an output.
During those transformations, many new RDDs are created; these RDDs and their dependencies constitute the execution flow graph (the DAG).
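A minimal sketch of this laziness (sample data and names are illustrative): the transformations below only record lineage, and a Job is submitted only when the action runs:

```scala
// Only lineage is recorded here; no Job is submitted yet.
val logs    = sc.parallelize(Seq("INFO ok", "ERROR boom", "INFO fine"))
val errors  = logs.filter(_.startsWith("ERROR"))
val lens    = errors.map(_.length)

// count() is an action: it submits a Job that executes the recorded DAG.
println(lens.count())                   // 1
```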

5、Introduction to Stages
When Spark schedules the DAG, it does not directly execute the operations of each node in turn; instead, it divides the whole DAG into different Stages.

At scheduling time, submission starts from the last Stage. If that Stage's parent Stages have not completed, its parent Stages are submitted recursively, until a Stage either has no parent Stages or all of its parent Stages have completed, as sketched below.
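A hypothetical, heavily simplified sketch of this recursive submission; Spark's actual logic lives in its DAGScheduler and is far more involved (event loops, waiting/pending stage sets, retries):

```scala
// Hypothetical model only: a stage knows its parents and whether it is done.
case class Stage(id: Int, parents: Seq[Stage], var done: Boolean = false)

def runStage(stage: Stage): Unit = {
  println(s"running stage ${stage.id}")
  stage.done = true
}

// Recursively finish any unfinished parents, then run this stage.
def submit(stage: Stage): Unit = {
  stage.parents.filterNot(_.done).foreach(submit)
  if (!stage.done) runStage(stage)
}

val s0 = Stage(0, Nil)
val s1 = Stage(1, Seq(s0))
val s2 = Stage(2, Seq(s0, s1))
submit(s2)  // prints: running stage 0, 1, 2
```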
6、Introduction to Shuffle
Why does Spark divide the DAG into Stages instead of executing it directly step by step?
When dividing stages, Spark draws a new stage boundary every time it encounters a wide dependency, because every wide dependency requires a data "shuffle" operation.

When a shuffle (wide dependency) occurs:
From the RDD's point of view, data with different keys in a parent partition is distributed to different child partitions; because the parent partition holds different keys, different child partitions depend on it.
From a physical point of view, different partitions are distributed across different nodes, so data must be moved from one node to another; data movement occurs.

Because the shuffle process involves data movement, Spark divides Stages at the wide dependencies that trigger shuffles. Stages must exchange data with one another, which involves network transfer and disk I/O; operations inside a Stage, by contrast, need not spill to disk and can run continuously in memory. This greatly accelerates computation and is an important reason why Spark describes itself as based on "in-memory computing".
The shuffle process divides the DAG into a number of Stages, allowing Spark to schedule at Stage granularity.
Under this scheduling model, a Stage can run continuously on the same compute node: data inside a Stage is exchanged in memory for efficient computation, while data between Stages is exchanged through the shuffle.
On the other hand, by abstracting computation into tasks, Spark unifies the API, implementing both the various operations on RDD data and the shuffle process.
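A minimal sketch of this behavior, assuming `sc` from above: the `flatMap` and `map` steps are pipelined inside one Stage, and `reduceByKey` introduces the shuffle boundary:

```scala
val text = sc.parallelize(Seq("a a b", "b c"))

// flatMap and map are pipelined inside one stage: each partition is
// transformed in memory with no network traffic or disk spill.
val counts = text
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)     // wide dependency: a shuffle starts a new stage

// The indented lineage string reflects the shuffle (stage) boundary.
println(counts.toDebugString)
counts.collect().foreach(println)   // (a,2), (b,2), (c,1) in some order
```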

7、Introduction to Tasks
The real computation happens in the Tasks executed by Executors. Each Task is responsible for computing only one partition, and every RDD must implement a compute method, which defines how to compute a single partition.

(1) Task implementation
The compute methods of multiple RDDs form the whole computation chain. During a shuffle, the parent RDD's compute method completes the shuffle write, and the child RDD's compute method completes the shuffle read; together, the two complete the shuffle operation.
In this way, Spark completes the whole computation chain purely through the RDDs' unified compute method.
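A hedged sketch of that unified interface: `DoubledRDD` below is a hypothetical class (Spark's own `MapPartitionsRDD` plays this role for `map()`/`filter()`), and the real `org.apache.spark.rdd.RDD` base class carries much more, but `compute` and `getPartitions` are the actual methods a concrete RDD overrides:

```scala
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical one-to-one RDD that doubles its parent's elements.
class DoubledRDD(parent: RDD[Int]) extends RDD[Int](parent) {

  // A Task calls compute() for exactly one partition: it pulls the parent's
  // iterator for the same partition and transforms it lazily, element by element.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    parent.iterator(split, context).map(_ * 2)

  // Narrow, one-to-one dependency: reuse the parent's partitions as-is.
  override protected def getPartitions: Array[Partition] = parent.partitions
}
```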


(Figure: horizontal perspective, the computation of a partition)


Summary
Viewed vertically, each computation step forms an RDD; all the RDDs and their dependencies make up the DAG, and the shuffle process divides the DAG into individual Stages.
Viewed horizontally, the computation of each partition's data takes the corresponding partition of the previous RDD (or input file) as its input; over the whole computation process, each partition and its dependencies form the complete computation chain.
The horizontal perspective provides the RDD's unified high-level abstraction: users need not worry about whether operations such as shuffle occur, only about the data transformations between RDDs, which greatly simplifies the programming model.
The vertical perspective provides Stage-level scheduling, so that none of the computation within the same Stage needs to spill to disk, which greatly improves computation speed.
