Introduction to Spark Core Components
2022-07-04 06:51:00 【A program ape with wet writing】
An introduction to the core components of Spark.
1. Introduction to RDD
The core of Spark is built on a unified abstraction, the Resilient Distributed Dataset (RDD). This lets Spark's components integrate seamlessly, so that big data processing can be completed within a single application.
An RDD is an abstraction over a distributed dataset. In terms of physical storage, a dataset may be divided into multiple partitions, each stored on a different storage/compute node. The RDD sits above the dataset as an abstraction: it represents the entire dataset, but it does not physically gather the data in one place.
With the RDD abstraction, users can conveniently operate on a distributed dataset from a single entry point.
RDDs come with a fault-tolerance mechanism and are read-only: an RDD cannot be modified, but transformation operations can be applied to it to create new RDDs.
2. Basic RDD operations
RDDs provide a large API for users, falling into two main categories: transformations and actions.
(1) Transformations derive new RDDs from existing ones (RDDs themselves are immutable); every transformation produces a new RDD.
(2) Actions are usually used to aggregate or output results.
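The split between transformations and actions can be illustrated with a minimal sketch. This is not Spark's real API; `MiniRDD` and its methods are purely hypothetical, but they show the key idea: transformations only record a step and return a new (immutable) object, while an action actually runs the recorded chain.

```python
# A toy model (not Spark) of transformations vs. actions.
class MiniRDD:
    def __init__(self, data, steps=None):
        self.data = data              # underlying dataset (never mutated)
        self.steps = steps or []      # recorded transformation steps

    def map(self, fn):                # transformation: returns a NEW MiniRDD
        return MiniRDD(self.data, self.steps + [("map", fn)])

    def filter(self, fn):             # transformation: returns a NEW MiniRDD
        return MiniRDD(self.data, self.steps + [("filter", fn)])

    def collect(self):                # action: actually executes the chain
        out = self.data
        for kind, fn in self.steps:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = MiniRDD([1, 2, 3, 4])
doubled = rdd.map(lambda x: x * 2)        # nothing is computed yet
evens = doubled.filter(lambda x: x > 4)   # still nothing computed
print(evens.collect())                    # [6, 8] -- computation happens here
```

Note that `rdd.data` is untouched after the whole chain runs; each transformation produced a new object rather than modifying its source.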
3. Introduction to dependencies
In the DAG, the link between one RDD and another is called a dependency.
Dependencies are divided into narrow dependencies (Narrow Dependency) and wide dependencies (Shuffle Dependency).
(1) In a narrow dependency, each partition of the parent RDD is depended on by only one partition of the child RDD.
(2) In a wide dependency, a single partition of the parent RDD is depended on by multiple partitions of the child RDD.
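The two kinds of dependency can be sketched with plain lists standing in for partitions (a hypothetical illustration, not Spark code). In the narrow case each child partition comes from exactly one parent partition; in the wide case records are redistributed by key, so one parent partition feeds several child partitions.

```python
# Two parent partitions of (key, value) records.
parent_partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow dependency: each child partition is computed from exactly one
# parent partition (e.g. a map), so no data moves between partitions.
narrow_child = [[(k, v * 10) for k, v in part] for part in parent_partitions]

# Wide dependency: records are redistributed by key, so one parent
# partition contributes to several child partitions.
num_child = 2
wide_child = [[] for _ in range(num_child)]
for part in parent_partitions:
    for k, v in part:
        target = ord(k) % num_child       # deterministic toy partitioner
        wide_child[target].append((k, v))

print(narrow_child)  # [[('a', 10), ('b', 20)], [('a', 30), ('c', 40)]]
print(wide_child)    # [[('b', 2)], [('a', 1), ('a', 3), ('c', 4)]]
```

In the wide result, parent partition 0 contributed records to both child partitions, which is exactly the "one parent partition, multiple child partitions" pattern described above.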
4. Introduction to jobs
In Spark, each action triggers real computation, while transformations only record computation steps without actually computing anything. The computation triggered by each action is called a Job.
Each transformation generates a new RDD, and an action can be viewed as a Job's output.
Each Spark Job can therefore be seen as an RDD undergoing several transformations and then producing output.
The transformations produce many new RDDs along the way; these RDDs and their dependencies make up the execution flow graph (the DAG).
5. Introduction to stages
When scheduling the DAG, Spark does not directly execute the operation of each node in turn; instead, it divides the whole DAG into different stages.
When scheduling, submission starts from the last stage. If that stage's parent stages have not completed, its parent stages are submitted recursively, until a stage has no parent stages or all of its parent stages have completed.
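The recursive submission rule above can be sketched in a few lines. The stage names and the `parents` map are illustrative, not Spark internals; the point is only the order in which stages get submitted.

```python
# Minimal sketch of "submit the last stage; recurse into unfinished parents".
submitted = []

def submit_stage(stage, parents, done):
    """parents: stage -> list of parent stages; done: set of finished stages."""
    for p in parents.get(stage, []):
        if p not in done:
            submit_stage(p, parents, done)   # parents must run first
    submitted.append(stage)
    done.add(stage)   # assume a stage finishes once submitted (a simplification)

# Toy DAG: stage2 depends on stage0 and stage1.
parents = {"stage2": ["stage0", "stage1"]}
submit_stage("stage2", parents, set())
print(submitted)   # ['stage0', 'stage1', 'stage2']
```

Even though scheduling starts from the final stage, the recursion guarantees that parent stages execute before their children.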
6. Introduction to shuffle
Why does Spark divide the DAG into stages instead of simply executing the steps in order?
When dividing stages, Spark starts a new stage every time it encounters a wide dependency, because every wide dependency requires a "data shuffle" operation (shuffle).
When a shuffle (wide dependency) occurs:
From the RDD's perspective, the data for different keys in a parent partition is distributed to different child partitions; because a parent partition holds different keys, multiple child partitions depend on it.
From a physical perspective, different partitions are distributed across different nodes, so data must be moved from one node to another: data movement occurs.
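A toy shuffle makes the data movement concrete (a hypothetical sketch, not Spark's implementation). Each list stands for a parent partition on a different "node"; after redistribution by key, every record for a given key lives in exactly one child partition, which is what makes key-based aggregation possible.

```python
# Two parent partitions on two hypothetical "nodes".
parent = [
    [("a", 1), ("b", 1)],
    [("a", 2), ("b", 2), ("a", 3)],
]
num_child = 2
child = [[] for _ in range(num_child)]

# The shuffle: each record crosses to the child partition chosen by its key.
for partition in parent:
    for key, value in partition:
        target = ord(key) % num_child     # deterministic toy partitioner
        child[target].append((key, value))

# After the shuffle a per-key aggregation needs no further data movement.
sums = {}
for partition in child:
    for key, value in partition:
        sums[key] = sums.get(key, 0) + value

print(child)   # [[('b', 1), ('b', 2)], [('a', 1), ('a', 2), ('a', 3)]]
print(sums)    # {'b': 3, 'a': 6}
```

Before the shuffle, records for key "a" were spread across both nodes; afterward they sit together, so the sum for each key can be computed locally.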
Because the shuffle process involves data movement, Spark divides stages at the wide dependencies that trigger shuffles. Stages must exchange data with one another, which involves network transfer and disk I/O; operations inside a stage, by contrast, need not spill to disk and can be computed continuously in memory, which greatly accelerates computation. This is also an important reason why Spark describes itself as based on "in-memory computing".
The shuffle process divides the DAG into a number of stages, so that Spark can schedule at stage granularity.
Under this scheduling mode, the operations of the same stage run continuously on the same compute node, exchanging data in memory for efficient computation, while stages exchange data with each other through shuffle.
On the other hand, by abstracting computation into tasks, Spark unifies the API, implementing both ordinary operations on RDD data and the shuffle process with the same mechanism.
7. Introduction to tasks
The real computation happens in the tasks executed on executors. Each task is responsible for computing only one partition, and every RDD must implement a compute method, which defines how to compute a single partition.
(1) Task execution
The compute methods of multiple RDDs form the whole computation chain. During a shuffle, the parent RDD's compute method performs the shuffle write and the child RDD's compute method performs the shuffle read; together they complete the shuffle operation.
In this way, Spark completes the entire computation chain through the RDD's unified compute method.
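The per-partition compute chain can be sketched as follows. The class names and method shapes here are illustrative (Spark's real `compute` takes a `Partition` and a `TaskContext` and returns an iterator), but the delegation pattern is the point: each RDD computes one partition by pulling the same partition from its parent and applying its own step.

```python
# Minimal sketch of the compute chain over a single partition.
class SourceRDD:
    def __init__(self, partitions):
        self.partitions = partitions      # data already split into partitions

    def compute(self, split):
        return self.partitions[split]     # leaf of the chain: just read

class MapRDD:
    def __init__(self, parent, fn):
        self.parent, self.fn = parent, fn

    def compute(self, split):
        # pull the parent's partition, then apply this RDD's own step
        return [self.fn(x) for x in self.parent.compute(split)]

source = SourceRDD([[1, 2], [3, 4]])
chain = MapRDD(MapRDD(source, lambda x: x + 1), lambda x: x * 10)
print(chain.compute(0))   # [20, 30] -- partition 0 through the whole chain
print(chain.compute(1))   # [40, 50] -- each partition is computed independently
```

Each `compute(split)` call corresponds to the work of one task: one partition flowing through the whole chain, independently of the other partitions.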
Horizontal perspective: the computation of a partition
Summary
Viewed vertically, each computation step forms an RDD; all the RDDs and their dependencies make up the DAG, and the shuffle process divides the DAG into stages one by one.
Viewed horizontally, the computation of each partition's data takes the corresponding partition computed by the previous RDD (or read from the source file) as its input. Across the whole computation, each partition and its dependencies form a complete computing chain.
The horizontal perspective provides the RDD's unified high-level abstraction: users need not worry about operations such as shuffle and only care about RDD data transformations, which greatly simplifies the programming model.
The vertical perspective provides stage-level scheduling, so that all computation within the same stage need not spill to disk, which greatly improves computation speed.