Introduction to spark core components
2022-07-04 06:51:00 【A program ape with wet writing】
1、Introduction to RDDs
Spark's core is built on a unified abstraction, the Resilient Distributed Dataset (RDD). This lets Spark's components integrate seamlessly, so big-data processing can be completed within a single application.

An RDD is an abstraction over a distributed dataset. Physically, a dataset may be split into multiple partitions, each stored on a different storage/compute node; the RDD represents the dataset as a whole, without physically gathering the data in one place.

With this abstraction, a user can conveniently operate on a distributed dataset from a single entry point.

RDDs are fault tolerant and read-only: they cannot be modified, but transformation operations can be applied to create new RDDs.
2、Basic RDD operations
RDDs expose a large API surface to users, falling into two main categories: transformations and actions.

(1) Transformations derive new datasets from an RDD (RDDs themselves are immutable); every transformation produces a new RDD.

(2) Actions are typically used to aggregate or output results.
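The split above can be sketched in plain Python with a toy class (this is not Spark's real API, just an illustration): transformations only record a step and return a new object, while an action walks the recorded chain and actually computes.

```python
# Toy sketch of the transformation/action split (illustrative, not Spark's API).
class ToyRDD:
    def __init__(self, data=None, parent=None, fn=None):
        self._data = data        # only set for the source RDD
        self._parent = parent    # parent ToyRDD in the lineage
        self._fn = fn            # recorded transformation, not yet applied

    # --- transformations: return a *new* ToyRDD, compute nothing ---
    def map(self, f):
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, p):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if p(r)])

    # --- action: walks the lineage back to the source and computes ---
    def collect(self):
        if self._parent is None:
            return list(self._data)
        return self._fn(self._parent.collect())

rdd = ToyRDD(data=[1, 2, 3, 4])
doubled = rdd.map(lambda x: x * 2).filter(lambda x: x > 4)  # nothing runs yet
print(doubled.collect())  # the action triggers the whole chain: [6, 8]
```

Note that building `doubled` does no work at all; computation happens only when `collect()` is called.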
3、Introduction to dependencies

In the DAG, the links between RDDs are called dependencies. Dependencies come in two kinds: narrow dependencies (Narrow Dependency) and wide dependencies (Shuffle Dependency).

(1) Under a narrow dependency, each partition of the parent RDD is depended on by only one partition of the child RDD.

(2) Under a wide dependency, a single partition of the parent RDD is depended on by multiple partitions of the child RDD.
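The two cases can be pictured with a minimal sketch (the helper names are made up for illustration): a narrow dependency keeps every record in its own partition, while a wide dependency scatters each parent partition's records across many child partitions.

```python
# Illustrative sketch of narrow vs. wide dependencies between partition lists.
def narrow_map(parent_partitions, f):
    # Narrow: child partition i depends only on parent partition i.
    return [[f(x) for x in part] for part in parent_partitions]

def wide_repartition(parent_partitions, num_child, key=lambda x: x):
    # Wide: each parent partition scatters its records across the child
    # partitions (key(x) % num_child stands in for a hash partitioner),
    # so every child partition may depend on every parent partition.
    children = [[] for _ in range(num_child)]
    for part in parent_partitions:
        for x in part:
            children[key(x) % num_child].append(x)
    return children

parents = [[1, 2, 3], [4, 5, 6]]
print(narrow_map(parents, lambda x: x * 10))   # partition boundaries preserved
print(wide_repartition(parents, 2))            # records cross partitions
```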
4、Introduction to Jobs
In Spark, every action triggers actual computation, while transformations merely record the computation steps without executing them. The computation triggered by each action is called a Job.

Every transformation generates a new RDD, and an action can be viewed as a Job's output.

Each Spark Job can therefore be seen as a series of RDD transformations followed by an output. The transformations produce many intermediate RDDs, and together these RDDs form the execution flow graph: a directed acyclic graph (DAG).
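A Job can be thought of as "replay the recorded lineage once an action fires". The sketch below (names like `Node` and `lineage` are invented for illustration, not Spark internals) walks from the final RDD back to its sources to list the steps one Job would execute.

```python
# Illustrative lineage walk: the chain of transformations one Job replays.
class Node:
    def __init__(self, name, parents=()):
        self.name, self.parents = name, list(parents)

def lineage(node):
    # Depth-first from the final RDD back to the sources, emitting
    # steps in the order a Job would execute them.
    out = []
    for p in node.parents:
        out.extend(lineage(p))
    out.append(node.name)
    return out

src = Node("textFile")
m = Node("map", [src])
f = Node("filter", [m])
print(lineage(f))  # the steps this Job runs, source first
```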

5、Introduction to Stages
When Spark schedules a DAG, it does not simply execute each node's operation in turn; instead, it divides the whole DAG into different Stages.

At scheduling time, submission starts from the last Stage. If that Stage's parent Stages have not completed, its parent Stages are submitted recursively, until a Stage has no parents or all of its parent Stages have completed.
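The recursive submission rule above can be sketched in a few lines (a simplification with invented names, not the real DAGScheduler): before a Stage runs, any unfinished parent Stage is submitted first.

```python
# Sketch of "submit the last Stage; recursively submit unfinished parents first".
def submit(stage, finished, order):
    for parent in stage["parents"]:
        if parent["name"] not in finished:
            submit(parent, finished, order)   # parents must complete first
    order.append(stage["name"])               # then this Stage can run
    finished.add(stage["name"])

s1 = {"name": "stage1", "parents": []}
s2 = {"name": "stage2", "parents": [s1]}
s3 = {"name": "stage3", "parents": [s1, s2]}
order = []
submit(s3, set(), order)                      # submission starts at the last Stage
print(order)                                  # parents always precede children
```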
6、Introduction to Shuffle
Why does Spark divide the DAG into Stages instead of simply executing step by step?

When Spark divides Stages, it starts a new Stage at every wide dependency, because each wide dependency requires a data reshuffling operation (shuffle).

When a shuffle occurs (wide dependency):

From the RDD's perspective, data with different keys in a parent partition is distributed to different child partitions; because a parent partition holds different keys, multiple child partitions depend on it.

From a physical perspective, different partitions live on different nodes, so data must be moved from one node to another; data movement occurs.

Because the shuffle process involves moving data, Spark uses the wide dependencies that trigger shuffles as Stage boundaries. Data must be exchanged between Stages, which involves network transfer and disk I/O; operations inside a Stage, by contrast, need not spill to disk and can run continuously in memory, which greatly accelerates computation. This is an important reason Spark bills itself as "in-memory computing".

The shuffle process divides the DAG into a number of Stages, so that Spark can schedule at Stage granularity. Under this scheduling model, operations in the same Stage run continuously on the same compute node and exchange data in memory for efficient computation, while Stages exchange data with one another through shuffle.

On the other hand, by abstracting computation into tasks, Spark unifies the API, implementing both the various RDD data operations and the shuffle process with the same mechanism.
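A toy picture of the shuffle boundary (plain Python, not Spark internals; the function names are invented): on the map side, each partition writes one bucket per reducer; on the reduce side, each partition reads its bucket from every map output.

```python
# Illustrative shuffle write/read across a Stage boundary.
def shuffle_write(partition, num_reducers, key):
    # Map side: bucket each record by key; bucket i goes to reducer i.
    buckets = [[] for _ in range(num_reducers)]
    for record in partition:
        buckets[key(record) % num_reducers].append(record)
    return buckets

def shuffle_read(all_map_outputs, reducer_id):
    # Reduce side: pull this reducer's bucket from every map task's output.
    return [r for out in all_map_outputs for r in out[reducer_id]]

map_parts = [[(1, "a"), (2, "b")], [(1, "c"), (2, "d")]]
maps = [shuffle_write(p, 2, key=lambda kv: kv[0]) for p in map_parts]
print(shuffle_read(maps, 0))  # every record whose key routes to reducer 0
```

This is exactly the data movement that forces network transfer and disk I/O between Stages, while narrow chains inside a Stage never cross partition boundaries.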

7、Introduction to Tasks
The real computation happens in the Tasks executed by the Executors. Each Task is responsible for computing only one partition. Every RDD must implement a compute method, which defines how to compute a single partition.

(1) Task execution

The compute methods of multiple RDDs form the whole computation chain. During a shuffle, the parent RDD's compute method performs the shuffle write and the child RDD's compute method performs the shuffle read; working together, they complete the shuffle operation.

In this way, Spark completes the entire computation chain simply through the RDDs' unified compute method.
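The per-partition compute chain can be sketched as follows (illustrative class names, not Spark's implementation): each RDD computes one partition by first asking its parent to compute the same partition, so one Task pulls a single partition through the whole narrow chain.

```python
# Illustrative per-partition compute chain, as executed by one Task.
class SourceRDD:
    def __init__(self, partitions):
        self.partitions = partitions
    def compute(self, split):
        return list(self.partitions[split])

class MapRDD:
    def __init__(self, parent, f):
        self.parent, self.f = parent, f
    def compute(self, split):
        # Narrow chain: compute the parent's partition, then apply f.
        return [self.f(x) for x in self.parent.compute(split)]

src = SourceRDD([[1, 2], [3, 4]])
chain = MapRDD(MapRDD(src, lambda x: x + 1), lambda x: x * 10)
print(chain.compute(1))  # one Task: one partition through the whole chain
```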


[Figure: horizontal perspective, the computation of a partition]


Summary

Viewed vertically, each computation step forms an RDD; all the RDDs and their dependencies make up the DAG, and the shuffle process divides the DAG into individual Stages.

Viewed horizontally, the computation of each partition takes as input the corresponding partition computed from the previous RDD (or file); across the whole computation, each partition and its dependencies form the complete computation chain.

The horizontal perspective provides the unified high-level abstraction of RDDs: users need not worry about whether an operation involves a shuffle, only about the data transformations between RDDs, which greatly simplifies the programming model.

The vertical perspective provides Stage-level scheduling, so computations within the same Stage never need to spill to disk, which greatly improves computation speed.
