Spark Foundation
2022-06-13 03:13:00 【Yiliubei】
1、What is Spark?
Spark is a computing engine built specifically for big data processing.
2、Differences between Spark and MapReduce
Both are distributed computing frameworks. Spark computes primarily in memory, while MapReduce writes intermediate results to HDFS. Spark processes data more than ten times faster than MapReduce. Spark iterates over data both in memory and on disk, and uses a DAG (directed acyclic graph) to split and order the execution of tasks.
3、Spark run modes
Local: for local testing.
Standalone: Spark's built-in resource scheduling framework.
YARN
Mesos
Note: to use YARN as the resource scheduler, a framework must implement the ApplicationMaster interface. Spark implements this interface, so it can run on YARN.
4、What is an RDD?
RDD stands for Resilient Distributed Dataset.
RDD characteristics (see the sketch after this list):
1、An RDD is composed of a series of partitions.
2、A function is applied to each partition.
3、An RDD has dependencies on other RDDs.
4、A partitioner acts on key-value (KV) format RDDs.
5、An RDD provides a list of preferred computation locations (data locality).
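A minimal Scala sketch (local mode assumed; the object name RddPropertiesSketch is illustrative) that touches each of the five characteristics through the public RDD API:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object RddPropertiesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-properties").setMaster("local[2]"))

    // 1. An RDD is a list of partitions.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)
    println(pairs.getNumPartitions)                      // 4

    // 2. A function (here the mapValues closure) is applied to every partition.
    val mapped = pairs.mapValues(_ * 10)

    // 3. An RDD remembers its dependencies on its parent RDDs.
    println(mapped.dependencies)                         // OneToOneDependency on `pairs`

    // 4. A partitioner works on key-value RDDs.
    val partitioned = pairs.partitionBy(new HashPartitioner(2))
    println(partitioned.partitioner)                     // Some(HashPartitioner)

    // 5. Each partition can report preferred (data-local) locations.
    println(mapped.preferredLocations(mapped.partitions(0)))

    sc.stop()
  }
}
```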
5、Notes on RDDs
1、An RDD does not store data itself; an RDD is essentially an iterator.
2、The textFile method internally wraps MapReduce's way of reading files.
3、The data processed inside an RDD is handled as objects.
4、There is no limit on the number or size of partitions.
5、An RDD is made up of partitions, and the partitions are distributed across different nodes.
6、Ways to create an RDD
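Two common ways to create an RDD are from an in-memory collection (parallelize / makeRDD) and from external storage (textFile); an RDD can also be derived from an existing RDD through a transformation. A minimal sketch, assuming local mode and a hypothetical input path:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CreateRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("create-rdd").setMaster("local[2]"))

    // 1. From an in-memory collection, optionally specifying the number of partitions.
    val fromCollection = sc.parallelize(1 to 10, 3)
    val fromSeq        = sc.makeRDD(Seq("a", "b", "c"))

    // 2. From external storage; textFile wraps Hadoop's InputFormat-based (MR-style) file reading.
    //    "data/input.txt" is a hypothetical path.
    val fromFile = sc.textFile("data/input.txt")

    // 3. From an existing RDD, by applying a transformation.
    val derived = fromCollection.map(_ * 2)

    println(fromCollection.getNumPartitions)             // 3
    sc.stop()
  }
}
```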
7、Classification of Spark operators
RDD operators fall into two categories:
Transformation operators: map, flatMap, reduceByKey, etc. Transformations are executed lazily (also called lazy-load execution); see the sketch after this list.
Action operators: foreach, count, etc. Actions trigger the actual execution.
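A minimal sketch of the lazy-execution behaviour (local mode assumed): defining the transformation computes nothing; only the action at the end triggers a job.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyExecutionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("operators").setMaster("local[2]"))

    val nums = sc.parallelize(1 to 5)

    // Transformation: only recorded in the lineage, nothing is printed yet.
    val doubled = nums.map { n => println(s"computing $n"); n * 2 }

    // Action: triggers the job; only now do the "computing ..." lines appear.
    println(doubled.collect().mkString(", "))

    sc.stop()
  }
}
```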
8、Spark code flow
1、Create a SparkConf object.
You can set the application name.
You can set the run mode.
You can set the resources the Spark application requests.
2、Create a SparkContext object.
3、Create RDDs from the SparkContext and process them.
4、An action operator is needed to trigger execution of the transformation operators.
5、Close the Spark context object (SparkContext).
A minimal sketch of this flow is shown below.
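A word-count style sketch of the five steps above, assuming local mode and an in-memory collection in place of a real data source:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CodeFlowSketch {
  def main(args: Array[String]): Unit = {
    // 1. Create the SparkConf: application name, run mode (master), resource settings.
    val conf = new SparkConf()
      .setAppName("code-flow-sketch")
      .setMaster("local[2]")
      .set("spark.executor.memory", "1g")

    // 2. Create the SparkContext.
    val sc = new SparkContext(conf)

    // 3. Create RDDs from the context and process them with transformations.
    val counts = sc.parallelize(Seq("hello spark", "hello world"))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // 4. An action triggers the transformations above.
    counts.foreach(println)

    // 5. Close the SparkContext.
    sc.stop()
  }
}
```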
9、Persistence operators
There are three control (persistence) operators: cache, persist, and checkpoint. All of them can persist an RDD, and the unit of persistence is the partition. cache and persist are both executed lazily, so an action operator is needed to trigger them. The checkpoint operator not only persists the RDD to disk but also cuts the RDD's lineage (its dependencies on parent RDDs). A usage sketch follows.
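A minimal sketch of the three operators, assuming local mode and a hypothetical checkpoint directory:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("persist").setMaster("local[2]"))
    sc.setCheckpointDir("checkpoint-dir")   // hypothetical directory for checkpoint files

    val rdd = sc.parallelize(1 to 1000000).map(_ * 2)

    // cache: shorthand for persist(StorageLevel.MEMORY_ONLY); lazy until an action runs.
    rdd.cache()

    // persist: same idea, but the storage level can be chosen explicitly
    // (an RDD's storage level can only be set once, hence commented out here).
    // rdd.persist(StorageLevel.MEMORY_AND_DISK)

    // checkpoint: writes the RDD to reliable storage and cuts its lineage; also lazy.
    rdd.checkpoint()

    // The action below triggers the computation, the caching, and the checkpoint.
    println(rdd.count())

    sc.stop()
  }
}
```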
10、The two Standalone task-submission workflows and their characteristics
Standalone-client mode (suitable for testing and debugging):
1、The client submits the task and starts a Driver process on the client.
2、The Driver applies to the Master for resources.
3、After receiving the request, the Master starts Executors on the appropriate Worker nodes.
4、After starting, the Executors reverse-register with the Driver.
5、The Driver sends tasks to the Worker side for execution, and the Worker side returns the execution results to the Driver.
Standalone-cluster mode:
1、The client submits the application and asks the Master to start the Driver.
2、After receiving the request, the Master starts the Driver on a random node in the cluster.
3、After the Driver starts, it applies for resources.
4、The Driver sends tasks to Worker nodes for execution.
5、The Workers return the results to the Driver.
Summary: in both Standalone submission modes, the Driver's communication with the cluster covers:
- applying for resources for the application;
- distributing tasks;
- collecting results;
- monitoring task execution.
11、The two YARN task-submission modes
yarn-client mode (involves the client, the ResourceManager, and the NodeManagers):
1、The client submits an application and starts a Driver process on the client.
2、The application asks the ResourceManager (RS) to start an ApplicationMaster (AM) to request resources.
3、On receiving the request, the ResourceManager picks a NodeManager (NM) at random and starts the AM there.
4、After the AM starts, it applies to the ResourceManager for resources to start Executors.
5、The ResourceManager returns resources to the AM.
6、The AM sends commands to the NodeManagers to start Executors.
7、After the Executors start, they reverse-register with the Driver; the Driver sends tasks to the Executors, and the Executors return their execution status and results to the Driver.
yarn-cluster mode:
1、The client submits the application to the ResourceManager and requests that an AM be started.
2、The ResourceManager receives the request and starts the AM on a random NodeManager.
3、After the AM starts, it requests resources from the ResourceManager.
4、The ResourceManager returns resources to the AM.
5、The AM sends commands to the NodeManagers to start Executors.
6、The Executors reverse-register with the Driver running inside the AM; the Driver sends tasks to the Executors.
Yarn-cluster mode is mainly used in production environments.
The role of the ApplicationMaster:
It applies for resources for the current application, sends messages to the NodeManagers to start Executors, and handles task scheduling.
Command to stop a task running on the cluster: yarn application -kill applicationID
Understanding Spark's narrow and wide dependencies and the computation model
Narrow dependency: the partitions of the parent RDD and the child RDD have a one-to-one relationship, and no shuffle is produced.
Wide dependency: the partitions of the parent RDD and the child RDD have a one-to-many relationship, and a shuffle is produced.
Pipeline computation model: pipeline is a computation idea/pattern in which the operators within a stage are chained together, as in the sketch below.
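A minimal sketch (local mode assumed) contrasting a narrow dependency (mapValues, no shuffle, pipelined inside one stage) with a wide dependency (reduceByKey, which introduces a shuffle and a new stage):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DependencySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("deps").setMaster("local[2]"))

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)

    // Narrow dependency: each child partition reads exactly one parent partition.
    val mapped = pairs.mapValues(_ + 1)
    println(mapped.dependencies)        // OneToOneDependency -> no shuffle

    // Wide dependency: a child partition reads data from many parent partitions.
    val reduced = mapped.reduceByKey(_ + _)
    println(reduced.dependencies)       // ShuffleDependency -> shuffle, new stage

    // toDebugString shows the lineage with the stage (pipeline) boundaries.
    println(reduced.toDebugString)

    sc.stop()
  }
}
```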
Spark resource scheduling and task scheduling process
===== Resource scheduling =====
1、Start the Master and the standby Master (a standby Master is needed only for a highly available cluster; otherwise there is no standby Master).
2、Start the Worker nodes. After a Worker node starts successfully, it registers with the Master, which adds the Worker's information to its workers collection.
3、Submit the Application on the client side, which starts a spark-submit process. Pseudo command: spark-submit --master --deploy-mode cluster --class jarPath
4、The client applies to the Master for resources for the Driver. When the application information reaches the Master, the Master adds this Driver's request to its waitingDrivers collection.
5、When the waitingDrivers collection is not empty, the schedule() method is called. The Master looks through the workers collection and starts the Driver on a qualifying Worker node. After the Driver starts successfully, the request is removed from the waitingDrivers collection, and the client's spark-submit process shuts down.
(After the Driver starts successfully, it creates the DAGScheduler and TaskScheduler objects.)
6、Once the TaskScheduler has been created, it applies to the Master for resources for the Application. After the request reaches the Master, the application information is added to the waitingApps collection.
7、When the elements of the waitingApps collection change, the schedule() method is called. The Master looks through the workers collection and starts Executor processes on qualifying Worker nodes.
8、After the Executor processes start successfully, the application information is removed from the waitingApps collection, and the Executors reverse-register with the TaskScheduler. At this point the TaskScheduler holds the list of this batch of Executors.
===== Task scheduling =====
9、Based on the RDDs' dependencies, the job is cut into stages. Each stage is composed of a group of tasks, and each task runs in pipeline fashion.
10、The TaskScheduler distributes tasks according to data locality. (How does the TaskScheduler know where the data is? It calls the HDFS API to obtain the data blocks and their location information.)
11、The TaskScheduler distributes tasks and monitors their execution.
12、If a task fails or straggles (runs very slowly), the task is retried; by default it is retried 3 times.
13、If the task still fails after three retries, it is handed back to the DAGScheduler, which retries the failed stage (only the failed stage). By default the stage is retried 4 times.
14、The Master is then told to kill the Executors in the cluster and release the resources.