Spark Run Modes Explained in Detail (Local + Standalone + YARN)
2022-07-01 03:33:00 【YaoYong_ BigData】
I. Overview

Spark has several run modes:
1. It can run on a single machine, which is called Local mode.
2. It can use Spark's own resource scheduling system, which is called Standalone mode.
3. It can use YARN, Mesos, or Kubernetes as the underlying resource scheduler, which is called Spark on YARN, Spark on Mesos, or Spark on K8S respectively.
II. Client and Cluster Submit Modes
The Driver is the master process of a Spark application. It runs the application's main() method, creates the SparkContext, interacts with the Spark cluster, submits Spark jobs and converts each job into Tasks (a job consists of multiple Tasks), and then schedules and monitors those Tasks across the Executor processes.
Depending on how the application is submitted, the Driver runs in different places in the cluster. There are two submit modes, Client and Cluster, with Client being the default; the mode is chosen with the --deploy-mode parameter when submitting an application to the cluster.
1. client mode
1) In client mode the Driver process runs on the node that submits the application (the Master node in the examples below), not on a Worker node, so relative to the Worker cluster that does the actual computation, the Driver acts as a third-party "client".
2) Since the Driver process is not on a Worker node, it is independent of the Workers and does not consume Worker cluster resources.
3) In client mode the submitting node and the Worker nodes must be on the same LAN, because the Driver has to communicate with the Executors; for example, the Driver distributes the Jar package to the Executors via Netty HTTP and assigns tasks to them.
4) client mode has no supervised-restart mechanism: if the Driver process dies, an external program has to restart it.
2. cluster mode
1) The Driver program runs on a node in the Worker cluster rather than on the Master node, and that node is designated by the Master.
2) The Driver program therefore occupies Worker resources.
3) In cluster mode the Master can use --supervise to monitor the Driver and restart it automatically if it dies.
4) In cluster mode the Master node and the Worker nodes are generally not on the same LAN, so the client cannot distribute the Jar package to each Worker; cluster mode therefore requires the Jar package to be placed in the corresponding directory on every Worker node in advance, or made available on shared storage as sketched below.
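One common way to avoid copying the Jar to every Worker by hand is to put it on shared storage such as HDFS and submit the globally visible HDFS URL instead. A minimal sketch, assuming an HDFS cluster is reachable (the HDFS paths are hypothetical):

# Upload the application Jar to HDFS once
hdfs dfs -mkdir -p /spark/jars
hdfs dfs -put ../lib/spark-examples-1.6.0-hadoop2.6.0.jar /spark/jars/

# Submit in cluster mode; each Worker fetches the Jar from HDFS
./spark-submit \
--master spark://node1:7077 \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
hdfs:///spark/jars/spark-examples-1.6.0-hadoop2.6.0.jar \
100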
3. So should you choose client mode or cluster mode?
Generally speaking, if the node that submits the task (here, the Master) and the Worker cluster are on the same network, client mode is more appropriate.
If the submitting node is far away from the Worker cluster, cluster mode is preferred to minimize the network latency between the Driver and the Executors.
4. Configuring the Spark run mode
Based on where the Driver runs, there are two modes: Client mode and Cluster mode.
When submitting a task to Spark, the following two parameters together determine how Spark runs:
- --master MASTER_URL: determines which cluster the Spark task is submitted to.
- --deploy-mode DEPLOY_MODE: determines where the Driver runs; the value is either client or cluster.
| Master URL | Meaning |
|---|---|
| local | Run locally with a single worker thread, i.e. no parallelism. |
| local[K] | Run locally with K worker threads; K is usually set to the number of CPU cores on the machine. |
| local[*] | Run locally with as many worker threads as the machine has CPU cores. |
| spark://HOST:PORT | Standalone mode, the cluster mode Spark provides itself; the default port is 7077. |
| mesos-client | ./spark-shell --master mesos://host:port --deploy-mode client |
| mesos-cluster | ./spark-shell --master mesos://host:port --deploy-mode cluster |
| yarn-client | Run on a YARN cluster with the Driver process local and the Executor processes on the YARN cluster: ./spark-shell --master yarn --deploy-mode client. The YARN cluster address must be defined in the HADOOP_CONF_DIR or YARN_CONF_DIR variable. |
| yarn-cluster | Run on a YARN cluster with both the Driver and the Executor processes on the YARN cluster: ./spark-shell --master yarn --deploy-mode cluster. The YARN cluster address must be defined in the HADOOP_CONF_DIR or YARN_CONF_DIR variable. |
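For example, the same application can be sent to different clusters purely by changing these two parameters. A sketch, where the host name and app.jar are hypothetical:

# Standalone cluster, Driver stays on the submitting client
bin/spark-submit --master spark://node1:7077 --deploy-mode client --class org.apache.spark.examples.SparkPi app.jar 100

# YARN cluster, Driver started inside the cluster
bin/spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi app.jar 100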
III. Local Mode
Local mode is an environment where Spark code can run on a single machine without any other node resources. It is typically used for teaching, debugging, and demonstrations.
The official Pi-calculation example:
$ bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--executor-memory 1G \
--total-executor-cores 2 \
./examples/jars/spark-examples_2.11-2.1.1.jar \
100

Basic syntax:
bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]

Parameter description:
- --master: specifies the Master address; the default is local.
- --class: the entry class of your application (e.g. org.apache.spark.examples.SparkPi).
- --deploy-mode: whether to launch your Driver on a Worker node (cluster) or locally as an external client (client); default: client.
- --conf: an arbitrary Spark configuration property in key=value format. If the value contains spaces, quote it as "key=value".
- application-jar: the packaged application jar, including dependencies. The URL must be globally visible inside the cluster, e.g. an hdfs:// path on a shared storage system; if it is a file:// path, then every node must have the same jar at that path.
- application-arguments: the arguments passed to the main() method.
- --executor-memory 1G: gives each Executor 1 GB of usable memory.
- --total-executor-cores 2: limits the total number of CPU cores used by all Executors to 2.
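To make Local mode explicit rather than relying on the default, --master can be passed directly. A minimal sketch; local[*] starts as many worker threads as the machine has CPU cores, and the quotes keep the shell from expanding the brackets:

bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master "local[*]" \
./examples/jars/spark-examples_2.11-2.1.1.jar \
100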
IV. Standalone Mode
In Spark Standalone mode, resource scheduling is implemented by Spark itself. Spark Standalone is a Master-Slaves cluster architecture, and like most Master-Slaves clusters it suffers from a Master single point of failure. Spark offers two solutions to this problem:
Single-node recovery with a local file system (Single-Node Recovery with Local File System): the registration information of Applications and Workers is written to files, so that when the Master goes down, a restarted Master process can resume its work. This approach is only suitable for development or test environments.
Standby Masters with ZooKeeper: ZooKeeper provides a leader-election mechanism which guarantees that, although the cluster has multiple Masters, only one is Active and the rest are Standby. When the Active Master fails, one of the Standby Masters is elected. Applications that are already running are not affected during recovery, because an Application applies for resources from the Master before it runs, and at runtime the Driver communicates with the Executors and manages the whole Application; a Master failure therefore does not affect running Applications, but it does block the submission of new ones. A configuration sketch follows.
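A minimal sketch of enabling the ZooKeeper-based recovery mode, assuming a three-node ZooKeeper quorum (the zk1/zk2/zk3 host names are hypothetical); the setting goes into spark-env.sh on every Master node:

# spark-env.sh on each Master node
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
  -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
  -Dspark.deploy.zookeeper.dir=/spark"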
1. Submitting tasks in Standalone-client mode
- Submit command
./spark-submit \
--master spark://node1:7077 \
--class org.apache.spark.examples.SparkPi \
../lib/spark-examples-1.6.0-hadoop2.6.0.jar \
1000

or

./spark-submit \
--master spark://node1:7077 \
--deploy-mode client \
--class org.apache.spark.examples.SparkPi \
../lib/spark-examples-1.6.0-hadoop2.6.0.jar \
100
- Diagram of execution principle

- Execution process
1) After a task is submitted in client mode, the Driver process is started on the client side.
2) The Driver applies to the Master for the resources needed to start the Application.
3) Once the resources are granted, the Driver sends tasks to the Worker side for execution.
4) The Workers return the task execution results to the Driver.
Summary:
client mode is suitable for testing and debugging programs. The Driver process is started on the client side, where the client is the node from which the application is submitted, so task execution can be observed on the Driver side. client mode cannot be used in production: suppose 100 applications are submitted to the cluster; the Driver is started on the client every time, which would cause the client's network card traffic to surge 100 times over.
2. Submitting tasks in Standalone-cluster mode
- Submit command
./spark-submit \
--master spark://node1:7077 \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
../lib/spark-examples-1.6.0-hadoop2.6.0.jar \
100
- Diagram of execution principle

- Execution process
1) After the application is submitted in cluster mode, a request to start the Driver is sent to the Master.
2) The Master accepts the request and starts the Driver process on a randomly chosen node in the cluster.
3) After the Driver starts, it applies for resources for the current application.
4) The Driver sends tasks to the Worker nodes for execution.
5) The Workers return the execution results to the Driver.
Summary:
The Driver process is started on one of the Workers in the cluster, so task execution cannot be observed on the client. Suppose 100 applications are submitted to the cluster: each time, the Driver is started on a randomly chosen Worker, so the network card traffic caused by those 100 submissions is spread across the cluster.
V. YARN Mode
In Standalone (independent deployment) mode, Spark provides its own computing resources without depending on any other framework. This reduces coupling with third-party resource frameworks and is very self-contained. But remember that Spark is primarily a computing framework, not a resource-scheduling framework, so the resource scheduling it provides itself is not its strength; integrating with a professional resource-scheduling framework is more reliable.
Setting up Spark on YARN mode is relatively simple: you only need to install the Spark client on one node of the YARN cluster, and that node can then serve as the client that submits Spark applications to the YARN cluster. Spark's own Master and Worker nodes do not need to be started. The prerequisite is a working YARN cluster; a configuration sketch follows.
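In practice this usually means telling the Spark client where the Hadoop/YARN configuration lives before submitting. A minimal sketch, assuming a typical installation path (the path is hypothetical):

# In spark-env.sh or the shell environment on the submitting node
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
export YARN_CONF_DIR=/opt/hadoop/etc/hadoop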
1. Submitting tasks in yarn-client mode
- Submit command
./spark-submit \
--master yarn \
--class org.apache.spark.examples.SparkPi \
../lib/spark-examples-1.6.0-hadoop2.6.0.jar \
100

or

./spark-submit \
--master yarn-client \
--class org.apache.spark.examples.SparkPi \
../lib/spark-examples-1.6.0-hadoop2.6.0.jar \
100

or

./spark-submit \
--master yarn \
--deploy-mode client \
--class org.apache.spark.examples.SparkPi \
../lib/spark-examples-1.6.0-hadoop2.6.0.jar \
100
- Diagram of execution principle

- Execution process
1) The client submits an Application and starts a Driver process locally.
2) After the application starts, it sends a request to the RS (ResourceManager) for resources to start the AM (ApplicationMaster).
3) The RS receives the request and starts the AM on a randomly chosen NM (NodeManager). The NM here is analogous to a Worker node in Standalone mode.
4) After the AM starts, it requests a batch of container resources from the RS for starting the Executors.
5) The RS finds a batch of NMs and returns them to the AM for starting the Executors.
6) The AM sends commands to the NMs to start the Executors.
7) After the Executors start, they register back with the Driver; the Driver sends tasks to the Executors, and the execution results are returned to the Driver.
Summary:
yarn-client mode is likewise suitable for testing. Because the Driver runs locally and communicates heavily with the Executors in the YARN cluster, it causes a large increase in client network card traffic.
- The role of the ApplicationMaster:
  - applies for resources for the current Application;
  - asks the NodeManagers to start the Executors.
Note: in yarn-client mode the ApplicationMaster only has the functions of launching Executors and applying for resources; it has no job-scheduling function.
2. Submitting tasks in yarn-cluster mode
- Submit command
./spark-submit \
--master yarn \
--deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
../lib/spark-examples-1.6.0-hadoop2.6.0.jar \
100

or

./spark-submit \
--master yarn-cluster \
--class org.apache.spark.examples.SparkPi \
../lib/spark-examples-1.6.0-hadoop2.6.0.jar \
100

- Diagram of execution principle

- Execution process
1) The client submits an Application and sends a request to the RS (ResourceManager) to start the AM (ApplicationMaster).
2) After receiving the request, the RS starts the AM on a randomly chosen NM (NodeManager); the AM here acts as the Driver side.
3) After the AM starts, it sends a request to the RS for a batch of containers to start the Executors.
4) The RS returns a batch of NM nodes to the AM.
5) The AM connects to those NMs and asks them to start the Executors.
6) The Executors register back with the Driver inside the AM's node, and the Driver sends tasks to the Executors.
Summary:
yarn-cluster mode is mainly used in production. Because the Driver runs inside one of the NodeManagers in the YARN cluster, and the NodeManager chosen for the Driver is random for every submitted task, no single machine suffers a surge of network card traffic. The drawback is that the task logs cannot be seen locally after submission; they can only be viewed through YARN, as shown below.
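For example, once the application has finished, its aggregated logs can be pulled back with the YARN CLI (assuming log aggregation is enabled on the cluster; the application ID below is hypothetical):

yarn logs -applicationId application_1656650000000_0001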
- The role of the ApplicationMaster:
  - applies for resources for the current Application;
  - asks the NodeManagers to start the Executors;
  - schedules tasks.