Spark Foundation
2022-06-13 03:13:00 【Yiliubei】
1、What is Spark?
Spark is a computing engine built specifically for big data processing.
2、Differences between Spark and MapReduce
Both are distributed computing frameworks. Spark computes primarily in memory, while MapReduce writes intermediate results to HDFS. Spark processes data more than ten times faster than MapReduce. Spark iterates over data both in memory and on disk, and uses a DAG (directed acyclic graph) to split and order the execution of tasks.
3、Spark run modes
Local: for local testing.
Standalone: Spark's built-in resource scheduling framework.
YARN
Mesos
Note: to use YARN as the resource scheduler, a framework must implement the ApplicationMaster interface. Spark implements this interface, so it can run on YARN.
4、What is an RDD?
RDD stands for Resilient Distributed Dataset.
RDD characteristics (see the sketch after this list):
1、An RDD is composed of a series of partitions.
2、A function is applied to each partition.
3、An RDD has dependencies on other RDDs.
4、A partitioner acts on key-value (KV) format RDDs.
5、An RDD provides a list of preferred computation locations (data locality).
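A minimal Scala sketch (local mode assumed; the object name RddPropertiesSketch is illustrative) that touches each of the five characteristics through the public RDD API:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object RddPropertiesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-properties").setMaster("local[2]"))

    // 1. An RDD is a list of partitions.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)
    println(pairs.getNumPartitions)                      // 4

    // 2. A function (here the mapValues closure) is applied to every partition.
    val mapped = pairs.mapValues(_ * 10)

    // 3. An RDD remembers its dependencies on its parent RDDs.
    println(mapped.dependencies)                         // OneToOneDependency on `pairs`

    // 4. A partitioner works on key-value RDDs.
    val partitioned = pairs.partitionBy(new HashPartitioner(2))
    println(partitioned.partitioner)                     // Some(HashPartitioner)

    // 5. Each partition can report preferred (data-local) locations.
    println(mapped.preferredLocations(mapped.partitions(0)))

    sc.stop()
  }
}
```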
5、Notes on RDDs
1、An RDD does not store data itself; an RDD is essentially an iterator.
2、The textFile method internally wraps MapReduce's way of reading files.
3、The data processed inside an RDD is handled as objects.
4、There is no limit on the number or size of partitions.
5、An RDD is made up of partitions, and the partitions are distributed across different nodes.
6、Ways to create an RDD
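Two common ways to create an RDD are from an in-memory collection (parallelize / makeRDD) and from external storage (textFile); an RDD can also be derived from an existing RDD through a transformation. A minimal sketch, assuming local mode and a hypothetical input path:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CreateRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("create-rdd").setMaster("local[2]"))

    // 1. From an in-memory collection, optionally specifying the number of partitions.
    val fromCollection = sc.parallelize(1 to 10, 3)
    val fromSeq        = sc.makeRDD(Seq("a", "b", "c"))

    // 2. From external storage; textFile wraps Hadoop's InputFormat-based (MR-style) file reading.
    //    "data/input.txt" is a hypothetical path.
    val fromFile = sc.textFile("data/input.txt")

    // 3. From an existing RDD, by applying a transformation.
    val derived = fromCollection.map(_ * 2)

    println(fromCollection.getNumPartitions)             // 3
    sc.stop()
  }
}
```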
7、Classification of Spark operators
RDD operators fall into two categories:
Transformation operators: map, flatMap, reduceByKey, etc. Transformations are executed lazily (also called lazy-load execution); see the sketch after this list.
Action operators: foreach, count, etc. Actions trigger the actual execution.
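A minimal sketch of the lazy-execution behaviour (local mode assumed): defining the transformation computes nothing; only the action at the end triggers a job.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyExecutionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("operators").setMaster("local[2]"))

    val nums = sc.parallelize(1 to 5)

    // Transformation: only recorded in the lineage, nothing is printed yet.
    val doubled = nums.map { n => println(s"computing $n"); n * 2 }

    // Action: triggers the job; only now do the "computing ..." lines appear.
    println(doubled.collect().mkString(", "))

    sc.stop()
  }
}
```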
8、Spark code flow
1、Create a SparkConf object.
You can set the application name.
You can set the run mode.
You can set the resources the Spark application requests.
2、Create a SparkContext object.
3、Create RDDs from the SparkContext and process them.
4、An action operator is needed to trigger execution of the transformation operators.
5、Close the Spark context object (SparkContext).
A minimal sketch of this flow is shown below.
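A word-count style sketch of the five steps above, assuming local mode and an in-memory collection in place of a real data source:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CodeFlowSketch {
  def main(args: Array[String]): Unit = {
    // 1. Create the SparkConf: application name, run mode (master), resource settings.
    val conf = new SparkConf()
      .setAppName("code-flow-sketch")
      .setMaster("local[2]")
      .set("spark.executor.memory", "1g")

    // 2. Create the SparkContext.
    val sc = new SparkContext(conf)

    // 3. Create RDDs from the context and process them with transformations.
    val counts = sc.parallelize(Seq("hello spark", "hello world"))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // 4. An action triggers the transformations above.
    counts.foreach(println)

    // 5. Close the SparkContext.
    sc.stop()
  }
}
```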
9、Persistence operators
There are three control (persistence) operators: cache, persist, and checkpoint. All of them can persist an RDD, and the unit of persistence is the partition. cache and persist are both executed lazily, so an action operator is needed to trigger them. The checkpoint operator not only persists the RDD to disk but also cuts the RDD's lineage (its dependencies on parent RDDs). A usage sketch follows.
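A minimal sketch of the three operators, assuming local mode and a hypothetical checkpoint directory:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("persist").setMaster("local[2]"))
    sc.setCheckpointDir("checkpoint-dir")   // hypothetical directory for checkpoint files

    val rdd = sc.parallelize(1 to 1000000).map(_ * 2)

    // cache: shorthand for persist(StorageLevel.MEMORY_ONLY); lazy until an action runs.
    rdd.cache()

    // persist: same idea, but the storage level can be chosen explicitly
    // (an RDD's storage level can only be set once, hence commented out here).
    // rdd.persist(StorageLevel.MEMORY_AND_DISK)

    // checkpoint: writes the RDD to reliable storage and cuts its lineage; also lazy.
    rdd.checkpoint()

    // The action below triggers the computation, the caching, and the checkpoint.
    println(rdd.count())

    sc.stop()
  }
}
```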
10、The two Standalone task-submission workflows and their characteristics
Standalone-client mode (suitable for testing and debugging):
1、The client submits the task and starts a Driver process on the client.
2、The Driver applies to the Master for resources.
3、After receiving the request, the Master starts Executors on the appropriate Worker nodes.
4、After starting, the Executors reverse-register with the Driver.
5、The Driver sends tasks to the Worker side for execution, and the Worker side returns the execution results to the Driver.
Standalone-cluster mode:
1、The client submits the application and asks the Master to start the Driver.
2、After receiving the request, the Master starts the Driver on a random node in the cluster.
3、After the Driver starts, it applies for resources.
4、The Driver sends tasks to Worker nodes for execution.
5、The Workers return the results to the Driver.
Summary: in both Standalone submission modes, the Driver's communication with the cluster covers:
- applying for resources for the application;
- distributing tasks;
- collecting results;
- monitoring task execution.
11、The two YARN task-submission modes
yarn-client mode (involves the client, the ResourceManager, and the NodeManagers):
1、The client submits an application and starts a Driver process on the client.
2、The application asks the ResourceManager (RS) to start an ApplicationMaster (AM) to request resources.
3、On receiving the request, the ResourceManager picks a NodeManager (NM) at random and starts the AM there.
4、After the AM starts, it applies to the ResourceManager for resources to start Executors.
5、The ResourceManager returns resources to the AM.
6、The AM sends commands to the NodeManagers to start Executors.
7、After the Executors start, they reverse-register with the Driver; the Driver sends tasks to the Executors, and the Executors return their execution status and results to the Driver.
yarn-cluster mode:
1、The client submits the application to the ResourceManager and requests that an AM be started.
2、The ResourceManager receives the request and starts the AM on a random NodeManager.
3、After the AM starts, it requests resources from the ResourceManager.
4、The ResourceManager returns resources to the AM.
5、The AM sends commands to the NodeManagers to start Executors.
6、The Executors reverse-register with the Driver running inside the AM; the Driver sends tasks to the Executors.
Yarn-cluster mode is mainly used in production environments.
The role of the ApplicationMaster:
It applies for resources for the current application, sends messages to the NodeManagers to start Executors, and handles task scheduling.
Command to stop a task running on the cluster: yarn application -kill applicationID
Understanding Spark's narrow and wide dependencies and the computation model
Narrow dependency: the partitions of the parent RDD and the child RDD have a one-to-one relationship, and no shuffle is produced.
Wide dependency: the partitions of the parent RDD and the child RDD have a one-to-many relationship, and a shuffle is produced.
Pipeline computation model: pipeline is a computation idea/pattern in which the operators within a stage are chained together, as in the sketch below.
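A minimal sketch (local mode assumed) contrasting a narrow dependency (mapValues, no shuffle, pipelined inside one stage) with a wide dependency (reduceByKey, which introduces a shuffle and a new stage):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DependencySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("deps").setMaster("local[2]"))

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)

    // Narrow dependency: each child partition reads exactly one parent partition.
    val mapped = pairs.mapValues(_ + 1)
    println(mapped.dependencies)        // OneToOneDependency -> no shuffle

    // Wide dependency: a child partition reads data from many parent partitions.
    val reduced = mapped.reduceByKey(_ + _)
    println(reduced.dependencies)       // ShuffleDependency -> shuffle, new stage

    // toDebugString shows the lineage with the stage (pipeline) boundaries.
    println(reduced.toDebugString)

    sc.stop()
  }
}
```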
Spark resource scheduling and task scheduling process
===== Resource scheduling =====
1、Start the Master and the standby Master (a standby Master is needed only for a highly available cluster; otherwise there is no standby Master).
2、Start the Worker nodes. After a Worker node starts successfully, it registers with the Master, which adds the Worker's information to its workers collection.
3、Submit the Application on the client side, which starts a spark-submit process. Pseudo command: spark-submit --master --deploy-mode cluster --class jarPath
4、The client applies to the Master for resources for the Driver. When the application information reaches the Master, the Master adds this Driver's request to its waitingDrivers collection.
5、When the waitingDrivers collection is not empty, the schedule() method is called. The Master looks through the workers collection and starts the Driver on a qualifying Worker node. After the Driver starts successfully, the request is removed from the waitingDrivers collection, and the client's spark-submit process shuts down.
(After the Driver starts successfully, it creates the DAGScheduler and TaskScheduler objects.)
6、Once the TaskScheduler has been created, it applies to the Master for resources for the Application. After the request reaches the Master, the application information is added to the waitingApps collection.
7、When the elements of the waitingApps collection change, the schedule() method is called. The Master looks through the workers collection and starts Executor processes on qualifying Worker nodes.
8、After the Executor processes start successfully, the application information is removed from the waitingApps collection, and the Executors reverse-register with the TaskScheduler. At this point the TaskScheduler holds the list of this batch of Executors.
===== Task scheduling =====
9、Based on the RDDs' dependencies, the job is cut into stages. Each stage is composed of a group of tasks, and each task runs in pipeline fashion.
10、The TaskScheduler distributes tasks according to data locality. (How does the TaskScheduler know where the data is? It calls the HDFS API to obtain the data blocks and their location information.)
11、The TaskScheduler distributes tasks and monitors their execution.
12、If a task fails or straggles (runs very slowly), the task is retried; by default it is retried 3 times.
13、If the task still fails after three retries, it is handed back to the DAGScheduler, which retries the failed stage (only the failed stage). By default the stage is retried 4 times.
14、The Master is then told to kill the Executors in the cluster and release the resources.