Spark source code reading outline

2022-07-01 13:18:00 【Interest1_ wyt】

I have used Spark for quite a while and have a rough understanding of concepts such as driver, master, worker, BlockManager, RDD, DAGScheduler, and TaskScheduler. What I lack is a systematic, in-depth understanding of how the following are implemented: task submission; scheduling and registration of the driver and application; resource allocation; executor creation; the splitting of a job into stages and then into tasks; reading and writing HDFS file data; the map and reduce operations of RDDs themselves; and persistence and checkpointing for high reliability and fault tolerance. So I plan to write a series of articles that explores these questions from the source code, to deepen my understanding of Spark. The list here covers only my current interests; if I later want to study other parts of the source, they will be added to this outline and to the rest of the series.

The version installed in my virtual machine is Spark 3.0.1, so that is the source code I downloaded. Because the Spark code base is large, I first listed the questions I am interested in, so that I can read efficiently without getting lost. The source reading focuses on answering these questions:

1. How does spark-submit submit the jar and configuration parameters to the Spark server?

2. How does Spark start the driver, register the application, and assemble the executor launch command?

3. How does Spark schedule the driver and the executors (application), and how do executors register with the driver?

4. How is an executor created on a worker, and what is it in essence? Is it a thread pool?

5. How do DAGScheduler and TaskScheduler cooperate to submit tasks, and how do application, job, stage, taskset, and task correspond to one another?

6. How does Spark use BlockManager to control data reads and writes? (to be written)

7. Persistence, caching, and checkpoint: functional differences and principles (to be written)

Various terms come up while reading the source code. They are introduced here in the outline:

Master: the primary node of a Spark cluster. It manages resource scheduling for the other nodes.

Worker: a working node of a Spark cluster, managed by the master. It creates executors and assigns them a share of its resources.

Executor: the lowest-level worker in Spark, in essence a thread pool. It is created by a worker, which also allocates its resources. The work it executes is distributed by the driver.

Driver: the application code submitted by the user runs in Spark as a driver. It is a special executor process: besides the runtime environment that every ordinary executor has, it also runs DAGScheduler, TaskScheduler, SchedulerBackend, and so on.

Application: the general name for a unit of work submitted by the user.

Job: a computation job triggered by an action and made up of one or more stages.

DAGScheduler: builds the DAG of stages for a job. It splits stages according to whether a shuffle occurs, then submits each stage to TaskScheduler for further splitting into tasks.

Stage: the level of granularity directly below a job, split out by DAGScheduler according to whether a shuffle occurs.

TaskScheduler: receives stages from DAGScheduler, converts each one into a taskset (a taskset has the same content as its stage), and finally dispatches the taskset to executors for processing.

TaskSet: the level of granularity directly below a stage, generated by TaskScheduler. Its content is the same as the stage's. It is generated according to the number of data partitions: a taskset contains as many tasks as there are partitions. The main purpose of converting a stage into a taskset is to increase the parallelism of data processing.

Task: an element of a TaskSet and the smallest executable unit of work. It is scheduled and executed by an executor.

RDD: resilient distributed dataset. It can be understood simply as a dataset.

BlockManager: the file manager in Spark. It manages the I/O operations for reading and writing data.

CheckPoint: persists data to HDFS or a similar distributed file system. Even if the locally persisted data is lost, it can still be fetched from HDFS, which improves the availability and fault tolerance of the system.
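The relationship between job, stage, taskset, and task described above can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark code: the `Op` class, `split_into_stages`, and `to_task_set` are hypothetical names invented for illustration, and the lineage is a simple linear chain. Spark itself walks the RDD dependency graph backward from the final RDD, so this forward scan is a simplification.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    shuffle: bool = False  # True if this operation needs a shuffle of its input


def split_into_stages(lineage):
    """Cut a linear chain of operations into stages at every shuffle boundary."""
    stages, current = [], []
    for op in lineage:
        if op.shuffle and current:
            stages.append(current)  # close the stage that feeds the shuffle
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages


def to_task_set(stage_id, stage, num_partitions):
    """Build one task per partition; every task carries the same stage logic."""
    names = [op.name for op in stage]
    return [
        {"stage": stage_id, "partition": p, "ops": names}
        for p in range(num_partitions)
    ]


# Hypothetical WordCount lineage; only reduceByKey requires a shuffle.
word_count = [
    Op("textFile"),
    Op("flatMap"),
    Op("map"),
    Op("reduceByKey", shuffle=True),
]

stages = split_into_stages(word_count)
task_sets = [to_task_set(i, s, num_partitions=3) for i, s in enumerate(stages)]

for i, (stage, tasks) in enumerate(zip(stages, task_sets)):
    print(f"stage {i}: {[op.name for op in stage]} -> {len(tasks)} tasks")
```

With three partitions, the model yields two stages (everything before the shuffle, then `reduceByKey`) and three tasks per taskset, which matches the "one task per partition" rule in the TaskSet entry.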

In addition, all source tracing in this series is done by remote-debugging a WordCount program. For the details of that WordCount program and the remote debugging setup, see this article: "IDEA remote debugging of a jar submitted with spark-submit" (Interest1_wyt's blog on CSDN).
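For orientation, the shape of a WordCount computation can be imitated without Spark. The sketch below is a toy in plain Python, not the WordCount program from the linked article and not Spark's implementation: the input lines, `map_task`, `reduce_task`, and `NUM_REDUCERS` are all assumptions made for illustration. A thread pool stands in for an executor, each submitted function call stands in for a task, and the regrouping step between the two stages stands in for the shuffle.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input, already split into two partitions.
partitions = [
    ["spark reads source", "spark runs tasks"],
    ["tasks run in executor"],
]
NUM_REDUCERS = 2


def map_task(lines):
    """Stage 0 task: count words in one partition, bucketed by target reducer."""
    buckets = [Counter() for _ in range(NUM_REDUCERS)]
    for line in lines:
        for word in line.split():
            # Sum of byte values: a deterministic stand-in for a hash partitioner.
            buckets[sum(word.encode()) % NUM_REDUCERS][word] += 1
    return buckets


def reduce_task(shuffled):
    """Stage 1 task: merge the partial counts routed to one reducer."""
    total = Counter()
    for partial in shuffled:
        total.update(partial)
    return total


# The pool stands in for an executor: each submitted function call is one task.
with ThreadPoolExecutor(max_workers=2) as executor:
    map_outputs = list(executor.map(map_task, partitions))
    # "Shuffle": reducer r collects bucket r from every map task.
    shuffled = [[out[r] for out in map_outputs] for r in range(NUM_REDUCERS)]
    reduce_outputs = list(executor.map(reduce_task, shuffled))

word_counts = Counter()
for part in reduce_outputs:
    word_counts.update(part)

print(dict(word_counts))
```

The first stage runs one task per input partition and the second runs one task per reducer. Because every occurrence of a word is routed to the same reducer, each reducer can produce final counts for its words on its own.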

Copyright notice
This article was written by [Interest1_ wyt]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/182/202207011300271710.html
