Spark source code reading outline
2022-07-01 13:18:00 【Interest1_wyt】
I have used Spark for a long time and have a rough understanding of concepts such as driver, master, worker, BlockManager, RDD, DAGScheduler and TaskScheduler. What I lack is a systematic, in-depth understanding of how they are implemented: task submission; the scheduling and registration of drivers and applications; resource allocation; executor creation; the splitting of a job into stages and then into tasks; reading and writing file data on HDFS; the map and reduce operations of RDDs; and the persistence and checkpoint mechanisms behind Spark's reliability and fault tolerance. So I plan to write a series of articles that explores these questions from the source code, in order to deepen my understanding of Spark. This outline lists only what interests me at the moment. If I later want to study other parts of the source code, I will add them to this outline and to the rest of the series.
The version installed in my virtual machine is Spark 3.0.1, so that is the version of the source code I downloaded. Because the Spark code base is large, I first listed the questions I am interested in, so that the reading stays efficient and focused on answering them. The questions are:
1. How does spark-submit submit the jar and the configuration parameters to the Spark cluster?
2. How does Spark start the driver, register the application, and assemble the command that launches an executor?
3. How does Spark schedule drivers and executors (applications), and how do executors register with the driver?
4. How is an executor created on a worker, and what is it in essence? Is it a thread pool?
5. How do DAGScheduler and TaskScheduler cooperate when submitting tasks, and how do application, job, stage, TaskSet and task correspond to one another?
6. How does Spark use BlockManager to control data reads and writes? (to be written)
7. Persistence, cache and checkpoint: functional differences and principles (to be written)
Various terms come up while reading the source code. They are introduced here in the outline:
Master: the primary node of a Spark cluster. It manages resource scheduling for the other nodes.
Worker: a working node of a Spark cluster, managed by the master. It creates executors and allocates a certain amount of resources to them.
Executor: Spark's lowest-level worker thread pool, created and given resources by a worker. The work it executes is assigned by the driver.
Driver: the application code submitted by a user runs in Spark as a driver. It can be viewed as a special executor process: besides the runtime environment that ordinary executors also have, it runs DAGScheduler, TaskScheduler, SchedulerBackend and other components.
Application: the general term for a unit of work submitted by a user.
Job: a computation triggered by an action and made up of one or more stages.
DAGScheduler: builds the DAG of stages for a job. It cuts the job into stages wherever a shuffle occurs, then submits each stage to the TaskScheduler as a TaskSet.
Stage: the level of granularity below a job, produced by DAGScheduler by splitting the job at shuffle boundaries.
TaskScheduler: receives the TaskSet of each stage from DAGScheduler (a TaskSet has the same content as its stage) and finally dispatches the tasks in it to executors for processing.
TaskSet: the level of granularity below a stage. Its content is the same as the stage's, and it is built from the number of data partitions: there are as many tasks in the TaskSet as there are partitions. In the Spark source the TaskSet is constructed by DAGScheduler and handed to TaskScheduler. The main purpose of turning a stage into a TaskSet is to process the data in parallel.
Task: an element of a TaskSet and the smallest unit of executable work, scheduled and executed on an executor.
RDD: resilient distributed dataset. It can be understood simply as a data set.
BlockManager: the storage manager in Spark. It manages the I/O operations that read and write data in Spark.
CheckPoint: persists data to HDFS or a similar distributed file system, so that even if the locally persisted data is lost it can still be read back from HDFS. This improves the availability and fault tolerance of the system.
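To make the job, stage, TaskSet and task hierarchy from the glossary concrete, here is a minimal pure-Python sketch. It is not Spark code: the `Op`, `Stage` and `Task` classes and the two helper functions are hypothetical names invented for illustration, and it assumes a simple linear lineage in which every stage has the same number of partitions (in real Spark the partition count can differ from stage to stage).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Op:
    """One transformation in a linear lineage (hypothetical model, not a Spark class)."""
    name: str
    shuffle: bool = False  # True if this operation needs its input to be shuffled


@dataclass
class Stage:
    stage_id: int
    ops: List[str]


@dataclass
class Task:
    stage_id: int
    partition: int


def split_into_stages(lineage: List[Op]) -> List[Stage]:
    """Cut the lineage into stages: a new stage starts at every shuffle operation."""
    stages: List[Stage] = []
    current: List[str] = []
    for op in lineage:
        if op.shuffle and current:
            stages.append(Stage(len(stages), current))
            current = []
        current.append(op.name)
    if current:
        stages.append(Stage(len(stages), current))
    return stages


def build_task_set(stage: Stage, num_partitions: int) -> List[Task]:
    """A task set holds one task per data partition of the stage."""
    return [Task(stage.stage_id, p) for p in range(num_partitions)]


# A WordCount-like lineage: only reduceByKey requires a shuffle.
lineage = [Op("textFile"), Op("flatMap"), Op("map"), Op("reduceByKey", shuffle=True)]
stages = split_into_stages(lineage)
task_sets = [build_task_set(stage, num_partitions=3) for stage in stages]

for stage, task_set in zip(stages, task_sets):
    print(stage.stage_id, stage.ops, len(task_set))
```

With this lineage the job is cut into two stages at `reduceByKey`, and each stage yields a task set of three tasks, one per partition.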
In addition, the source code tracing throughout this series is based on remote debugging of a WordCount program. For the details of that WordCount program and of the remote debugging setup, see this article on Interest1_wyt's CSDN blog: "IDEA remote debugging of a jar submitted by spark-submit".
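The actual WordCount program is in the referenced article and is not reproduced here. As a rough stand-in, the pure-Python sketch below mimics what such a job computes: one map task per input partition, a shuffle that routes each word to a reducer, and one reduce task per reducer. Everything in it is an assumption made for illustration: the function names are hypothetical, a `ThreadPoolExecutor` only loosely plays the role of an executor's thread pool, and `zlib.crc32` is just a stable way to pick a reducer, not the partitioner Spark uses.

```python
import zlib
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def map_task(lines: List[str]) -> Counter:
    """First-stage task: split the lines of one partition into words and count them."""
    return Counter(word for line in lines for word in line.split())


def shuffle(partial_counts: List[Counter], num_reducers: int) -> List[Dict[str, List[int]]]:
    """Route every (word, count) pair to the reducer chosen by a stable hash of the word."""
    buckets: List[Dict[str, List[int]]] = [defaultdict(list) for _ in range(num_reducers)]
    for counts in partial_counts:
        for word, n in counts.items():
            reducer = zlib.crc32(word.encode("utf-8")) % num_reducers
            buckets[reducer][word].append(n)
    return buckets


def reduce_task(bucket: Dict[str, List[int]]) -> Dict[str, int]:
    """Second-stage task: sum the partial counts of every word in one bucket."""
    return {word: sum(ns) for word, ns in bucket.items()}


def word_count(partitions: List[List[str]], num_reducers: int = 2) -> Dict[str, int]:
    # The thread pool stands in for an executor that runs the tasks of each stage.
    with ThreadPoolExecutor(max_workers=4) as pool:
        partial = list(pool.map(map_task, partitions))   # stage 1: one task per partition
        buckets = shuffle(partial, num_reducers)          # shuffle boundary between stages
        reduced = list(pool.map(reduce_task, buckets))    # stage 2: one task per reducer
    result: Dict[str, int] = {}
    for part in reduced:
        result.update(part)  # each word lives in exactly one bucket, so no key collides
    return result


counts = word_count([["spark is fast", "spark is fun"], ["hdfs is storage"]])
print(counts)
```

The two `pool.map` calls correspond to the two stages on either side of the shuffle, which is the boundary the later articles trace in the real source code.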