Spark source code reading outline
2022-07-01 13:18:00 【Interest1_wyt】
I have used Spark for a long time and have a rough understanding of concepts such as driver, master, worker, BlockManager, RDD, DAGScheduler, and TaskScheduler. What I lack is a systematic, in-depth understanding of how the following are implemented: task submission; scheduling and registration of drivers and applications; resource allocation; executor creation; the splitting of a job into stages and then into tasks; reading and writing HDFS file data; the RDD's own map and reduce operations; and the persistence and checkpoint mechanisms behind high reliability and fault tolerance. So I plan to write a series of articles that explore these questions from the source code, in order to deepen my understanding of Spark. This is only a list of what interests me right now. If I later want to study other parts of the source, I will add them to this outline and to the series.
My virtual machine has Spark 3.0.1 installed, so that is the version of the source code I downloaded. Because the Spark code base is large, I first listed the questions I am interested in, so that the reading stays efficient and focused on answering them. The questions are:
1. How spark-submit submits the jar and configuration parameters to the Spark server
2. How Spark starts the driver, registers the application, and assembles the executor launch command
3. How Spark schedules drivers and executors (applications), and how executors register with the driver
4. How an executor is created on a worker, and what it essentially is: is it a thread pool?
5. How DAGScheduler and TaskScheduler cooperate to submit tasks, and how application, job, stage, taskset, and task correspond to each other
6. How Spark uses BlockManager to control data reads and writes (to be written)
7. The functional differences between persistence, caching, and checkpoint, and how they work (to be written)
Various terms come up while reading the source, so they are introduced here in the outline:
Master: The primary node of a Spark cluster. It manages resource scheduling for the other nodes.
Worker: A working node of a Spark cluster, managed by the master. It creates executors and allocates resources to them.
Executor: The lowest-level unit of work in Spark, roughly a pool of worker threads. It is created and given resources by a worker, and the work it runs is assigned by the driver.
Driver: The application code submitted by the user runs in Spark as a driver. It can be thought of as a special executor-like process: besides the runtime environment that every executor has, it also runs the DAGScheduler, TaskScheduler, SchedulerBackend, and so on.
Application: The general term for a program that a user submits for execution.
Job: A computation triggered by an action, made up of one or more stages.
DAGScheduler: Builds the DAG of stages for a job, cutting a new stage wherever a shuffle occurs. It submits each stage to the TaskScheduler as a TaskSet.
Stage: The level of granularity below a job. The DAGScheduler splits a job into stages based on whether a shuffle occurs.
TaskScheduler: Receives the TaskSet for each stage from the DAGScheduler and dispatches its tasks to executors for execution.
TaskSet: The level of granularity below a stage. Its content is the same as the stage's, divided by data partition: a stage over N partitions yields a TaskSet with N tasks. In the Spark source the TaskSet is constructed in the DAGScheduler when a stage is submitted and is then handed to the TaskScheduler. Turning a stage into a TaskSet is mainly what allows the data to be processed in parallel.
Task: An element of a TaskSet and the smallest unit of execution. It is run by an executor.
RDD: Resilient Distributed Dataset. It can be understood simply as a data set.
BlockManager: The storage manager in Spark. It manages the I/O for reading and writing Spark's data.
CheckPoint: Persists data to HDFS or a similar distributed file system. Even if the locally persisted data is lost, it can still be read back from HDFS, which improves the system's availability and fault tolerance.
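The relationship between job, stage, TaskSet, and task described in this glossary can be sketched with a small toy model. This is plain Python, not Spark code: the names `Op`, `Stage`, `split_into_stages`, and `build_taskset` are hypothetical, the job is simplified to a linear chain of operations (real Spark works on a DAG of RDD dependencies), and a new stage is simply started at each operation marked as needing a shuffle.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Op:
    """One operation in a job (hypothetical model, not a Spark class)."""
    name: str
    shuffle: bool = False  # True if this operation requires a shuffle of its input


@dataclass
class Stage:
    stage_id: int
    ops: List[str] = field(default_factory=list)


def split_into_stages(ops: List[Op]) -> List[Stage]:
    """Cut a linear chain of operations into stages at shuffle boundaries."""
    stages = [Stage(0)]
    for op in ops:
        # A shuffle operation starts a new stage, unless the current one is still empty.
        if op.shuffle and stages[-1].ops:
            stages.append(Stage(len(stages)))
        stages[-1].ops.append(op.name)
    return stages


def build_taskset(stage: Stage, num_partitions: int) -> List[Tuple[int, int]]:
    """One task per partition; a task is modelled as (stage_id, partition_index)."""
    return [(stage.stage_id, p) for p in range(num_partitions)]


# A WordCount-like job: only reduceByKey needs a shuffle.
job = [
    Op("textFile"),
    Op("flatMap"),
    Op("map"),
    Op("reduceByKey", shuffle=True),
    Op("collect"),
]

stages = split_into_stages(job)
tasksets = [build_taskset(s, num_partitions=3) for s in stages]

for s, ts in zip(stages, tasksets):
    print(s.stage_id, s.ops, ts)
```

In this model the job splits into two stages, and with three partitions each stage yields a TaskSet of three tasks, which is the "one task per partition" rule from the TaskSet entry.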
In addition, all of the source code tracing in this series is done by remotely debugging a WordCount program. For the details of that WordCount program and how to set up remote debugging, see this article: "IDEA remote debugging of a jar submitted with spark-submit" (Interest1_wyt's blog on CSDN).
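The actual WordCount program is the one in the linked article and is not reproduced here. As a rough, Spark-free stand-in for the kind of computation being traced, the sketch below counts words per partition on a thread pool and then merges the partial results. The function names and the sample data are made up for illustration; the thread pool only loosely plays the role of an executor running one task per partition, and the final merge stands in for the shuffle and reduce step.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def count_partition(lines: List[str]) -> Counter:
    """Task body: count the words in a single partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def word_count(partitions: List[List[str]], num_threads: int = 2) -> Dict[str, int]:
    # The thread pool loosely models an executor: one task per partition.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(count_partition, partitions))
    # Merging the per-partition counts stands in for the shuffle + reduce step.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


# Hypothetical input: three partitions of text lines.
partitions = [["a b a"], ["b c"], ["a"]]
result = word_count(partitions)
print(result)
```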