当前位置：网站首页>Flink analysis (I): basic concept analysis

Flink analysis (I): basic concept analysis

2022-07-06 17:27:00 【Stray_ Lambs】

Catalog

Flink Basic concepts

1、Job Manager

2、Task Manager

Task submission process

1、 Independent clusters (Standalone)

2、Yarn colony

Programs and data streams

Execution diagram

Data transmission form

Task chain （Operator Chains）

Reference resources

Flink Basic concepts

Currently in the real-time framework ,Flink It can be said to have a place .Flink It's a distributed system , Computing resources need to be effectively allocated and managed to execute streaming applications . It integrates all common cluster resource managers , for example Hadoop YARN、Apache Mesos and Kubernetes, But you can also set up to run as a separate cluster or even a library . also Flink It is a computing framework for flow batch integration , It can calculate finite data flow and infinite data flow with state or without state （ That is, it can be flow computing or batch computing ）.

Flink As Flow batch integration Framework , Stream processing uses DataStream, Batch processing uses DataSet. It consists of the following core components .

Flink Runtime (runtime) It consists of two types of processes ： One JobManager And one or more TaskManager. First, put a structure diagram of the official website , Get to know .

Here is a brief overview of the overall process ：

When Flink Once the cluster is started , First, a JobManager And one or more TaskManager. By the client Client Submit a task to JobManager,JobManager Then schedule tasks to each TaskManager To carry out , then TaskManager Report heartbeat and statistics to JobManager.TaskManager Data transmission in the form of stream . These are independent JVM process .

Client Submit for Job The client of , It can run on any machine , As long as it's with JobManager The environment is connected . Submit Job after ,Client You can end the process or wait for the result to return .
JobManager Mainly responsible for Client Received Job and Jar After the package and other resources , An optimized execution plan will be generated , And Task The unit is scheduled to each TaskManager To carry out .
TaskManager It is set at startup slot Slot number , Every slot Can start a Task, And Task For threads , This is the same as Spark Of fork Threads are similar . from JobManager Receive what needs to be deployed Task after , Establish with upstream Netty Connect , Receive data and process it .
Flink The communication between roles in the architecture uses Akka, such as JobManager And Client Communication between , The transmission between data uses Netty.

1、Job Manager

JobManager Is the job manager , Is a main process . Mainly responsible for ： Coordinate Distributed Computing ; Responsible for scheduling tasks ; Coordinate checkpoint; Coordinate fault recovery . It consists of the following three components ：

ResourceManager

ResourceManager be responsible for Flink Resource application in the cluster 、 Release 、 Distribute , stay HA in ,RM Also responsible for selection leader Job Manager. It manages task slots, This is a Flink The unit of resource scheduling in the cluster （ Please refer to TaskManagers）.Flink For different environmental and resource providers （ for example YARN、Mesos、Kubernetes and standalone Deploy ） The corresponding ResourceManager. stay standalone Setting up ,ResourceManager Only available... Can be assigned TaskManager Of slots, Instead of starting a new TaskManager.

Dispatcher

The dispenser ( Not necessary ).Dispatcher Provides a REST Interface (GET/PUT/DELETE/POST), To submit Flink Application execution , And start a new job for each submitted job JobMaster. It also runs Flink WebUI (localhost:8081) be used for ⽅ Conveniently display and monitor work execution ⾏ Information about .Dispatcher It may not be necessary in the architecture , It depends on ⽤ Delivery ⾏ Of ⽅ type .

JobMaster

JobMaster Responsible for managing individual JobGraph Implementation . Be responsible for receiving tasks , be responsible for JobManagerRunner（ Encapsulates the JobMaster） Start of .Flink Multiple jobs can run simultaneously in a cluster , Each assignment has its own JobMaster.

2、Task Manager

Task Manager( hereinafter referred to as TM) yes Flink Medium ⼯ Working process ( Task manager ). Usually in Flink There will be more than one TM shipment ⾏, Every time ⼀ individual Task Manager It's all about ⼀ A certain number of slots （slots）. The number of slots limits TM Be able to hold ⾏ Number of tasks . Each of them TM It's a JVM process , And every one slot It's a thread .

When a TM when ,TM Will send to Resource Manager Reverse register its slot; In the subsequent task scheduling , received RM Instructions will be provided after slot to JobManager.Job Manager You can go to slot Assign tasks to perform . In execution dataflow Medium task when ,TM Will cache and exchange data ( such as shuffle).

Task submission process

1、 Independent clusters (Standalone)

Insert picture description here

2、Yarn colony

Insert picture description here

Yarn Cluster mode ,Client Need to put jar Upload packages and related configurations to HDFS in , And to RM With Session Submit by session Job( namely jar package ). After submission , from RM In idle NodeManager Go to start ApplicationMaster As the application master node , It's a bit like Spark Medium Driver. Among them NodeManager Contains JobManger, then JobManager towards HDFS Loaded, packaged and uploaded jar Package and related configuration , Go where you need to go RM Apply to start TaskManager. When the startup completes the corresponding TaskManager after ,TM Will RM Reverse registration slot.

Programs and data streams

Flink The program consists of three parts ：Source、Transformation and Sink. among Source Responsible for reading data source ,Transformation Processing with various operators ,Sink Responsible for output .

During operation ,Flink The program that runs on will be mapped to “ Logical data flow ”(dataflows), Every dataflow With one or more source Start with one or more sinks end . every last dataflow Is similar to DAG Directed acyclic graph .

Execution diagram

Flink The execution diagram can be divided into four layers ：

StreamGraph->JobGraph->ExecutionGraph-> Physical execution diagram

StreamGraph： According to the user through Stream API The original diagram generated by the written code . Used to represent the topology of a program . stay Client In the middle of .
JobGraph：StreamGraph After optimization, it generates JobGraph, Submitted to JobManager Data structure of . The main optimization is , Multiple nodes that meet the conditions chain Together as a node . stay Client In the middle of .
ExecutionGraph：JobManager according to JobGraph Generate ExecutionGraph.ExecutionGraph yes JobGraph Parallel version of , It is the core data structure of scheduling layer . stay JobManager In the middle of .
Physical execution diagram ：JobManager according to ExecutionGraph Yes Job After scheduling , In all TaskManager Upper Department Task The diagram formed after , It's not a specific data structure .

PS： Later, I found that if you add the source code interpretation of each figure, the length is too long , I plan to put it separately in the next blog ...

Follow Spark equally ,Flink Also lazy execution , The user logic code will be in Flink After encapsulating and executing all the flow charts, start running .

every last operator( Be similar to Spark In the middle of rdd operator ) I'm going to generate the corresponding Transformation( such as Map Corresponding OneInputTransformation), Finally, it runs until StreamExecutionEnvironment.execute(), Similar to the implementation of Spark among action operator , To really execute the code , division DAG And all stages and tasks ,Flink So it starts to divide Flink Execution diagram and various task chains .

Data transmission form

And Spark Similarly ,Flink The different operators of also have a wide and narrow transmission form , It's just Flink be called One-to-one and Redistributing.

One-to-one( Don't involve shuffle The process )： It's a little similar to the feeling of being an only child , Upstream data has only one downstream receiving data .stream Maintaining the order of partitions and elements （ such as source and map Between ）. It means map The number and order of elements seen by the subtask of the operator follow source The number of elements produced by the subtask of the operator 、 Same order .map、fliter、flatMap And so on one-to-one Correspondence of .
Redistributing( involve shuffle The process )： A little similar to the feeling of multiple birth , Upstream data has multiple downstream received data .stream The partition of will change . The subtasks of each operator depend on the selected transformation Send data to different target tasks . for example ,keyBy be based on hashCode Repartition 、 and broadcast and rebalance It's going to be randomly repartitioned , These operators all cause redistribute The process , and redistribute The process is similar to Spark Medium shuffle The process .

Task chain （Operator Chains）

Flink An optimization technology called task chain is adopted , Because whether it's across NodeManager Data transmission is still cross Task The data transfer , May cause communication overhead , therefore Flink Reduce the cost of local communication under specific conditions . If the conditions are met , Then two or more operators can be connected by local forwarding .

Of the same degree of parallelism one-to-one operation ( And it needs to be in the same group ), Such operators can be linked together to form a task, The original operator becomes one of them subtask( Put a flag, Have the opportunity to write a task link source code interpretation …)

In the future, source code interpretation and principle understanding will be divided into two modules , Otherwise, the space is too large …