当前位置:网站首页>Flink analysis (I): basic concept analysis
Flink analysis (I): basic concept analysis
2022-07-06 17:27:00 【Stray_ Lambs】
Catalog
1、 Independent clusters (Standalone)
Flink Basic concepts
Currently in the real-time framework ,Flink It can be said to have a place .Flink It's a distributed system , Computing resources need to be effectively allocated and managed to execute streaming applications . It integrates all common cluster resource managers , for example Hadoop YARN、Apache Mesos and Kubernetes, But you can also set up to run as a separate cluster or even a library . also Flink It is a computing framework for flow batch integration , It can calculate finite data flow and infinite data flow with state or without state ( That is, it can be flow computing or batch computing ).
Flink As Flow batch integration Framework , Stream processing uses DataStream, Batch processing uses DataSet. It consists of the following core components .
Flink Runtime (runtime) It consists of two types of processes : One JobManager And one or more TaskManager. First, put a structure diagram of the official website , Get to know .
Here is a brief overview of the overall process :
When Flink Once the cluster is started , First, a JobManager And one or more TaskManager. By the client Client Submit a task to JobManager,JobManager Then schedule tasks to each TaskManager To carry out , then TaskManager Report heartbeat and statistics to JobManager.TaskManager Data transmission in the form of stream . These are independent JVM process .
- Client Submit for Job The client of , It can run on any machine , As long as it's with JobManager The environment is connected . Submit Job after ,Client You can end the process or wait for the result to return .
- JobManager Mainly responsible for Client Received Job and Jar After the package and other resources , An optimized execution plan will be generated , And Task The unit is scheduled to each TaskManager To carry out .
- TaskManager It is set at startup slot Slot number , Every slot Can start a Task, And Task For threads , This is the same as Spark Of fork Threads are similar . from JobManager Receive what needs to be deployed Task after , Establish with upstream Netty Connect , Receive data and process it .
- Flink The communication between roles in the architecture uses Akka, such as JobManager And Client Communication between , The transmission between data uses Netty.
1、Job Manager
JobManager Is the job manager , Is a main process . Mainly responsible for : Coordinate Distributed Computing ; Responsible for scheduling tasks ; Coordinate checkpoint; Coordinate fault recovery . It consists of the following three components :
- ResourceManager
ResourceManager be responsible for Flink Resource application in the cluster 、 Release 、 Distribute , stay HA in ,RM Also responsible for selection leader Job Manager. It manages task slots, This is a Flink The unit of resource scheduling in the cluster ( Please refer to TaskManagers).Flink For different environmental and resource providers ( for example YARN、Mesos、Kubernetes and standalone Deploy ) The corresponding ResourceManager. stay standalone Setting up ,ResourceManager Only available... Can be assigned TaskManager Of slots, Instead of starting a new TaskManager.
- Dispatcher
The dispenser ( Not necessary ).Dispatcher Provides a REST Interface (GET/PUT/DELETE/POST), To submit Flink Application execution , And start a new job for each submitted job JobMaster. It also runs Flink WebUI (localhost:8081) be used for ⽅ Conveniently display and monitor work execution ⾏ Information about .Dispatcher It may not be necessary in the architecture , It depends on ⽤ Delivery ⾏ Of ⽅ type .
- JobMaster
JobMaster Responsible for managing individual JobGraph Implementation . Be responsible for receiving tasks , be responsible for JobManagerRunner( Encapsulates the JobMaster) Start of .Flink Multiple jobs can run simultaneously in a cluster , Each assignment has its own JobMaster.
2、Task Manager
Task Manager( hereinafter referred to as TM) yes Flink Medium ⼯ Working process ( Task manager ). Usually in Flink There will be more than one TM shipment ⾏, Every time ⼀ individual Task Manager It's all about ⼀ A certain number of slots (slots). The number of slots limits TM Be able to hold ⾏ Number of tasks . Each of them TM It's a JVM process , And every one slot It's a thread .
When a TM when ,TM Will send to Resource Manager Reverse register its slot; In the subsequent task scheduling , received RM Instructions will be provided after slot to JobManager.Job Manager You can go to slot Assign tasks to perform . In execution dataflow Medium task when ,TM Will cache and exchange data ( such as shuffle).
Task submission process
1、 Independent clusters (Standalone)
2、Yarn colony
Yarn Cluster mode ,Client Need to put jar Upload packages and related configurations to HDFS in , And to RM With Session Submit by session Job( namely jar package ). After submission , from RM In idle NodeManager Go to start ApplicationMaster As the application master node , It's a bit like Spark Medium Driver. Among them NodeManager Contains JobManger, then JobManager towards HDFS Loaded, packaged and uploaded jar Package and related configuration , Go where you need to go RM Apply to start TaskManager. When the startup completes the corresponding TaskManager after ,TM Will RM Reverse registration slot.
Programs and data streams
Flink The program consists of three parts :Source、Transformation and Sink. among Source Responsible for reading data source ,Transformation Processing with various operators ,Sink Responsible for output .
During operation ,Flink The program that runs on will be mapped to “ Logical data flow ”(dataflows), Every dataflow With one or more source Start with one or more sinks end . every last dataflow Is similar to DAG Directed acyclic graph .
Execution diagram
Flink The execution diagram can be divided into four layers :
StreamGraph->JobGraph->ExecutionGraph-> Physical execution diagram
- StreamGraph: According to the user through Stream API The original diagram generated by the written code . Used to represent the topology of a program . stay Client In the middle of .
- JobGraph:StreamGraph After optimization, it generates JobGraph, Submitted to JobManager Data structure of . The main optimization is , Multiple nodes that meet the conditions chain Together as a node . stay Client In the middle of .
- ExecutionGraph:JobManager according to JobGraph Generate ExecutionGraph.ExecutionGraph yes JobGraph Parallel version of , It is the core data structure of scheduling layer . stay JobManager In the middle of .
- Physical execution diagram :JobManager according to ExecutionGraph Yes Job After scheduling , In all TaskManager Upper Department Task The diagram formed after , It's not a specific data structure .
PS: Later, I found that if you add the source code interpretation of each figure, the length is too long , I plan to put it separately in the next blog ...
Follow Spark equally ,Flink Also lazy execution , The user logic code will be in Flink After encapsulating and executing all the flow charts, start running .
every last operator( Be similar to Spark In the middle of rdd operator ) I'm going to generate the corresponding Transformation( such as Map Corresponding OneInputTransformation), Finally, it runs until StreamExecutionEnvironment.execute(), Similar to the implementation of Spark among action operator , To really execute the code , division DAG And all stages and tasks ,Flink So it starts to divide Flink Execution diagram and various task chains .
Data transmission form
And Spark Similarly ,Flink The different operators of also have a wide and narrow transmission form , It's just Flink be called One-to-one and Redistributing.
- One-to-one( Don't involve shuffle The process ): It's a little similar to the feeling of being an only child , Upstream data has only one downstream receiving data .stream Maintaining the order of partitions and elements ( such as source and map Between ). It means map The number and order of elements seen by the subtask of the operator follow source The number of elements produced by the subtask of the operator 、 Same order .map、fliter、flatMap And so on one-to-one Correspondence of .
- Redistributing( involve shuffle The process ): A little similar to the feeling of multiple birth , Upstream data has multiple downstream received data .stream The partition of will change . The subtasks of each operator depend on the selected transformation Send data to different target tasks . for example ,keyBy be based on hashCode Repartition 、 and broadcast and rebalance It's going to be randomly repartitioned , These operators all cause redistribute The process , and redistribute The process is similar to Spark Medium shuffle The process .
Task chain (Operator Chains)
Flink An optimization technology called task chain is adopted , Because whether it's across NodeManager Data transmission is still cross Task The data transfer , May cause communication overhead , therefore Flink Reduce the cost of local communication under specific conditions . If the conditions are met , Then two or more operators can be connected by local forwarding .
- Of the same degree of parallelism one-to-one operation ( And it needs to be in the same group ), Such operators can be linked together to form a task, The original operator becomes one of them subtask( Put a flag, Have the opportunity to write a task link source code interpretation …)
In the future, source code interpretation and principle understanding will be divided into two modules , Otherwise, the space is too large …
Reference resources
0006-Flink principle (Flink Data flow & Execution diagram ) - Programmer base
边栏推荐
- Only learning C can live up to expectations Top1 environment configuration
- 信息与网络安全期末复习(完整版)
- 03个人研发的产品及推广-计划服务配置器V3.0
- Serial serialold parnew of JVM garbage collector
- 关于Stream和Map的巧用
- Display picture of DataGridView cell in C WinForm
- 04 products and promotion developed by individuals - data push tool
- DataGridView scroll bar positioning in C WinForm
- Instructions for Redux
- JUnit unit test
猜你喜欢
Activit fragmented deadly pit
Take you hand-in-hand to do intensive learning experiments -- knock the level in detail
ByteDance overseas technical team won the championship again: HD video coding has won the first place in 17 items
JVM 垃圾回收器之Serial SerialOld ParNew
Models used in data warehouse modeling and layered introduction
Wu Jun's trilogy experience (VII) the essence of Commerce
06个人研发的产品及推广-代码统计工具
06 products and promotion developed by individuals - code statistical tools
华为认证云计算HICA
05 personal R & D products and promotion - data synchronization tool
随机推荐
基于LNMP部署flask项目
Junit单元测试
Akamai浅谈风控原理与解决方案
06 products and promotion developed by individuals - code statistical tools
JVM 垃圾回收器之Garbage First
About selenium starting Chrome browser flash back
Wu Jun's trilogy experience (VII) the essence of Commerce
Idea breakpoint debugging skills, multiple dynamic diagram package teaching package meeting.
[VNCTF 2022]ezmath wp
Take you hand-in-hand to do intensive learning experiments -- knock the level in detail
Final review of information and network security (based on the key points given by the teacher)
Based on infragistics Document. Excel export table class
Flink源码解读(三):ExecutionGraph源码解读
Coursera cannot play video
Flink 解析(四):恢复机制
Display picture of DataGridView cell in C WinForm
Control transfer instruction
[mmdetection] solves the installation problem
Detailed explanation of data types of MySQL columns
關於Stream和Map的巧用