A first look at Flink, a mainstream real-time stream processing framework
2022-07-01 09:44:00 【InfoQ】
Summary
Apache Flink is an open source stream processing framework developed by the Apache Software Foundation. Its core is a distributed streaming dataflow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner, and its pipelined runtime can execute both batch and stream processing programs. In addition, the Flink runtime natively supports the execution of iterative algorithms. ==Baidu Encyclopedia==
Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink is designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale. ==Apache Flink is a distributed, high-performance, open source stream processing framework for building always-available and accurate streaming applications.==

Features
- Low-latency real-time stream processing
- Code is easy to write: Flink is one of the more recent generations of general-purpose big data frameworks, and compared with older generations it is widely used and easy to use.
- Support for large, complex state: hundreds of GB or more of state can be stored.
- Support for large-scale distributed deployment: Flink has its own Standalone cluster mode and can also be deployed on YARN and Kubernetes (K8s).
- Fast iteration speed
- Accurate results and good fault tolerance
Typical usage scenarios
- Plenty of machine resources: at least 24 CPU cores and hundreds of GB of memory are available, and the disks of the Flink machines must be SSDs
- High throughput, or a need to scale in the future: a rate of ten thousand per second only barely counts as large, while one hundred thousand per second clearly does
- Complex requirements: a lot of complex cleaning, deduplication, transformation, and so on
- Very demanding low-latency requirements: a delay of under 10 seconds counts as low latency, and a requirement of under 1 second needs to be handled very carefully
Event-driven applications
An event-driven application is a kind of stateful application. It ingests data from one or more event streams and, based on the incoming events, triggers computations, state updates, or other external actions. Typical examples are message queues, represented by Kafka, almost all of which are event-driven applications.
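The pattern described above can be sketched in plain Python. This is not Flink code: the `SpendingMonitor` class, the `(account, amount)` event shape, and the limit of 100 are all hypothetical names and values chosen for illustration. It only shows how each incoming event triggers a state update and, when a condition is met, an action.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SpendingMonitor:
    """Hypothetical stateful handler: keeps a running total per account."""
    limit: float
    totals: dict = field(default_factory=lambda: defaultdict(float))
    alerts: list = field(default_factory=list)

    def on_event(self, account: str, amount: float) -> None:
        # State update triggered by the event.
        self.totals[account] += amount
        # Action triggered by the event (here: record an alert).
        if self.totals[account] > self.limit:
            self.alerts.append((account, self.totals[account]))


monitor = SpendingMonitor(limit=100.0)
# Events are handled one by one, in arrival order.
for account, amount in [("a", 60.0), ("b", 30.0), ("a", 50.0)]:
    monitor.on_event(account, amount)
```

In a real deployment the events would arrive from a source such as a message queue rather than from an in-memory list, and the state would need to be stored fault-tolerantly; the sketch leaves both out.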

Stream processing and batch processing
Stream processing and batch processing are two different ways to process data. Next, let's look at the differences between the two in detail.
Batch processing
Batch processing is characterized by data that is bounded, persistent, and large in volume. It is ideal for computations that need access to a full set of records and is generally used for offline statistics. In other words, the trigger point of batch processing is independent of the data: it is triggered on a timer, or once a certain quantity is reached, or after a table or a set of files has been imported.
Stream processing
Stream processing is characterized by data that is unbounded and real-time. It does not perform operations on the entire data set; instead, it performs operations on each data item that passes through the system, and it is generally used for real-time statistics. In other words, the trigger point of stream processing depends on the data. It is an event-driven architecture: immediately after a piece of data is received, each part inspects the trigger-related information and performs its processing, for example based on an offset, a time, or a specific field value meeting a condition.
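As a rough illustration of the two styles, here is a minimal sketch in plain Python (not Flink code; the function names and the sample readings are assumptions made for this example). The batch version needs the complete record set before it can compute anything, while the streaming version updates and emits a result for every item as it arrives.

```python
from typing import Iterable, Iterator, Sequence


def batch_average(records: Sequence[float]) -> float:
    """Batch style: needs the full, bounded data set before computing."""
    return sum(records) / len(records)


def streaming_average(records: Iterable[float]) -> Iterator[float]:
    """Stream style: updates and emits a result for every arriving item."""
    count, total = 0, 0.0
    for value in records:
        count += 1
        total += value
        yield total / count


readings = [10.0, 20.0, 30.0]
batch_result = batch_average(readings)               # one result at the end
stream_results = list(streaming_average(readings))   # one result per item
```

The last value emitted by the streaming version equals the batch result; the difference is that intermediate results are available immediately instead of only after all input has been read.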
Differences between the two
- Data timeliness
Stream computing is real-time and low-latency.
Batch processing is not real-time and has high latency.
- Data characteristics
The data in stream computing is generally dynamic and has no boundary.
The data in batch processing is generally static.
- Application scenarios
Stream computing is applied in real-time scenarios with high timeliness requirements, such as real-time recommendation and business monitoring.
Batch processing is applied in offline computing scenarios that do not require high real-time performance, such as data analysis and offline reports.
- Operation mode
A stream computing task runs continuously.
A batch task is a one-time job or a series of one-time jobs.
- Processing efficiency
Stream computing is generally less efficient. Every individual record must be processed in full, and compensating operations for out-of-order data and state may even be needed. A lot of computing resources must be kept ready around the clock, although flexible planning and scheduling can greatly alleviate this problem.
Batch processing is computationally efficient. Large amounts of data are processed quickly in one go, with many optimizations such as compression and SIMD available, so performance can easily be orders of magnitude higher than that of stream computing. It runs on demand, and when it is not running it consumes no computing resources.
Data processing in Flink
In the world of Flink, all data is made up of streams, and any type of data is generated as a stream of events. Credit card transactions, sensor measurements, machine logs, and user interactions on websites or mobile applications are all generated as streams. Offline data is a stream with boundaries and real-time data is a stream without boundaries; these are called bounded streams and unbounded streams.
Unbounded streams
An unbounded stream has a beginning but no defined end. It does not terminate, and it provides data as the data is generated. Unbounded streams must be processed continuously, that is, events must be handled promptly after they are ingested. It is not possible to wait for all input data to arrive, because the input is unbounded and will not be complete at any point in time. Processing unbounded data often requires that events be ingested in a specific order (for example, the order in which the events occurred) so that the completeness of the results can be reasoned about.
==An unbounded data stream is data that has a beginning but no end: once data starts being generated, new data keeps arriving, so the data has no time boundary. Unbounded data streams need to be processed continuously.==
Bounded streams
A bounded stream has a defined beginning and end. It can be processed by ingesting all the data before performing any computation. Processing bounded streams does not require ordered ingestion, because a bounded data set can always be sorted. The processing of bounded streams is also called batch processing.
==A bounded data stream is input data that has a beginning and an end. For example, the data might be one minute's or one day's worth of transaction data.==
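The contrast can be sketched with plain Python generators (again, not Flink code; the two generator functions and their sample values are hypothetical). A bounded stream can be collected and sorted before any computation, while an unbounded stream can only be processed incrementally, one event at a time.

```python
import itertools
from typing import Iterator


def bounded_stream() -> Iterator[int]:
    """A bounded stream: defined start and end (e.g. one day's records)."""
    yield from [5, 3, 8, 1]


def unbounded_stream() -> Iterator[int]:
    """An unbounded stream: has a start but never terminates."""
    yield from itertools.count(start=1)


# Bounded input can be fully collected and sorted before any computation.
sorted_batch = sorted(bounded_stream())

# Unbounded input cannot be collected, so each event is processed as it
# arrives. islice only limits how long this demo observes the stream;
# the stream itself never ends.
running_total = 0
seen = []
for event in itertools.islice(unbounded_stream(), 5):
    running_total += event
    seen.append(running_total)
```

Calling `sorted(unbounded_stream())` would never return, which is the practical reason unbounded input has to be handled continuously rather than all at once.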

Flink programming model (API)

Development is generally done at the third layer, namely the DataStream/DataSet API. Users can use the DataStream API to process unbounded data streams and the DataSet API to process bounded data streams. Both APIs provide a variety of interfaces for processing data, such as the common map, filter, and flatMap, and they support the Python, Scala, and Java programming languages.
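To make the meaning of these three operators concrete, here is a small sketch of their semantics in plain Python. It deliberately does not use the Flink API, and the input lines are made up for the example; it only shows what each transformation does to the elements of a data set.

```python
from itertools import chain

lines = ["flink stream", "batch", "flink batch stream"]

# flatMap: one input element produces zero or more output elements.
words = list(chain.from_iterable(line.split() for line in lines))

# filter: keep only the elements that satisfy a predicate.
kept = [w for w in words if w != "batch"]

# map: transform each element into exactly one output element.
pairs = [(w, 1) for w in kept]
```

In Flink the same three steps would be expressed as operator calls chained on a stream or data set, and the framework would distribute them across the cluster instead of running them in a single process.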