A first look at Flink, a mainstream real-time stream processing framework

2022-07-01 09:44:00 【InfoQ】

Summary

Apache Flink is an open source stream processing framework developed by the Apache Software Foundation. Its core is a distributed streaming dataflow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner, and its pipelined runtime can execute both batch and streaming programs. In addition, the Flink runtime natively supports iterative algorithms. ==Baidu Encyclopedia==

Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink is designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale. ==Apache Flink is an open source stream processing framework for distributed, high-performance, always-available, and accurate streaming applications.==

Features

Low latency real-time stream processing

Easy to write code: Flink is one of the general-purpose big data frameworks of the recent generation. Compared with older frameworks, it is more widely applicable and easier to use.

Support for large, complex state: state of hundreds of GB or more can be stored.

Support for large-scale distributed deployment: Flink has its own Standalone cluster mode and can also be deployed on YARN and K8s.

Fast iteration speed

Accurate results and good fault tolerance

Typical usage scenarios

Abundant machine resources: at least 24 CPU cores and hundreds of GB of memory are available, and the disks of the Flink machines must be SSDs.

High throughput or future scaling needs: ten thousand records per second barely counts as large; one hundred thousand per second can be considered large.

Complex requirements: a lot of complex cleansing, deduplication, transformation, and so on.

Very strict low-latency requirements: latency under 10 seconds counts as low latency; requirements of under 1 second need to be handled very carefully.

Event-driven

An event-driven application is a kind of stateful application. It ingests data from one or more event streams and triggers computations, state updates, or other external actions in response to the incoming events. Typical examples are applications built around message queues such as Kafka, almost all of which are event-driven.
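The idea above can be sketched in a few lines of plain Python. This is only an illustration, not Flink code: the function `run_event_driven`, the event names, and the threshold of 3 are all hypothetical. Each event updates per-user state, and reaching the threshold triggers an action.

```python
from collections import defaultdict


def run_event_driven(events, threshold=3):
    """Consume events one at a time, update per-key state, collect triggered actions."""
    failed_logins = defaultdict(int)  # state: consecutive failures per user
    alerts = []                       # stands in for an external action
    for user, event_type in events:
        if event_type == "login_failed":
            failed_logins[user] += 1                  # state update
            if failed_logins[user] == threshold:      # computation triggered by the event
                alerts.append(f"lock account: {user}")
        elif event_type == "login_ok":
            failed_logins[user] = 0                   # a success resets the state
    return alerts


# Hypothetical sample events: (user, event type)
events = [
    ("alice", "login_failed"),
    ("bob", "login_failed"),
    ("alice", "login_failed"),
    ("bob", "login_ok"),
    ("alice", "login_failed"),
]
alerts = run_event_driven(events)
print(alerts)
```

In a real deployment the events would arrive from a message queue and the state would be managed by the framework; the list and dictionary here only stand in for those parts.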

Stream processing and batch processing

Stream processing and batch processing are two different ways of processing data. The following sections look at the differences between the two in detail.

Batch processing

Batch processing is characterized by bounded, persistent, large-volume data. It is ideal for computations that need access to a complete set of records and is generally used for offline statistics. In other words, the trigger for batch processing is independent of the data itself: it fires on a schedule, after a certain number of records has accumulated, or after a table or a set of files has been imported.

Stream processing

Stream processing is characterized by unbounded, real-time data. It does not operate on the entire dataset; instead, it operates on each data item as it passes through the system, and it is generally used for real-time statistics. In other words, the trigger for stream processing depends on the data. It is an event-driven architecture: each part inspects the trigger-related information as soon as a record arrives and processes it immediately, for example based on an offset, a time, or a specific field value meeting a condition.
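As a rough illustration of the two styles described above, the following plain-Python sketch computes an average both ways. It does not use Flink; the function names and the sample readings are made up for the example. The batch version needs the full dataset before it can answer, while the streaming version emits an updated result after every record.

```python
def batch_average(readings):
    """Batch style: requires the complete, bounded dataset up front."""
    return sum(readings) / len(readings)


def streaming_average(readings):
    """Stream style: update and emit the result after each incoming record."""
    count, total = 0, 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


readings = [10.0, 20.0, 30.0, 40.0]  # hypothetical sample data

batch_result = batch_average(readings)
stream_results = list(streaming_average(readings))

print(batch_result)
print(stream_results)
```

Once the bounded input is exhausted, the last streaming result equals the batch result; the difference is that the streaming version had a usable answer after every record.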

The difference between the two

Data timeliness

Stream computing is real-time and low-latency. | Batch processing is not real-time and has high latency.

Data characteristics

The data in stream computing is generally dynamic and unbounded. | The data in batch processing is generally static.

Application scenarios

Stream computing is applied in real-time scenarios with strict timeliness requirements, such as real-time recommendation and business monitoring.

Batch processing is applied in offline computing scenarios without strict real-time requirements, such as data analysis and offline reports.

Operation mode

A stream computing job runs continuously. | A batch job is one one-off job or a series of them.

Processing efficiency

Stream computing is generally less efficient. Every individual record has to be processed in full, and compensating operations for out-of-order data and state may even be needed. A large amount of computing resources has to be kept ready around the clock, although flexible planning and scheduling can greatly alleviate this problem.

Batch processing is computationally efficient. It processes large volumes of data in one pass and benefits from optimizations such as compression and SIMD, so its performance can easily be orders of magnitude higher than that of stream computing. It runs on demand and consumes no computing resources while it is not running.

Data processing in Flink

In the world of Flink, all data is made up of streams, and any type of data is produced as a stream of events. Credit card transactions, sensor measurements, machine logs, and user interactions on websites or mobile applications are all generated as streams. Offline data is a bounded stream and real-time data is an unbounded stream; these are called bounded flows and unbounded flows.

Unbounded flow

An unbounded flow has a beginning but no defined end. It does not terminate and delivers data as the data is generated. Unbounded flows must be processed continuously, that is, events must be handled promptly after they are ingested. It is not possible to wait for all input data to arrive, because the input is unbounded and will never be complete at any point in time. Processing unbounded data usually requires events to be ingested in a specific order (for example, the order in which they occurred) so that the completeness of the results can be reasoned about.

==An unbounded data flow is data that has a beginning and no end. Once data generation starts, new data keeps being produced, so the data has no time boundary. Unbounded data streams need to be processed continuously.==
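A minimal plain-Python sketch of this behaviour, with no Flink involved: `sensor_events` is a hypothetical source that never ends, and `running_max` handles each event as it arrives and emits an updated result. Because the input never completes, the demonstration only looks at the first six results.

```python
import itertools


def sensor_events():
    """Unbounded source: has a beginning but never signals an end."""
    for seq in itertools.count(start=0):
        yield {"seq": seq, "temperature": 20 + (seq % 5)}


def running_max(events):
    """Process each event on arrival and emit the maximum seen so far."""
    current_max = float("-inf")
    for event in events:
        current_max = max(current_max, event["temperature"])
        yield event["seq"], current_max


# Waiting for "all" input is impossible, so only a finite prefix is inspected.
first_results = list(itertools.islice(running_max(sensor_events()), 6))
print(first_results)
```

The generator is lazy, so nothing is computed beyond the six results that are actually requested.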

Bounded flow

A bounded flow has a defined beginning and end. It can be processed by ingesting all of the data before performing any computation. Processing a bounded flow does not require ordered ingestion, because a bounded dataset can always be sorted. The processing of bounded flows is also called batch processing.

==A bounded data flow is input data that has a beginning and an end, for example the transaction data of one minute or one day.==
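The point that a bounded dataset can always be sorted before processing can be shown with a short plain-Python sketch. The transactions, the field names `ts` and `amount`, and the function `daily_balances` are all hypothetical; the records arrive out of order, are sorted once, and are then processed in time order.

```python
# Hypothetical transactions for one day, deliberately out of order.
transactions = [
    {"ts": "2022-07-01T09:05:00", "amount": 30},
    {"ts": "2022-07-01T09:01:00", "amount": 100},
    {"ts": "2022-07-01T09:03:00", "amount": -40},
]


def daily_balances(records):
    """Sort the complete bounded input by timestamp, then compute running balances."""
    ordered = sorted(records, key=lambda r: r["ts"])  # possible because the input is finite
    balance = 0
    balances = []
    for record in ordered:
        balance += record["amount"]
        balances.append((record["ts"], balance))
    return balances


balances = daily_balances(transactions)
print(balances)
```

The ISO-style timestamps sort correctly as strings, which keeps the sketch free of date parsing.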

Flink programming model (API)

The third layer, the DataStream/DataSet API, is the one used for development. Users can use the DataStream API to process unbounded data streams and the DataSet API to process bounded data streams. Both APIs provide a variety of operators for processing data, such as the common map, filter, and flatMap, and they support the Python, Scala, and Java programming languages.
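To show what map, filter, and flatMap mean, here is a sketch in plain Python. It deliberately does not call the Flink API: `flat_map` is a hypothetical helper written for this example, and the built-in `filter` and `map` stand in for the operators of the same name. The sample lines are made up.

```python
def flat_map(fn, stream):
    """flatMap semantics: each input element may produce zero or more outputs."""
    for item in stream:
        yield from fn(item)


lines = ["flink stream", "batch", "flink batch stream"]  # hypothetical input

words = flat_map(str.split, lines)              # flatMap: one line -> many words
kept = filter(lambda w: w != "batch", words)    # filter: drop unwanted elements
upper = map(str.upper, kept)                    # map: one element -> one element

result = list(upper)
print(result)
```

The three steps are chained lazily, so each word flows through all operators one at a time, which is loosely similar to how records move through a pipeline of operators.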

Copyright notice
This article was created by [InfoQ]. Please include the original link when reposting. Thanks.
https://yzsam.com/2022/182/202207010936155154.html