What is Flink? What can Flink do?
2022-06-25 12:09:00 【Ink mark vs. breeze】
Summary

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink runs in all common cluster environments and performs computations at in-memory speed and at any scale.
Apache Flink is powerful and supports the development and execution of many different kinds of applications. Its key features include unified batch and stream processing, sophisticated state management, event-time semantics, and exactly-once state consistency guarantees. Flink not only runs on resource managers such as YARN, Mesos, and Kubernetes, but also supports standalone deployment on bare-metal clusters. With its high-availability option enabled, it has no single point of failure. Flink has been proven to scale to thousands of cores with state on the order of terabytes, while still maintaining high throughput and low latency. Many demanding stream processing applications around the world run on Flink.
Next, let's walk through some of Flink's important concepts.
Batch and stream processing
Batch processing is characterized by bounded, persistent, large-volume data. It is ideal for computations that require access to a complete set of records and is typically used for offline statistics.
Stream processing is characterized by unbounded, real-time data. Instead of operating on an entire dataset, it operates on each data item as it flows through the system, and is typically used for real-time statistics.
In Spark's world view, everything is composed of batches: offline data is one large batch, and real-time data is an infinite series of small batches. In Flink's world view, everything is composed of streams: offline data is a bounded stream, and real-time data is an unbounded stream. Hence the terms bounded stream and unbounded stream.
- Unbounded streams: have a defined start but no defined end. They produce data endlessly, so the data must be processed continuously, i.e., each event needs to be handled as soon as it is ingested. We cannot wait for all the data to arrive, because the input is infinite and will never be complete. Processing unbounded data usually requires ingesting events in a specific order, such as the order in which they occurred, so that the completeness of results can be reasoned about.
- Bounded streams: have a defined start and a defined end. A bounded stream can be processed after all of its data has been ingested. All data in a bounded stream can be sorted, so ordered ingestion is not required. Processing of bounded streams is also known as batch processing.

Flink excels at processing both unbounded and bounded data sets. Precise control of time and state enables Flink's runtime to run any kind of application on unbounded streams. Bounded streams are processed internally by algorithms and data structures specifically designed for fixed-size data sets, yielding excellent performance. The sketch below contrasts a bounded source with an unbounded one.
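To make the distinction concrete, here is a minimal DataStream sketch (assuming a Flink 1.x Java project with flink-streaming-java on the classpath; the socket address is an illustrative placeholder). The first source is finite, the second keeps producing events until the job is cancelled.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedVsUnbounded {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounded stream: a fixed, finite set of elements -- Flink can
        // process it like a batch job and then finish.
        DataStream<String> bounded = env.fromElements("a", "b", "c");

        // Unbounded stream: events keep arriving (here from a socket),
        // so the job runs continuously until cancelled.
        DataStream<String> unbounded = env.socketTextStream("localhost", 9999);

        bounded.print();
        unbounded.print();
        env.execute("bounded-vs-unbounded");
    }
}
```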
Deploy it anywhere
Apache Flink is a distributed system and requires compute resources in order to execute applications. Flink integrates with all common cluster resource managers, such as Hadoop YARN, Apache Mesos, and Kubernetes, but can also run as a standalone cluster.
Flink is designed to work well with each of the resource managers listed above. This is achieved through resource-manager-specific deployment modes that allow Flink to interact with each resource manager in its idiomatic way.
When deploying a Flink application, Flink automatically identifies the required resources based on the application's configured parallelism and requests them from the resource manager. In case of a failure, Flink replaces the failed container by requesting new resources. All communication to submit or control an application happens via REST calls, which eases the integration of Flink into many environments.
Leverage in-memory performance
Stateful Flink applications are optimized for local state access. Task state is always kept in memory or, if its size exceeds the available memory, in disk-based data structures that can be accessed efficiently. Tasks therefore perform all computations by accessing local, usually in-memory, state, which yields very low processing latency. Flink guarantees exactly-once state consistency in failure scenarios by periodically and asynchronously persisting the local state to durable storage. A configuration sketch follows.
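As a minimal configuration sketch (assuming Flink 1.13+ with the flink-statebackend-rocksdb dependency), switching from the default heap state backend to RocksDB might look like this:

```java
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // By default, state lives on the JVM heap for the lowest access
        // latency. For state larger than memory, switch to RocksDB, which
        // keeps state in efficiently accessible on-disk structures.
        env.setStateBackend(new EmbeddedRocksDBStateBackend());
    }
}
```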

Layered APIs
Flink provides three different APIs at different levels of abstraction. Each API offers a different trade-off between conciseness and expressiveness and targets different application scenarios.

- ProcessFunction: processes individual events from one or two input streams, or events grouped into a window. It provides fine-grained control over time and state: developers can freely modify state and register timers that trigger callback functions at some point in the future. ProcessFunction can therefore implement the complex per-event business logic that many stateful, event-driven applications require (see the first sketch after this list).
- DataStream API: provides primitives for many common stream processing operations, such as windowing, record-at-a-time transformations, and enrichment by querying an external data store. The DataStream API is available for Java and Scala and offers predefined functions such as map(), reduce(), and aggregate(). Custom functions can be defined by extending the predefined interfaces or by using Java or Scala lambda expressions (see the second sketch after this list).
- SQL & Table API: Flink features two relational APIs, the Table API and SQL. Both are unified APIs for batch and stream processing: queries are executed with the same semantics and produce the same results whether they run on an unbounded real-time stream or a bounded historical stream. The Table API and SQL use Apache Calcite for parsing, validation, and query optimization. They integrate seamlessly with the DataStream and DataSet APIs and support user-defined scalar, aggregate, and table-valued functions. Flink's relational APIs are designed to simplify the definition of data analytics, data pipelining, and ETL applications (see the third sketch after this list).
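First, a minimal KeyedProcessFunction sketch — the Long input type, String key, and 60-second timeout are illustrative assumptions — showing per-key state plus a timer callback:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Emits an alert when a key has been silent for 60 seconds.
public class InactivityAlert extends KeyedProcessFunction<String, Long, String> {
    private transient ValueState<Long> lastTimer;

    @Override
    public void open(Configuration parameters) {
        lastTimer = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastTimer", Long.class));
    }

    @Override
    public void processElement(Long value, Context ctx, Collector<String> out) throws Exception {
        // Register a timer 60s from now and remember it in keyed state.
        long timer = ctx.timerService().currentProcessingTime() + 60_000;
        ctx.timerService().registerProcessingTimeTimer(timer);
        lastTimer.update(timer);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        // Only fire for the most recently registered timer.
        if (lastTimer.value() != null && lastTimer.value() == timestamp) {
            out.collect("key " + ctx.getCurrentKey() + " inactive for 60s");
        }
    }
}
```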
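Second, a DataStream API sketch of the classic windowed word count (the socket source on port 9999 is an illustrative placeholder):

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WindowedWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)
           // Record-at-a-time transformation: one (word, 1) pair per word.
           .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
               for (String w : line.split("\\s+")) out.collect(Tuple2.of(w, 1));
           })
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(t -> t.f0)
           // Count words per 10-second tumbling window.
           .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
           .sum(1)
           .print();

        env.execute("windowed-word-count");
    }
}
```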
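Third, a Table API & SQL sketch (assuming Flink 1.13+ with the table planner and the built-in datagen connector on the classpath; table and column names are illustrative). The same query would run with identical semantics on a bounded table:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SqlSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // An unbounded demo source; a bounded source would use the same query.
        tEnv.executeSql(
            "CREATE TABLE orders (product STRING, amount DOUBLE) " +
            "WITH ('connector' = 'datagen')");

        // Continuously updated aggregation over the stream.
        tEnv.executeSql(
            "SELECT product, SUM(amount) AS total FROM orders GROUP BY product")
            .print();
    }
}
```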
Features
Apache Flink is a third-generation stream processing engine that combines many competitive strengths. The following characteristics make it stand out among similar systems.
1. Supports high throughput, low latency, and high performance simultaneously.
Flink is the only open-source distributed stream processing framework that combines high throughput, low latency, and high performance in one system. Apache Spark, by comparison, offers only high throughput and high performance, mainly because Spark Streaming cannot guarantee low latency in streaming computation; the stream computing framework Apache Storm supports only low latency and high performance, but cannot deliver high throughput.
2. Supports both event-time and processing-time semantics.
Window computation is essential in stream processing, yet most frameworks compute windows using processing time, i.e., the clock time of the host machine when the event reaches the framework. Flink also supports window computation based on event-time semantics, i.e., the time at which an event actually occurred. This event-driven mechanism lets the streaming system compute accurate results even when events arrive out of order, preserving the original ordering of events. A watermark sketch follows.
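A sketch of attaching event time and watermarks to a stream (the Event POJO and its timestampMillis field are illustrative; assumes Flink 1.11+, where WatermarkStrategy is available):

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EventTimeSketch {
    // Illustrative event type carrying its own occurrence time.
    public static class Event {
        public String key;
        public long timestampMillis;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Event> events = env.fromElements(new Event()); // placeholder source

        DataStream<Event> withEventTime = events.assignTimestampsAndWatermarks(
                WatermarkStrategy
                    // Tolerate events that arrive up to 5s out of order.
                    .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    // Tell Flink where the event time lives in each record.
                    .withTimestampAssigner((event, ts) -> event.timestampMillis));

        withEventTime.print();
        env.execute("event-time-sketch");
    }
}
```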
3. Supports stateful computation with exactly-once state consistency guarantees.
State refers to the intermediate results that an operator keeps in memory or in a file system during stream computation. When the next event enters the operator, the new result can be computed from the previously stored intermediate result, so there is no need to recompute from all the raw data every time. This greatly improves system performance and reduces the resource consumption of the computation. A keyed-state sketch follows.
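A keyed-state sketch (input and output types are illustrative): each event updates a per-key running sum held in managed state instead of recomputing from history.

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Keeps a running sum per key in managed keyed state, so each new event
// only needs the previous intermediate result, not the full history.
public class RunningSum extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {
    private transient ValueState<Long> sum;

    @Override
    public void open(Configuration parameters) {
        sum = getRuntimeContext().getState(new ValueStateDescriptor<>("sum", Long.class));
    }

    @Override
    public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out) throws Exception {
        long current = sum.value() == null ? 0L : sum.value();
        current += in.f1;
        sum.update(current);          // persisted and checkpointed by Flink
        out.collect(Tuple2.of(in.f0, current));
    }
}
```

It would typically be applied after keyBy(t -> t.f0) so that the state is scoped to each key.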
4. Fault-tolerance mechanism based on lightweight distributed snapshots.
Flink can run distributed across thousands of nodes, breaking a large computation into many small tasks that are distributed to nodes for parallel processing. During execution, errors in event processing could lead to inconsistent data; to guard against this, Flink uses checkpoints based on distributed snapshot technology to periodically persist the state of the computation. If a task terminates abnormally, Flink automatically restores it from the latest checkpoint, guaranteeing consistency of data processing. A checkpoint configuration sketch follows.
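A checkpoint configuration sketch (the interval and retention policy are illustrative choices; API names follow Flink 1.x):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Draw a lightweight distributed snapshot every 10s with
        // exactly-once state consistency.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        CheckpointConfig config = env.getCheckpointConfig();
        // Keep completed checkpoints when the job is cancelled so the
        // job can later be restored from them.
        config.enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
    }
}
```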
5. High availability and dynamic scaling for 7 × 24 operation.
Flink supports high-availability configurations (no single point of failure), integrates tightly with Kubernetes, YARN, and Apache Mesos, recovers quickly from failures, and can scale up and down dynamically. These characteristics allow streaming applications to run 7 × 24 with virtually no downtime. When an application must be updated dynamically or recovered quickly, Flink's savepoint mechanism writes a snapshot of the running job to storage media; on restart, the job resumes directly from the most recently saved savepoint and continues computing as if it had never stopped.
6. Supports highly flexible window operations.
Flink supports windows based on time, count, sessions, and data-driven conditions. Windows can be customized with flexible trigger conditions to support sophisticated streaming patterns, and users can define their own window-triggering mechanisms to meet different needs. A sketch of the main window shapes follows.
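A sketch of the main window shapes (the key/value types and the window sizes are illustrative):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowKinds {
    // `keyed` stands for any keyed stream with event-time timestamps assigned.
    static void examples(KeyedStream<Tuple2<String, Long>, String> keyed) {
        // Tumbling: fixed, non-overlapping 1-minute buckets.
        keyed.window(TumblingEventTimeWindows.of(Time.minutes(1)));

        // Sliding: 1-minute windows that advance every 10 seconds.
        keyed.window(SlidingEventTimeWindows.of(Time.minutes(1), Time.seconds(10)));

        // Session: a window closes after a 30-second gap in events --
        // a data-driven window shape.
        keyed.window(EventTimeSessionWindows.withGap(Time.seconds(30)));

        // Count-based: fires every 100 elements per key.
        keyed.countWindow(100);
    }
}
```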
Application scenarios
In real production environments, vast amounts of data are generated continuously: financial transaction data, internet order data, GPS positioning data, sensor signals, data from mobile devices, telecom signaling data, as well as the familiar network traffic monitoring data and server logs. What all of these have in common is that they are generated in real time from different sources and then transmitted to downstream analysis systems.
These data types give rise to the following main scenarios, all of which Flink supports very well.
1. Real-time intelligent recommendation
Flink stream computing helps build more responsive recommendation systems: computing user behavior metrics in real time, updating models in real time, predicting user preferences in real time, and pushing the predictions to web or mobile clients. This helps users find the information they want and, in turn, helps businesses increase sales and create greater commercial value.
2. Complex event processing
Complex event processing in, for example, industrial settings involves very large data volumes with strict latency requirements. Flink's CEP (complex event processing) library can extract event patterns, while Flink SQL transforms the event data, making it possible to build a real-time rules engine inside the streaming system. A CEP sketch follows.
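A CEP sketch (assuming the flink-cep dependency; the Event type, the type strings, and the 10-second constraint are illustrative): detect a warning followed by a failure within ten seconds.

```java
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;

public class CepSketch {
    public static class Event {
        public String type;
        public double value;
    }

    // A "warning followed by failure within 10s" pattern.
    static PatternStream<Event> detect(DataStream<Event> events) {
        Pattern<Event, Event> pattern = Pattern.<Event>begin("warning")
                .where(new SimpleCondition<Event>() {
                    @Override
                    public boolean filter(Event e) { return "WARN".equals(e.type); }
                })
                .next("failure")
                .where(new SimpleCondition<Event>() {
                    @Override
                    public boolean filter(Event e) { return "FAIL".equals(e.type); }
                })
                .within(Time.seconds(10));
        return CEP.pattern(events, pattern);
    }
}
```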
3. Real-time fraud detection
Financial businesses face many kinds of fraud. With Flink's stream computing technology, fraud-indicator metrics can be computed within milliseconds, and rule checks or model predictions can then be applied to the transaction stream in real time. As soon as a transaction is suspected of fraud, it can be intercepted immediately, avoiding the economic losses caused by delayed handling.
4. Real-time data warehouse and ETL
On one hand, combined with the offline data warehouse, Flink's many strengths and flexible SQL processing can clean, merge, and structure streaming data in real time, supplementing and optimizing the offline warehouse. On the other hand, its real-time ETL capability with stateful stream computing can greatly reduce the complexity of scheduling logic in offline data computation, producing the statistics an enterprise needs quickly and efficiently and helping it apply real-time analysis results more effectively.
5. Streaming data analysis
Computing various data metrics in real time and using the results to promptly adjust the strategies of online systems has many applications in areas such as multi-category product launches and intelligent push on mobile. Stream computing technology makes data analysis real-time, helping enterprises analyze the metrics of their web applications or mobile apps as they happen.
6. Real-time report analysis
Real-time reporting is one of the reporting approaches many companies have adopted in recent years, and one of its most important applications is the live dashboard. Results computed by the streaming engine are pushed directly to frontend applications to display changes in key metrics in real time. The most typical example is Taobao's real-time "Double 11" dashboard.
Flink vs. Spark Streaming
Data model
Flink's basic data model is the data stream, i.e., a sequence of events.
Spark uses the RDD model; Spark Streaming's DStream is in fact a collection of small-batch RDDs.
Runtime architecture
Flink uses a standard streaming execution model: after an event is processed on one node, it can be sent directly to the next node for processing.
Spark uses batch computation: the DAG is divided into stages, and the next stage can begin only after the previous one has completed.