当前位置：网站首页>Flink spark vs. Flink

Flink spark vs. Flink

2022-06-11 12:10:00 【But don't ask about your future】

List of articles

Preface

Preface

Apache Spark It is a general large-scale data analysis engine . Its memory computing concept allows us to learn from Hadoop Heavy MapReduce Get out of the program . In addition to fast calculation 、 High scalability ,Spark Also for batch processing （Spark SQL）、 Stream processing （Spark Streaming）、 machine learning （Spark MLlib）、 Figure calculation （Spark GraphX） It provides a unified distributed data processing platform , After years of vigorous development, the whole ecology has been very perfect .
and Flink In the field of real-time processing , Want to learn a big data processing framework , Choice in the end Spark, still Flink ？

1. Data processing architecture

Batch processing is for bounded data sets , It is very suitable for computing work that requires access to a large amount of all data , It is generally used for offline statistics .
Stream processing is mainly aimed at data streams , Characterized by unbounded 、 real time , Perform operations on each data transmitted by the system in turn , Generally used for real-time statistics .

Spark Based on batch processing , And try to support stream computing on a batch basis ; stay Spark In my world view , Offline data is a large batch , The real-time data is composed of infinite small batches . So for the stream processing framework Spark Streaming for , It's not really “ flow ” Handle , It is “ Micro batch ”（micro-batching） Handle
and Flink Think , Stream processing is the most basic operation , Batch processing can also be unified as stream processing . stay Flink In my world view , Real time data is standard 、 A stream without boundaries , Offline data is a bounded stream .

（1） Unbounded data flow （Unbounded Data Stream）
The so-called unbounded data flow , There is a head but no tail , The generation and transmission of data will begin but never end . So for unbounded data streams , Must be handled continuously , That is, it must be processed immediately after the data is obtained . When dealing with unbounded flows , In order to ensure the correctness of the results , It must be possible to process data in sequence .

（2） Bounded data flow （Bounded Data Stream）
Bounded data flows have well-defined beginning and end , So you can process bounded flows by getting all the data . There is no need to strictly guarantee the order of data when dealing with bounded flows , Because you can sort bounded data sets . The processing of bounded flows is also called batch processing
Insert picture description here

Generally speaking ,Spark There is always a way to synchronize based on micro batch processing “ Save batch ” The process of , So there will be extra expenses , Therefore, the low latency of stream processing cannot be maximized . In low latency stream processing scenarios ,Flink There are already obvious advantages . In the field of batch processing of massive data ,Spark Can handle more throughput , Plus its perfect ecology and mature and easy-to-use API, At present, the same advantages are obvious .

2. Data model and operational architecture

Spark and Flink The main difference in the underlying implementation is the data model ：

Spark The underlying data model is an elastic distributed data set （RDD）,Spark Streaming The underlying interface for micro batch processing DStream, In fact, it is also a small batch of data RDD Set . It can be seen that ,Spark The design itself is based on batch data sets , More suitable for batch processing scenarios .

and Flink The basic data model is data flow （DataFlow）, And events （Event） Sequence .Flink Basically in full accordance with Google Of DataFlow Model implementation , So from the bottom data model ,Flink It is designed to process streaming data , More suitable for the scene of stream processing .

The data model is different , Corresponding to the process of running processing , Naturally, there will be different architectures .Spark You need to set the task corresponding to DAG To divide into stages （Stage）, After one is finished, it goes through shuffle Then proceed to the next stage of calculation . and Flink Is the standard streaming execution mode , After an event is processed by one node, it can be directly sent to the next node for processing .

3. Spark still Flink？

If you need to learn from Spark and Flink Choose one of these two mainstream frameworks for real-time streaming , It is more recommended to use Flink, The main reasons are ：