当前位置:网站首页>Flink learning 8: data consistency
Flink learning 8: data consistency
2022-07-04 04:18:00 【hzp666】
1. brief introduction
In the distributed stream processing engine , High throughput Low latency , Is the core requirement .
At the same time, data consistency is also very important in distributed applications .
( In a precise scene , Accuracy and consistency are often required )
2.flink Data consistency
flink How to ensure the consistency of calculation state .
Asynchronous barrier snapshot mechanism , To achieve accurate data consistency .
When the task crashes or is canceled , You can use checkpoints or savepoints , To achieve recovery , Realize the replay of data flow , So as to achieve the consistency of tasks .( This mechanism will not sacrifice system performance )
2.1 Stateful and stateless events
Let's first look at what is a state event :
1. No state , That is, each event is independent , There is no correlation between events .
The output result is only related to the current event .
eg: Statistical weather temperature , When more than 40°C When , Issue high temperature alarm .( It has nothing to do with the previous temperature )
2. A stateful , That is, the event is related to the previous event state .
The output is a combination of previous events , Results of comprehensive consideration
eg: Count the recent 1 Hourly average temperature ,
2.2 Data consistency
When distributed systems introduce state , Naturally, the problem of data consistency is introduced .
According to the different correctness , Can be divided into 3 class :
1. The correctness is the lowest : At most once . When the fault occurs , Don't do anything? .
2. Medium accuracy : At least once . When the fault occurs , The system will not miss previous events , But the calculation may be repeated .( The final statistical value may be greater than or equal to the real data value )
3. The highest accuracy : Exactly the same . The aggregation result is consistent with the result without failure .
“” Exactly the same “” relative “” At least once “”, The system will be more complex , The processing speed will be relatively slow . Because there will be data alignment .
At the very beginning storm,samza At least once ,
Later, Storm Trident and Spark Streaming Although the accuracy and consistency are guaranteed , But at the expense of a lot of performance .
Flink Without sacrificing too much performance , Ensure accuracy once .
2.3 Flink Asynchronous barrier snapshot mechanism
2.3.1 Snapshot mechanism
First, let's see what the snapshot mechanism is : Record job status and data flow regularly
2.3.2 But the traditional snapshot mechanism , There are two main problems :
2.3.3 flink How to optimize the snapshot mechanism
1. Adopt asynchronous snapshot mechanism . be based on chandy-lamport Algorithm , A checkpoint mechanism has been developed , It is called asynchronous barrier checkpoint mechanism .
2. Asynchronous barrier snapshot mechanism
3. Checkpoint barrier , Is a special kind of internal message ,
Divide the data flow into multiple windows in time ,
One window corresponds to , A snapshot in the data stream .
Barrier by JobManager Broadcast to all computing tasks regularly source, And flow downstream with the data flow .
Each barrier is located at , Current snapshot and The split point of the next snapshot .
When the downstream data check the voucher , The snapshot action will be triggered , There is no need to pause this computing task .
4. In asynchronous checkpoints “ asynchronous ”
边栏推荐
- Two sides of the evening: tell me about the bloom filter and cuckoo filter? Application scenario? I'm confused..
- User defined path and file name of Baidu editor in laravel admin
- idea修改主体颜色
- AAAI2022 | Word Embeddings via Causal Inference: Gender Bias Reducing and Semantic Information Preserving
- 2021 RSC | Drug–target affinity prediction using graph neural network and contact maps
- 毕业三年,远程半年 | 社区征文
- Katalon framework tests web (XXI) to obtain element attribute assertions
- Introduction to asynchronous task capability of function calculation - task trigger de duplication
- Storage of MySQL database
- Global exposure and roller shutter exposure of industrial cameras
猜你喜欢
*. No main manifest attribute in jar
Flink学习6:编程模型
AAAI2022 | Word Embeddings via Causal Inference: Gender Bias Reducing and Semantic Information Preserving
laravel admin里百度编辑器自定义路径和文件名
SQL語句加强練習(MySQL8.0為例)
The three-year revenue is 3.531 billion, and this Jiangxi old watch is going to IPO
I was tortured by my colleague's null pointer for a long time, and finally learned how to deal with null pointer
Katalon framework test web (XXVI) automatic email
毕业设计:设计秒杀电商系统
如何远程办公更有效率 | 社区征文
随机推荐
2020 Bioinformatics | TransformerCPI
vue多级路由嵌套怎么动态缓存组件
Redis cluster view the slots of each node
Database SQL statement summary, continuous update
The difference between bagging and boosting in machine learning
Flink学习8:数据的一致性
CesiumJS 2022^ 源码解读[0] - 文章目录与源码工程结构
毕业总结
量子力学习题
Cesiumjs 2022^ source code interpretation [0] - article directory and source code engineering structure
Introduction to asynchronous task capability of function calculation - task trigger de duplication
Katalon framework tests web (XXI) to obtain element attribute assertions
STM32 external DHT11 display temperature and humidity
Flink学习7:应用程序结构
【微服务|openfeign】feign的两种降级方式|Fallback|FallbackFactory
Small record of thinking
idea修改主体颜色
软件测试是干什么的 发现缺陷错误,提高软件的质量
Detailed explanation of PPTC self recovery fuse
STM32外接DHT11显示温湿度