当前位置:网站首页>Flink learning 8: data consistency
Flink learning 8: data consistency
2022-07-04 04:18:00 【hzp666】
1. brief introduction
In the distributed stream processing engine , High throughput Low latency , Is the core requirement .
At the same time, data consistency is also very important in distributed applications .
( In a precise scene , Accuracy and consistency are often required )
2.flink Data consistency
flink How to ensure the consistency of calculation state .
Asynchronous barrier snapshot mechanism , To achieve accurate data consistency .
When the task crashes or is canceled , You can use checkpoints or savepoints , To achieve recovery , Realize the replay of data flow , So as to achieve the consistency of tasks .( This mechanism will not sacrifice system performance )
2.1 Stateful and stateless events
Let's first look at what is a state event :
1. No state , That is, each event is independent , There is no correlation between events .
The output result is only related to the current event .
eg: Statistical weather temperature , When more than 40°C When , Issue high temperature alarm .( It has nothing to do with the previous temperature )
2. A stateful , That is, the event is related to the previous event state .
The output is a combination of previous events , Results of comprehensive consideration
eg: Count the recent 1 Hourly average temperature ,
2.2 Data consistency
When distributed systems introduce state , Naturally, the problem of data consistency is introduced .
According to the different correctness , Can be divided into 3 class :
1. The correctness is the lowest : At most once . When the fault occurs , Don't do anything? .
2. Medium accuracy : At least once . When the fault occurs , The system will not miss previous events , But the calculation may be repeated .( The final statistical value may be greater than or equal to the real data value )
3. The highest accuracy : Exactly the same . The aggregation result is consistent with the result without failure .
“” Exactly the same “” relative “” At least once “”, The system will be more complex , The processing speed will be relatively slow . Because there will be data alignment .
At the very beginning storm,samza At least once ,
Later, Storm Trident and Spark Streaming Although the accuracy and consistency are guaranteed , But at the expense of a lot of performance .
Flink Without sacrificing too much performance , Ensure accuracy once .
2.3 Flink Asynchronous barrier snapshot mechanism
2.3.1 Snapshot mechanism
First, let's see what the snapshot mechanism is : Record job status and data flow regularly
2.3.2 But the traditional snapshot mechanism , There are two main problems :
2.3.3 flink How to optimize the snapshot mechanism
1. Adopt asynchronous snapshot mechanism . be based on chandy-lamport Algorithm , A checkpoint mechanism has been developed , It is called asynchronous barrier checkpoint mechanism .
2. Asynchronous barrier snapshot mechanism
3. Checkpoint barrier , Is a special kind of internal message ,
Divide the data flow into multiple windows in time ,
One window corresponds to , A snapshot in the data stream .
Barrier by JobManager Broadcast to all computing tasks regularly source, And flow downstream with the data flow .
Each barrier is located at , Current snapshot and The split point of the next snapshot .
When the downstream data check the voucher , The snapshot action will be triggered , There is no need to pause this computing task .
4. In asynchronous checkpoints “ asynchronous ”
边栏推荐
- Three years of graduation, half a year of distance | community essay solicitation
- The difference between bagging and boosting in machine learning
- 指针数组和数组指针
- [webrtc] M98 Ninja build and compile instructions
- 01 QEMU starts the compiled image vfs: unable to mount root FS on unknown block (0,0)
- User defined path and file name of Baidu editor in laravel admin
- Perf simple process for multithreaded profile
- Katalon uses script to query list size
- laravel admin里百度编辑器自定义路径和文件名
- Common methods of threads
猜你喜欢
随机推荐
Redis cluster view the slots of each node
TCP-三次握手和四次挥手简单理解
如何有效远程办公之我见 | 社区征文
Wechat official account web page authorization
量子力学习题
How to telecommute more efficiently | community essay solicitation
Redis cluster uses Lua script. Lua script can also be used for different slots
Tcpclientdemo for TCP protocol interaction
pytest多进程/多线程执行测试用例
图解网络:什么是热备份路由器协议HSRP?
[paddleseg source code reading] paddleseg custom data class
Is it safe to buy insurance for your children online? Do you want to buy a million dollar medical insurance for your children?
Select sorting and bubble sorting template
Flink learning 7: application structure
Graduation summary
Pandora IOT development board learning (HAL Library) - Experiment 6 independent watchdog experiment (learning notes)
leetcode刷题:二叉树04(二叉树的层序遍历)
vim正确加区间注释
Database SQL statement summary, continuous update
透过JVM-SANDBOX源码,了解字节码增强技术原理