当前位置:网站首页>Talk about the understanding of fault tolerance mechanism and state consistency in Flink framework
Talk about the understanding of fault tolerance mechanism and state consistency in Flink framework
2022-07-05 10:41:00 【InfoQ】
Fault tolerance mechanism
- restart app
- from checkpoint Read status in , Reset the status
- Start consuming and processing all data from checkpoint to failure
public class Test{
public static void main(String[] args) throws Exception{
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Configuration of checkpoints
// The checkpoint start cycle is milliseconds
env.enableCheckpointing(300);
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// Set timeout
env.getCheckpointConfig().setCheckpointTimeout(60000);
// Set the maximum execution checkp Number
env.getCheckpointConfig().setMaxConcurrentCheckpoints(2);
// Set the minimum interval , Two checkpoint Between , The interval between the previous completion and the next start cannot be less than XXX
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(100);
//env.getCheckpointConfig().setPreferCheckpointForRecovery();
// tolerate checkpoint How many times have you failed
env.getCheckpointConfig().setTolerableCheckpointFailureNumber(0);
// Configuration of restart policy
// Fixed delay restart every other 10 Seconds to restart three times
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3,10000L));
// Failure rate restart stay 10 Count restarts within minutes , stay 10 Try to restart three times in minutes , Every septum 1 Minutes to restart
env.setRestartStrategy(RestartStrategies.failureRateRestart(3, Time.minutes(10),Time.minutes(1)));
DataStreamSource<String> inputStream = env.socketTextStream("localhost", 7777);
DataStream<SensorReading> dataStream = inputStream.map(line -> {
String[] fields = line.split(",");
return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
});
env.execute();
}
}
State consistency
- AT-MOST-ONCE( At most once ) When the mission fails , The easiest way is to do nothing , Neither restore the lost state , No replay of lost data .At-most-once The meaning of semantics is to handle an event at most once .
- AT-LEAST-ONCE( At least once ) In most real application scenarios , We hope not to lose Events . This type of protection is called atleast-once, It means that all the events have been dealt with , And some events can be handled multiple times .
- EXACTLY-ONCE( Exactly once ) Just once is the strictest guarantee , It's also the most difficult . Just dealing with semantics once doesn't just mean that no events are lost , It also means for every data , The internal state is only updated once .
@Public
public enum CheckpointingMode {
/**
* Sets the checkpointing mode to "exactly once". This mode means that the system will
* checkpoint the operator and user function state in such a way that, upon recovery,
* every record will be reflected exactly once in the operator state.
*
* <p>For example, if a user function counts the number of elements in a stream,
* this number will consistently be equal to the number of actual elements in the stream,
* regardless of failures and recovery.</p>
*
* <p>Note that this does not mean that each record flows through the streaming data flow
* only once. It means that upon recovery, the state of operators/functions is restored such
* that the resumed data streams pick up exactly at after the last modification to the state.</p>
*
* <p>Note that this mode does not guarantee exactly-once behavior in the interaction with
* external systems (only state in Flink's operators and user functions). The reason for that
* is that a certain level of "collaboration" is required between two systems to achieve
* exactly-once guarantees. However, for certain systems, connectors can be written that facilitate
* this collaboration.</p>
*
* <p>This mode sustains high throughput. Depending on the data flow graph and operations,
* this mode may increase the record latency, because operators need to align their input
* streams, in order to create a consistent snapshot point. The latency increase for simple
* dataflows (no repartitioning) is negligible. For simple dataflows with repartitioning, the average
* latency remains small, but the slowest records typically have an increased latency.</p>
*/
EXACTLY_ONCE,
/**
* Sets the checkpointing mode to "at least once". This mode means that the system will
* checkpoint the operator and user function state in a simpler way. Upon failure and recovery,
* some records may be reflected multiple times in the operator state.
*
* <p>For example, if a user function counts the number of elements in a stream,
* this number will equal to, or larger, than the actual number of elements in the stream,
* in the presence of failure and recovery.</p>
*
* <p>This mode has minimal impact on latency and may be preferable in very-low latency
* scenarios, where a sustained very-low latency (such as few milliseconds) is needed,
* and where occasional duplicate messages (on recovery) do not matter.</p>
*/
AT_LEAST_ONCE
}
- flink Internal assurance : rely on checkpoint
- source End : An external source is required to reset the read location of the data
- sink End : Need to ensure recovery from failure , Data is not repeatedly written to external systems . Two ways : Idempotent writing 、 Transaction write .

边栏推荐
- Activity jump encapsulation
- iframe
- How does redis implement multiple zones?
- [paper reading] kgat: knowledge graph attention network for recommendation
- AtCoder Beginner Contest 258「ABCDEFG」
- Learning II of workmanager
- 上拉加载原理
- 沟通的艺术III:看人之间 之倾听
- PWA (Progressive Web App)
- PHP solves the problems of cache avalanche, cache penetration and cache breakdown of redis
猜你喜欢
Go语言-1-开发环境配置
Learning Note 6 - satellite positioning technology (Part 1)
Redis如何实现多可用区?
[observation] with the rise of the "independent station" model of cross-border e-commerce, how to seize the next dividend explosion era?
[paper reading] kgat: knowledge graph attention network for recommendation
ModuleNotFoundError: No module named ‘scrapy‘ 终极解决方式
C语言实现QQ聊天室小项目 [完整源码]
谈谈对Flink框架中容错机制及状态的一致性的理解
微信核酸检测预约小程序系统毕业设计毕设(6)开题答辩PPT
2022年化工自动化控制仪表考试试题及在线模拟考试
随机推荐
Excerpt from "sword comes" (VII)
Solution of ellipsis when pytorch outputs tensor (output tensor completely)
Completion report of communication software development and Application
LDAP overview
DGL中异构图的一些理解以及异构图卷积HeteroGraphConv的用法
Learning note 4 -- Key Technologies of high-precision map (Part 2)
A large number of virtual anchors in station B were collectively forced to refund: revenue evaporated, but they still owe station B; Jobs was posthumously awarded the U.S. presidential medal of freedo
WorkManager学习一
Golang应用专题 - channel
5g NR system architecture
AD20 制作 Logo
Atcoder beginer contest 254 "e BFS" f st table maintenance differential array GCD "
Activity jump encapsulation
各位大佬,我测试起了3条线程同时往3个mysql表中写入,每条线程分别写入100000条数据,用了f
uniapp
2022年危险化学品经营单位主要负责人特种作业证考试题库及答案
SQL Server monitoring statistics blocking script information
2022鹏城杯web
Workmanager learning 1
正则表达式