当前位置:网站首页>Talk about the understanding of fault tolerance mechanism and state consistency in Flink framework
Talk about the understanding of fault tolerance mechanism and state consistency in Flink framework
2022-07-05 10:41:00 【InfoQ】
Fault tolerance mechanism
- restart app
- from checkpoint Read status in , Reset the status
- Start consuming and processing all data from checkpoint to failure
public class Test{
public static void main(String[] args) throws Exception{
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Configuration of checkpoints
// The checkpoint start cycle is milliseconds
env.enableCheckpointing(300);
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// Set timeout
env.getCheckpointConfig().setCheckpointTimeout(60000);
// Set the maximum execution checkp Number
env.getCheckpointConfig().setMaxConcurrentCheckpoints(2);
// Set the minimum interval , Two checkpoint Between , The interval between the previous completion and the next start cannot be less than XXX
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(100);
//env.getCheckpointConfig().setPreferCheckpointForRecovery();
// tolerate checkpoint How many times have you failed
env.getCheckpointConfig().setTolerableCheckpointFailureNumber(0);
// Configuration of restart policy
// Fixed delay restart every other 10 Seconds to restart three times
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3,10000L));
// Failure rate restart stay 10 Count restarts within minutes , stay 10 Try to restart three times in minutes , Every septum 1 Minutes to restart
env.setRestartStrategy(RestartStrategies.failureRateRestart(3, Time.minutes(10),Time.minutes(1)));
DataStreamSource<String> inputStream = env.socketTextStream("localhost", 7777);
DataStream<SensorReading> dataStream = inputStream.map(line -> {
String[] fields = line.split(",");
return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
});
env.execute();
}
}
State consistency
- AT-MOST-ONCE( At most once ) When the mission fails , The easiest way is to do nothing , Neither restore the lost state , No replay of lost data .At-most-once The meaning of semantics is to handle an event at most once .
- AT-LEAST-ONCE( At least once ) In most real application scenarios , We hope not to lose Events . This type of protection is called atleast-once, It means that all the events have been dealt with , And some events can be handled multiple times .
- EXACTLY-ONCE( Exactly once ) Just once is the strictest guarantee , It's also the most difficult . Just dealing with semantics once doesn't just mean that no events are lost , It also means for every data , The internal state is only updated once .
@Public
public enum CheckpointingMode {
/**
* Sets the checkpointing mode to "exactly once". This mode means that the system will
* checkpoint the operator and user function state in such a way that, upon recovery,
* every record will be reflected exactly once in the operator state.
*
* <p>For example, if a user function counts the number of elements in a stream,
* this number will consistently be equal to the number of actual elements in the stream,
* regardless of failures and recovery.</p>
*
* <p>Note that this does not mean that each record flows through the streaming data flow
* only once. It means that upon recovery, the state of operators/functions is restored such
* that the resumed data streams pick up exactly at after the last modification to the state.</p>
*
* <p>Note that this mode does not guarantee exactly-once behavior in the interaction with
* external systems (only state in Flink's operators and user functions). The reason for that
* is that a certain level of "collaboration" is required between two systems to achieve
* exactly-once guarantees. However, for certain systems, connectors can be written that facilitate
* this collaboration.</p>
*
* <p>This mode sustains high throughput. Depending on the data flow graph and operations,
* this mode may increase the record latency, because operators need to align their input
* streams, in order to create a consistent snapshot point. The latency increase for simple
* dataflows (no repartitioning) is negligible. For simple dataflows with repartitioning, the average
* latency remains small, but the slowest records typically have an increased latency.</p>
*/
EXACTLY_ONCE,
/**
* Sets the checkpointing mode to "at least once". This mode means that the system will
* checkpoint the operator and user function state in a simpler way. Upon failure and recovery,
* some records may be reflected multiple times in the operator state.
*
* <p>For example, if a user function counts the number of elements in a stream,
* this number will equal to, or larger, than the actual number of elements in the stream,
* in the presence of failure and recovery.</p>
*
* <p>This mode has minimal impact on latency and may be preferable in very-low latency
* scenarios, where a sustained very-low latency (such as few milliseconds) is needed,
* and where occasional duplicate messages (on recovery) do not matter.</p>
*/
AT_LEAST_ONCE
}
- flink Internal assurance : rely on checkpoint
- source End : An external source is required to reset the read location of the data
- sink End : Need to ensure recovery from failure , Data is not repeatedly written to external systems . Two ways : Idempotent writing 、 Transaction write .
边栏推荐
猜你喜欢
[paper reading] ckan: collaborative knowledge aware autonomous network for adviser systems
【DNS】“Can‘t resolve host“ as non-root user, but works fine as root
Solution of ellipsis when pytorch outputs tensor (output tensor completely)
AD20 制作 Logo
In the year of "mutual entanglement" of mobile phone manufacturers, the "machine sea tactics" failed, and the "slow pace" playing method rose
Secteur non technique, comment participer à devops?
AtCoder Beginner Contest 254「E bfs」「F st表维护差分数组gcd」
[observation] with the rise of the "independent station" model of cross-border e-commerce, how to seize the next dividend explosion era?
Who is the "conscience" domestic brand?
WorkManager學習一
随机推荐
请问postgresql cdc 怎么设置单独的增量模式呀,debezium.snapshot.mo
非技術部門,如何參與 DevOps?
App各大应用商店/应用市场网址汇总
Learning Note 6 - satellite positioning technology (Part 1)
Activity enter exit animation
爬虫(9) - Scrapy框架(1) | Scrapy 异步网络爬虫框架
2022年T电梯修理操作证考试题及答案
【JS】提取字符串中的分数,汇总后算出平均分,并与每个分数比较,输出
SAP UI5 ObjectPageLayout 控件使用方法分享
赛克瑞浦动力电池首台产品正式下线
在C# 中实现上升沿,并模仿PLC环境验证 If 语句使用上升沿和不使用上升沿的不同
How to write high-quality code?
中职组网络安全C模块全漏洞脚本讲解包含4个漏洞的脚本
请问大佬们 有遇到过flink cdc mongdb 执行flinksql 遇到这样的问题的么?
变量///
Go-3-第一个Go程序
AtCoder Beginner Contest 258「ABCDEFG」
2022年流动式起重机司机考试题库及模拟考试
Atcoder beginer contest 254 "e BFS" f st table maintenance differential array GCD "
pytorch输出tensor张量时有省略号的解决方案(将tensor完整输出)