当前位置:网站首页>谈谈对Flink框架中容错机制及状态的一致性的理解
谈谈对Flink框架中容错机制及状态的一致性的理解
2022-07-05 10:25:00 【InfoQ】
容错机制
- 重启应用
- 从 checkpoint 中读取状态,将状态重置
- 开始消费并处理检查点到发生故障之间的所有数据
public class Test{
public static void main(String[] args) throws Exception{
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// 检查点的配置
// 开启检查点周期为毫秒
env.enableCheckpointing(300);
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// 设置超时时间
env.getCheckpointConfig().setCheckpointTimeout(60000);
// 设置最大执行的checkp个数
env.getCheckpointConfig().setMaxConcurrentCheckpoints(2);
// 设置最小的间歇时间,两个checkpoint之间,前一次完成到后一次开始之间的间歇不能小于XXX
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(100);
//env.getCheckpointConfig().setPreferCheckpointForRecovery();
// 容忍checkpoint失败多少次
env.getCheckpointConfig().setTolerableCheckpointFailureNumber(0);
// 重启策略的配置
// 固定延时重启 每隔10秒进行一次重启尝试重启三次
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3,10000L));
// 失败率重启 在10分钟内统计重启,在10分钟内尝试三次重启,每次隔1分钟重启一次
env.setRestartStrategy(RestartStrategies.failureRateRestart(3, Time.minutes(10),Time.minutes(1)));
DataStreamSource<String> inputStream = env.socketTextStream("localhost", 7777);
DataStream<SensorReading> dataStream = inputStream.map(line -> {
String[] fields = line.split(",");
return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
});
env.execute();
}
}
状态一致性
- AT-MOST-ONCE(最多一次)当任务故障时,最简单的做法是什么都不干,既不恢复丢失的状态,也不重播丢失的数据。At-most-once 语义的含义是最多处理一次事件。
- AT-LEAST-ONCE(至少一次)在大多数的真实应用场景,我们希望不丢失事件。这种类型的保障称为 atleast-once,意思是所有的事件都得到了处理,而一些事件还可能被处理多次。
- EXACTLY-ONCE(精确一次)恰好处理一次是最严格的保证,也是最难实现的。恰好处理一次语义不仅仅意味着没有事件丢失,还意味着针对每一个数据,内部状态仅仅更新一次。
@Public
public enum CheckpointingMode {
/**
* Sets the checkpointing mode to "exactly once". This mode means that the system will
* checkpoint the operator and user function state in such a way that, upon recovery,
* every record will be reflected exactly once in the operator state.
*
* <p>For example, if a user function counts the number of elements in a stream,
* this number will consistently be equal to the number of actual elements in the stream,
* regardless of failures and recovery.</p>
*
* <p>Note that this does not mean that each record flows through the streaming data flow
* only once. It means that upon recovery, the state of operators/functions is restored such
* that the resumed data streams pick up exactly at after the last modification to the state.</p>
*
* <p>Note that this mode does not guarantee exactly-once behavior in the interaction with
* external systems (only state in Flink's operators and user functions). The reason for that
* is that a certain level of "collaboration" is required between two systems to achieve
* exactly-once guarantees. However, for certain systems, connectors can be written that facilitate
* this collaboration.</p>
*
* <p>This mode sustains high throughput. Depending on the data flow graph and operations,
* this mode may increase the record latency, because operators need to align their input
* streams, in order to create a consistent snapshot point. The latency increase for simple
* dataflows (no repartitioning) is negligible. For simple dataflows with repartitioning, the average
* latency remains small, but the slowest records typically have an increased latency.</p>
*/
EXACTLY_ONCE,
/**
* Sets the checkpointing mode to "at least once". This mode means that the system will
* checkpoint the operator and user function state in a simpler way. Upon failure and recovery,
* some records may be reflected multiple times in the operator state.
*
* <p>For example, if a user function counts the number of elements in a stream,
* this number will equal to, or larger, than the actual number of elements in the stream,
* in the presence of failure and recovery.</p>
*
* <p>This mode has minimal impact on latency and may be preferable in very-low latency
* scenarios, where a sustained very-low latency (such as few milliseconds) is needed,
* and where occasional duplicate messages (on recovery) do not matter.</p>
*/
AT_LEAST_ONCE
}
- flink内部保证:依赖 checkpoint
- source 端:需要外部源可重设数据的读取位置
- sink 端:需要保证从故障恢复时,数据不会重复写入外部系统。两种方式:幂等写入、事务写入。

边栏推荐
- 伪类元素--before和after
- 沟通的艺术III:看人之间 之倾听
- TypeError: Cannot read properties of undefined (reading ‘cancelToken‘)
- Timed disappearance pop-up
- C function returns multiple value methods
- Learning note 4 -- Key Technologies of high-precision map (Part 2)
- [dark horse morning post] Luo Yonghao responded to ridicule Oriental selection; Dong Qing's husband Mi Chunlei was executed for more than 700million; Geely officially acquired Meizu; Huawei releases M
- TSQL–标示列、GUID 、序列
- Golang应用专题 - channel
- Completion report of communication software development and Application
猜你喜欢

SAP UI5 ObjectPageLayout 控件使用方法分享

AtCoder Beginner Contest 258「ABCDEFG」
![[vite] 1371 - develop vite plug-ins by hand](/img/7f/84bba39965b4116f20b1cf8211f70a.png)
[vite] 1371 - develop vite plug-ins by hand

2022年危险化学品生产单位安全生产管理人员特种作业证考试题库模拟考试平台操作

Learning II of workmanager

2022年T电梯修理操作证考试题及答案

微信核酸检测预约小程序系统毕业设计毕设(7)中期检查报告

How to plan the career of a programmer?

WorkManager學習一

@Serializedname annotation use
随机推荐
Events and bubbles in the applet of "wechat applet - Basics"
Web Components
想请教一下,十大券商有哪些?在线开户是安全么?
埋点111
Secteur non technique, comment participer à devops?
一个可以兼容各种数据库事务的使用范例
微信小程序中,从一个页面跳转到另一个页面后,在返回后发现页面同步滚动了
Learning Note 6 - satellite positioning technology (Part 1)
pytorch输出tensor张量时有省略号的解决方案(将tensor完整输出)
九度 1480:最大上升子序列和(动态规划思想求最值)
2022年危险化学品生产单位安全生产管理人员特种作业证考试题库模拟考试平台操作
What is the most suitable book for programmers to engage in open source?
How did automated specification inspection software develop?
Learning note 4 -- Key Technologies of high-precision map (Part 2)
QT implements JSON parsing
Learning II of workmanager
beego跨域问题解决方案-亲试成功
2022年化工自动化控制仪表考试试题及在线模拟考试
C语言实现QQ聊天室小项目 [完整源码]
flink cdc不能监听mysql日志,大家遇到过这个问题吧?