当前位置:网站首页>谈谈对Flink框架中容错机制及状态的一致性的理解
谈谈对Flink框架中容错机制及状态的一致性的理解
2022-07-05 10:25:00 【InfoQ】
容错机制
- 重启应用
- 从 checkpoint 中读取状态,将状态重置
- 开始消费并处理检查点到发生故障之间的所有数据
public class Test{
public static void main(String[] args) throws Exception{
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// 检查点的配置
// 开启检查点周期为毫秒
env.enableCheckpointing(300);
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// 设置超时时间
env.getCheckpointConfig().setCheckpointTimeout(60000);
// 设置最大执行的checkp个数
env.getCheckpointConfig().setMaxConcurrentCheckpoints(2);
// 设置最小的间歇时间,两个checkpoint之间,前一次完成到后一次开始之间的间歇不能小于XXX
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(100);
//env.getCheckpointConfig().setPreferCheckpointForRecovery();
// 容忍checkpoint失败多少次
env.getCheckpointConfig().setTolerableCheckpointFailureNumber(0);
// 重启策略的配置
// 固定延时重启 每隔10秒进行一次重启尝试重启三次
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3,10000L));
// 失败率重启 在10分钟内统计重启,在10分钟内尝试三次重启,每次隔1分钟重启一次
env.setRestartStrategy(RestartStrategies.failureRateRestart(3, Time.minutes(10),Time.minutes(1)));
DataStreamSource<String> inputStream = env.socketTextStream("localhost", 7777);
DataStream<SensorReading> dataStream = inputStream.map(line -> {
String[] fields = line.split(",");
return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
});
env.execute();
}
}
状态一致性
- AT-MOST-ONCE(最多一次)当任务故障时,最简单的做法是什么都不干,既不恢复丢失的状态,也不重播丢失的数据。At-most-once 语义的含义是最多处理一次事件。
- AT-LEAST-ONCE(至少一次)在大多数的真实应用场景,我们希望不丢失事件。这种类型的保障称为 atleast-once,意思是所有的事件都得到了处理,而一些事件还可能被处理多次。
- EXACTLY-ONCE(精确一次)恰好处理一次是最严格的保证,也是最难实现的。恰好处理一次语义不仅仅意味着没有事件丢失,还意味着针对每一个数据,内部状态仅仅更新一次。
@Public
public enum CheckpointingMode {
/**
* Sets the checkpointing mode to "exactly once". This mode means that the system will
* checkpoint the operator and user function state in such a way that, upon recovery,
* every record will be reflected exactly once in the operator state.
*
* <p>For example, if a user function counts the number of elements in a stream,
* this number will consistently be equal to the number of actual elements in the stream,
* regardless of failures and recovery.</p>
*
* <p>Note that this does not mean that each record flows through the streaming data flow
* only once. It means that upon recovery, the state of operators/functions is restored such
* that the resumed data streams pick up exactly at after the last modification to the state.</p>
*
* <p>Note that this mode does not guarantee exactly-once behavior in the interaction with
* external systems (only state in Flink's operators and user functions). The reason for that
* is that a certain level of "collaboration" is required between two systems to achieve
* exactly-once guarantees. However, for certain systems, connectors can be written that facilitate
* this collaboration.</p>
*
* <p>This mode sustains high throughput. Depending on the data flow graph and operations,
* this mode may increase the record latency, because operators need to align their input
* streams, in order to create a consistent snapshot point. The latency increase for simple
* dataflows (no repartitioning) is negligible. For simple dataflows with repartitioning, the average
* latency remains small, but the slowest records typically have an increased latency.</p>
*/
EXACTLY_ONCE,
/**
* Sets the checkpointing mode to "at least once". This mode means that the system will
* checkpoint the operator and user function state in a simpler way. Upon failure and recovery,
* some records may be reflected multiple times in the operator state.
*
* <p>For example, if a user function counts the number of elements in a stream,
* this number will equal to, or larger, than the actual number of elements in the stream,
* in the presence of failure and recovery.</p>
*
* <p>This mode has minimal impact on latency and may be preferable in very-low latency
* scenarios, where a sustained very-low latency (such as few milliseconds) is needed,
* and where occasional duplicate messages (on recovery) do not matter.</p>
*/
AT_LEAST_ONCE
}
- flink内部保证:依赖 checkpoint
- source 端:需要外部源可重设数据的读取位置
- sink 端:需要保证从故障恢复时,数据不会重复写入外部系统。两种方式:幂等写入、事务写入。

边栏推荐
- App各大应用商店/应用市场网址汇总
- 《通信软件开发与应用》课程结业报告
- 【JS】数组降维
- 字符串、、
- web安全
- Learning Note 6 - satellite positioning technology (Part 1)
- A large number of virtual anchors in station B were collectively forced to refund: revenue evaporated, but they still owe station B; Jobs was posthumously awarded the U.S. presidential medal of freedo
- SQL Server 监控统计阻塞脚本信息
- SAP ui5 objectpagelayout control usage sharing
- 2022年化工自动化控制仪表考试试题及在线模拟考试
猜你喜欢

5G NR系统架构

风控模型启用前的最后一道工序,80%的童鞋在这都踩坑

LSTM应用于MNIST数据集分类(与CNN做对比)
![C language QQ chat room small project [complete source code]](/img/4e/b3703ac864830d55c824e1b56c8f85.png)
C language QQ chat room small project [complete source code]
![[vite] 1371 - develop vite plug-ins by hand](/img/7f/84bba39965b4116f20b1cf8211f70a.png)
[vite] 1371 - develop vite plug-ins by hand

2022年化工自动化控制仪表考试试题及在线模拟考试

爬虫(9) - Scrapy框架(1) | Scrapy 异步网络爬虫框架

微信核酸检测预约小程序系统毕业设计毕设(7)中期检查报告

Redis如何实现多可用区?

Solution of ellipsis when pytorch outputs tensor (output tensor completely)
随机推荐
Redis如何实现多可用区?
Should the dependency given by the official website be Flink SQL connector MySQL CDC, with dependency added
> Could not create task ‘:app:MyTest. main()‘. > SourceSet with name ‘main‘ not found. Problem repair
Flink CDC cannot monitor MySQL logs. Have you ever encountered this problem?
TypeError: Cannot read properties of undefined (reading ‘cancelToken‘)
AtCoder Beginner Contest 254「E bfs」「F st表维护差分数组gcd」
websocket
Blockbuster: the domestic IDE is released, developed by Alibaba, and is completely open source!
How do programmers live as they like?
2022年危险化学品经营单位主要负责人特种作业证考试题库及答案
埋点111
"Everyday Mathematics" serial 58: February 27
分享.NET 轻量级的ORM
Solution to the length of flex4 and Flex3 combox drop-down box
MFC宠物商店信息管理系统
C#实现获取DevExpress中GridView表格进行过滤或排序后的数据
【SWT组件】内容滚动组件 ScrolledComposite
Learning notes 5 - high precision map solution
“军备竞赛”时期的对比学习
[论文阅读] CKAN: Collaborative Knowledge-aware Atentive Network for Recommender Systems