当前位置:网站首页>谈谈对Flink框架中容错机制及状态的一致性的理解
谈谈对Flink框架中容错机制及状态的一致性的理解
2022-07-05 10:25:00 【InfoQ】
容错机制
- 重启应用
- 从 checkpoint 中读取状态,将状态重置
- 开始消费并处理检查点到发生故障之间的所有数据
public class Test{
public static void main(String[] args) throws Exception{
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// 检查点的配置
// 开启检查点周期为毫秒
env.enableCheckpointing(300);
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// 设置超时时间
env.getCheckpointConfig().setCheckpointTimeout(60000);
// 设置最大执行的checkp个数
env.getCheckpointConfig().setMaxConcurrentCheckpoints(2);
// 设置最小的间歇时间,两个checkpoint之间,前一次完成到后一次开始之间的间歇不能小于XXX
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(100);
//env.getCheckpointConfig().setPreferCheckpointForRecovery();
// 容忍checkpoint失败多少次
env.getCheckpointConfig().setTolerableCheckpointFailureNumber(0);
// 重启策略的配置
// 固定延时重启 每隔10秒进行一次重启尝试重启三次
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3,10000L));
// 失败率重启 在10分钟内统计重启,在10分钟内尝试三次重启,每次隔1分钟重启一次
env.setRestartStrategy(RestartStrategies.failureRateRestart(3, Time.minutes(10),Time.minutes(1)));
DataStreamSource<String> inputStream = env.socketTextStream("localhost", 7777);
DataStream<SensorReading> dataStream = inputStream.map(line -> {
String[] fields = line.split(",");
return new SensorReading(fields[0], new Long(fields[1]), new Double(fields[2]));
});
env.execute();
}
}
状态一致性
- AT-MOST-ONCE(最多一次)当任务故障时,最简单的做法是什么都不干,既不恢复丢失的状态,也不重播丢失的数据。At-most-once 语义的含义是最多处理一次事件。
- AT-LEAST-ONCE(至少一次)在大多数的真实应用场景,我们希望不丢失事件。这种类型的保障称为 atleast-once,意思是所有的事件都得到了处理,而一些事件还可能被处理多次。
- EXACTLY-ONCE(精确一次)恰好处理一次是最严格的保证,也是最难实现的。恰好处理一次语义不仅仅意味着没有事件丢失,还意味着针对每一个数据,内部状态仅仅更新一次。
@Public
public enum CheckpointingMode {
/**
* Sets the checkpointing mode to "exactly once". This mode means that the system will
* checkpoint the operator and user function state in such a way that, upon recovery,
* every record will be reflected exactly once in the operator state.
*
* <p>For example, if a user function counts the number of elements in a stream,
* this number will consistently be equal to the number of actual elements in the stream,
* regardless of failures and recovery.</p>
*
* <p>Note that this does not mean that each record flows through the streaming data flow
* only once. It means that upon recovery, the state of operators/functions is restored such
* that the resumed data streams pick up exactly at after the last modification to the state.</p>
*
* <p>Note that this mode does not guarantee exactly-once behavior in the interaction with
* external systems (only state in Flink's operators and user functions). The reason for that
* is that a certain level of "collaboration" is required between two systems to achieve
* exactly-once guarantees. However, for certain systems, connectors can be written that facilitate
* this collaboration.</p>
*
* <p>This mode sustains high throughput. Depending on the data flow graph and operations,
* this mode may increase the record latency, because operators need to align their input
* streams, in order to create a consistent snapshot point. The latency increase for simple
* dataflows (no repartitioning) is negligible. For simple dataflows with repartitioning, the average
* latency remains small, but the slowest records typically have an increased latency.</p>
*/
EXACTLY_ONCE,
/**
* Sets the checkpointing mode to "at least once". This mode means that the system will
* checkpoint the operator and user function state in a simpler way. Upon failure and recovery,
* some records may be reflected multiple times in the operator state.
*
* <p>For example, if a user function counts the number of elements in a stream,
* this number will equal to, or larger, than the actual number of elements in the stream,
* in the presence of failure and recovery.</p>
*
* <p>This mode has minimal impact on latency and may be preferable in very-low latency
* scenarios, where a sustained very-low latency (such as few milliseconds) is needed,
* and where occasional duplicate messages (on recovery) do not matter.</p>
*/
AT_LEAST_ONCE
}
- flink内部保证:依赖 checkpoint
- source 端:需要外部源可重设数据的读取位置
- sink 端:需要保证从故障恢复时,数据不会重复写入外部系统。两种方式:幂等写入、事务写入。
边栏推荐
- How can non-technical departments participate in Devops?
- Learning note 4 -- Key Technologies of high-precision map (Part 2)
- Golang应用专题 - channel
- What are the top ten securities companies? Is it safe to open an account online?
- 重磅:国产IDE发布,由阿里研发,完全开源!
- Blockbuster: the domestic IDE is released, developed by Alibaba, and is completely open source!
- WorkManager的学习二
- Learning II of workmanager
- LSTM应用于MNIST数据集分类(与CNN做对比)
- 爬虫(9) - Scrapy框架(1) | Scrapy 异步网络爬虫框架
猜你喜欢
Universal double button or single button pop-up
2022年T电梯修理操作证考试题及答案
[observation] with the rise of the "independent station" model of cross-border e-commerce, how to seize the next dividend explosion era?
Events and bubbles in the applet of "wechat applet - Basics"
[paper reading] ckan: collaborative knowledge aware autonomous network for adviser systems
Solution of ellipsis when pytorch outputs tensor (output tensor completely)
What is the origin of the domain knowledge network that drives the new idea of manufacturing industry upgrading?
Pseudo class elements -- before and after
Have you learned to make money in Dingding, enterprise micro and Feishu?
非技术部门,如何参与 DevOps?
随机推荐
Workmanager Learning one
WorkManager的学习二
非技術部門,如何參與 DevOps?
【tcp】服务器上tcp连接状态json形式输出
Learning note 4 -- Key Technologies of high-precision map (Part 2)
web安全
数组、、、
Coneroller执行时候的-26374及-26377错误
SLAM 01.人类识别环境&路径的模型建立
DOM//
【Vite】1371- 手把手开发 Vite 插件
How does redis implement multiple zones?
2022年危险化学品生产单位安全生产管理人员特种作业证考试题库模拟考试平台操作
想请教一下,十大券商有哪些?在线开户是安全么?
手机厂商“互卷”之年:“机海战术”失灵,“慢节奏”打法崛起
5G NR系统架构
Nuxt//
AtCoder Beginner Contest 258「ABCDEFG」
【js学习笔记五十四】BFC方式
沟通的艺术III:看人之间 之倾听