Flink CheckPoint : Exceeded checkpoint tolerable failure threshold

2022-06-12 08:53:00 //Continuous margin_ documentary

I. Problem description

The job failed with the error "Exceeded checkpoint tolerable failure threshold".

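For context, the error message refers to a failure counter. As far as I understand Flink's behavior, consecutive checkpoint failures (timeouts included) are counted, a successful checkpoint resets the count, and the job fails once the count exceeds the value set through `CheckpointConfig.setTolerableCheckpointFailureNumber`. The sketch below is a simplified, stdlib-only model of that idea. The class `CheckpointFailureCounter` and its methods are hypothetical names used for illustration, not Flink's implementation.

```java
// Simplified model of a "tolerable checkpoint failure" counter.
// Hypothetical class for illustration only; this is not Flink code.
final class CheckpointFailureCounter {
    private final int tolerableFailures; // consecutive failures that are tolerated
    private int consecutiveFailures = 0;

    CheckpointFailureCounter(int tolerableFailures) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
    }

    /** A completed checkpoint resets the consecutive-failure count. */
    void onCheckpointSuccess() {
        consecutiveFailures = 0;
    }

    /**
     * Records a failed (for example, timed-out) checkpoint.
     * Returns true once the threshold is exceeded, i.e. when the job should fail.
     */
    boolean onCheckpointFailure() {
        consecutiveFailures++;
        return consecutiveFailures > tolerableFailures;
    }

    public static void main(String[] args) {
        CheckpointFailureCounter counter = new CheckpointFailureCounter(0);
        // With a tolerance of 0, the first failure already exceeds the threshold.
        System.out.println("job should fail: " + counter.onCheckpointFailure());
    }
}
```

Under this model, a tolerance of 0 means the first timed-out checkpoint is enough to fail the job, which matches what I saw. Raising the tolerance would only hide the symptom; it does not explain why the checkpoint never completes.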

II. Resolution steps

1. Check the checkpoint settings

The checkpoints were obviously timing out, so my first instinct was to check the checkpoint settings.
The settings in the code were as follows:

        // Requires org.apache.flink.streaming.api.CheckpointingMode and
        // org.apache.flink.streaming.api.environment.CheckpointConfig;
        // env is the job's StreamExecutionEnvironment.

        // Start a checkpoint every 10 seconds
        env.enableCheckpointing(10 * 1000);
        // Use at-least-once mode (Flink's default is EXACTLY_ONCE)
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE);
        // Leave at least 500 ms between the end of one checkpoint and the start of the next
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
        // A checkpoint must complete within one minute, otherwise it is discarded
        env.getCheckpointConfig().setCheckpointTimeout(60000);
        // Allow only one checkpoint to be in progress at a time
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
        // Retain externalized checkpoints after the job is cancelled
        env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        // Recover from the latest checkpoint even if a more recent savepoint exists
        env.getCheckpointConfig().setPreferCheckpointForRecovery(true);

I tried raising the timeout from 1 minute to 10 minutes, then repackaged and redeployed the job.
In the web UI the checkpoint still did not complete: its status stayed at IN_PROGRESS with no progress. The only difference was that the wait grew from 1 minute to 10 minutes before the job finally failed.
At this point I suspected that the problem was not the checkpoint settings but a bug in the program itself, such as resources not being released, which made the job stall until the checkpoint timed out.

2. Check the processing logic

The data channel turned out to be blocked. After printing the data, I found that the task uses async I/O to query data from HBase; when a key did not exist, the dimension lookup timed out, which caused the checkpoint to fail.
The records whose dimension lookup timed out were printed for inspection.
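To illustrate how a slow lookup for a missing key behaves, here is a minimal stdlib-only sketch (Java 9+). It is not the job's actual code and does not use Flink's async I/O API. `DimensionLookup`, `blockingLookup`, and `lookupAsync` are hypothetical names, and the in-memory map only stands in for the HBase dimension table. Each lookup is bounded by a timeout and falls back to an empty result instead of waiting indefinitely.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a dimension lookup whose missing-key path is slow,
// wrapped in an async call with a bounded wait.
final class DimensionLookup {
    private final Map<String, String> dimensionTable; // stands in for the HBase table
    private final long slowLookupMillis;              // simulated latency for a missing key

    DimensionLookup(Map<String, String> dimensionTable, long slowLookupMillis) {
        this.dimensionTable = dimensionTable;
        this.slowLookupMillis = slowLookupMillis;
    }

    /** Simulates the lookup: existing keys return at once, missing keys are slow. */
    private Optional<String> blockingLookup(String key) {
        String value = dimensionTable.get(key);
        if (value == null) {
            try {
                Thread.sleep(slowLookupMillis); // simulate a long-running scan
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return Optional.ofNullable(value);
    }

    /**
     * Runs the lookup asynchronously. If it has not finished within
     * timeoutMillis, the future completes with an empty result instead.
     * Note: the background lookup itself keeps running until it returns.
     */
    CompletableFuture<Optional<String>> lookupAsync(String key, long timeoutMillis) {
        return CompletableFuture
                .supplyAsync(() -> blockingLookup(key))
                .completeOnTimeout(Optional.empty(), timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) {
        DimensionLookup lookup = new DimensionLookup(Map.of("user-1", "Alice"), 2000L);
        System.out.println(lookup.lookupAsync("user-1", 1000L).join());
        System.out.println(lookup.lookupAsync("missing", 100L).join());
    }
}
```

A bounded wait like this only keeps the caller from stalling; the slow query still consumes resources in the background, so the lookup itself also has to be made fast.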

3. Reviewing the problem

Root cause: HBase scan performs poorly, so the dimension-data query timed out and the checkpoint could not be created.
Normally a dimension query for a key with no matching data does not time out; it simply returns an empty result. A scan, however, can end up reading the whole table and take a long time, so the lookup should use get for an exact query by rowkey instead.

III. Solution

HBase offers only two ways to query data:
Fetch a single record by its rowkey: the get method.
Fetch a batch of records matching given conditions: the scan method.
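The difference between the two access paths can be sketched with a toy in-memory model. This is not the HBase client API: `ToyDimensionTable` and its methods are hypothetical, and a sorted map merely stands in for a table ordered by rowkey. The point is that a lookup by rowkey answers a missing key immediately, while an unbounded scan has to walk every row before it can report that nothing matched.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy stand-in for a table sorted by row key; hypothetical, not HBase code.
final class ToyDimensionTable {
    private final NavigableMap<String, String> rows = new TreeMap<>();
    private int lastRowsExamined = 0; // rows read by the most recent query

    void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    /** Point lookup by row key (the "get" path); returns null if the key is absent. */
    String get(String rowKey) {
        String value = rows.get(rowKey);
        lastRowsExamined = (value == null) ? 0 : 1;
        return value;
    }

    /** Unbounded scan that checks every row in key order (the "scan" path). */
    String scanForKey(String rowKey) {
        lastRowsExamined = 0;
        for (Map.Entry<String, String> row : rows.entrySet()) {
            lastRowsExamined++;
            if (row.getKey().equals(rowKey)) {
                return row.getValue();
            }
        }
        return null; // every row was read before concluding the key is absent
    }

    int lastRowsExamined() {
        return lastRowsExamined;
    }

    public static void main(String[] args) {
        ToyDimensionTable table = new ToyDimensionTable();
        for (int i = 0; i < 1000; i++) {
            table.put("row-" + i, "value-" + i);
        }
        table.get("missing");
        System.out.println("get examined " + table.lastRowsExamined() + " rows");
        table.scanForKey("missing");
        System.out.println("scan examined " + table.lastRowsExamined() + " rows");
    }
}
```

In this model the cost of a missing key under scan grows with the table size, which is consistent with the timeouts described above; the real HBase costs depend on the table layout and the scan's start and stop keys.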
