Prometheus alerting flow and related timing parameters
2022-06-27 09:22:00 【upupfeng】
Overview
When you monitor with Prometheus, an alert passes through several stages between the triggering event and the delivered notification. Understanding those stages and the timing settings involved helps you receive alerts more promptly and efficiently.
This post covers the Prometheus alert lifecycle, the relevant configuration parameters, and a worked example.
Prometheus alert lifecycle / flow
- Prometheus scrapes metric data at a regular interval
- Prometheus evaluates at a regular interval whether the metrics trigger any alerting rule
- When a rule is triggered, its alert enters the pending state. Once the condition has held for the time specified by for, the alert changes to firing and is sent to Alertmanager
- After receiving the alert, Alertmanager waits for the group wait time and then sends the notification. If the group later receives new alerts, Alertmanager waits for the group interval and then sends another notification for that group
- If the alert persists, Alertmanager resends the notification according to the repeat interval
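The pending-to-firing step can be sketched in Python. This is an illustrative model, not Prometheus's actual code: the function name `alert_states` and the state labels are chosen for this example, and it assumes an alert fires once its condition has held for at least the `for` duration across consecutive rule evaluations.

```python
from datetime import datetime, timedelta


def alert_states(evaluations, for_duration):
    """Map (time, condition_is_true) evaluations to (time, state) pairs.

    Hypothetical helper: states are "inactive", "pending", or "firing".
    """
    states = []
    active_since = None  # time of the first evaluation where the condition held
    for when, condition_is_true in evaluations:
        if not condition_is_true:
            # The condition cleared, so the pending timer starts over.
            active_since = None
            states.append((when, "inactive"))
            continue
        if active_since is None:
            active_since = when
        held_for = when - active_since
        states.append((when, "firing" if held_for >= for_duration else "pending"))
    return states


# Evaluations every 30 seconds; the condition becomes true at the second one.
start = datetime(2022, 6, 27, 10, 0, 0)
samples = [(start + timedelta(seconds=30 * i), i >= 1) for i in range(4)]
for when, state in alert_states(samples, timedelta(minutes=1)):
    print(when.strftime("%H:%M:%S"), state)
```

Because the state is only re-checked at evaluation time, the alert cannot change to firing between two evaluations.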
The figure below gives an overview of the whole Prometheus alerting pipeline and makes the alert flow easy to follow.
Time-related parameters
| Parameter | Description | Default | Configured in |
|---|---|---|---|
| scrape_interval | Interval at which metric data is scraped | 1 minute | prometheus.yml |
| evaluation_interval | Interval at which rules are evaluated | 1 minute | prometheus.yml |
| for | How long the abnormal condition must last before the alert is sent | 0 | Rule configuration |
| group_wait | Group wait time. How long to wait before sending the first notification for a group, so that alerts in the same group are sent together | 30 seconds | alertmanager.yml |
| group_interval | Interval between two consecutive notifications for the same group. After the first notification, Alertmanager waits group_interval before notifying about new alerts in that group | 5 minutes | alertmanager.yml |
| repeat_interval | Resend interval. How long to wait before resending a notification that has already been sent when the group has no new alerts | 4 hours | alertmanager.yml |
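Taken together, these parameters bound how long it can take for the first notification to go out. The function below is a rough estimate, not an official formula. The name `worst_case_delay_seconds` is made up for this sketch. It assumes the event happens just after a scrape, the scrape lands just after an evaluation, `for` is only checked at evaluation time, and network and processing delays are ignored.

```python
import math


def worst_case_delay_seconds(scrape_interval, evaluation_interval, for_duration, group_wait):
    """Rough upper bound, in seconds, from the event to the first notification.

    Hypothetical helper; all arguments are whole seconds.
    """
    # `for` is only checked when the rule is evaluated, so round it up to a
    # whole number of evaluation intervals.
    for_ticks = math.ceil(for_duration / evaluation_interval)
    return (
        scrape_interval                      # wait for the next scrape
        + evaluation_interval                # wait for the next rule evaluation
        + for_ticks * evaluation_interval    # time spent in pending
        + group_wait                         # Alertmanager's initial group wait
    )


# With the default values from the table: 60 + 60 + 0 + 30 seconds.
default_delay = worst_case_delay_seconds(60, 60, 0, 30)
print(default_delay)
```

Lowering any of the four values shortens the bound, at the cost of more load or of noisier, less well grouped notifications.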
Case study
Monitor whether a Kafka node is down.
Configuration
Metric name: kakfa_up_status
1 = up, 0 = down
# prometheus.yml configuration
global:
  scrape_interval: 20s
  evaluation_interval: 30s
# Rule configuration
- alert: kakfa_down
  expr: kakfa_up_status == 0
  for: 1m
  annotations:
    summary: "Kafka is down"
# alertmanager configuration
route:
  group_by: [alertname]
  group_wait: 60s
  group_interval: 5m
  repeat_interval: 10m
Event flow
10:00:05 Kafka goes down
10:00:20 Prometheus scrapes the metric: kakfa_up_status=0
10:00:30 The rule is evaluated, Kafka is found to be down, and kakfa_down is set to pending
10:00:30~10:01:30 Prometheus keeps scraping the metric and evaluating the rule
10:01:30 kakfa_down has been pending for 1 minute, so it is set to firing and sent to Alertmanager
10:01:30 Alertmanager receives the alert and starts the group wait
10:02:30 The group wait is over and the notification is sent
10:12:30 The alert is still unresolved, so the notification is sent again
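The timeline above can be reproduced with a short Python sketch. It is a simplified model with made-up helper names (`next_tick`, `alert_timeline`). It assumes scrapes and rule evaluations both run on fixed schedules starting at 10:00:00, that an evaluation sees a sample scraped at the same instant or earlier, and that the repeat is sent exactly repeat_interval after the first notification.

```python
import math
from datetime import datetime, timedelta


def next_tick(after, origin, interval):
    """First time on the schedule origin + k * interval that is at or after `after`."""
    steps = math.ceil((after - origin) / interval)
    return origin + max(0, steps) * interval


def alert_timeline(down_at, origin, scrape, evaluate, for_duration, group_wait, repeat):
    """Hypothetical helper: key timestamps for a single alert in a single group."""
    scraped = next_tick(down_at, origin, scrape)               # metric first shows 0
    pending = next_tick(scraped, origin, evaluate)             # rule first matches
    firing = next_tick(pending + for_duration, origin, evaluate)
    notified = firing + group_wait                             # first notification
    repeated = notified + repeat                               # resend while unresolved
    stamps = {"scraped": scraped, "pending": pending, "firing": firing,
              "notified": notified, "repeated": repeated}
    return {name: t.strftime("%H:%M:%S") for name, t in stamps.items()}


origin = datetime(2022, 6, 27, 10, 0, 0)
timeline = alert_timeline(
    down_at=origin + timedelta(seconds=5),
    origin=origin,
    scrape=timedelta(seconds=20),
    evaluate=timedelta(seconds=30),
    for_duration=timedelta(minutes=1),
    group_wait=timedelta(seconds=60),
    repeat=timedelta(minutes=10),
)
for name, stamp in timeline.items():
    print(f"{stamp} {name}")
```

Changing `down_at` or the intervals shows how each setting shifts the moment the first notification goes out.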
References
Prometheus alerting mechanism (why alerts are not sent in time) https://blog.csdn.net/luo4105/article/details/123700003
How soon can I receive a Prometheus alert? https://www.jianshu.com/p/b3b4e68409e0
Prometheus alerting: group_wait & repeat_interval https://blog.csdn.net/tryyourbest0928/article/details/115337984