Prometheus alerting flow and related time parameters
2022-06-27 09:22:00 【upupfeng】
Overview
When monitoring with Prometheus, an alert passes through several stages between the moment the underlying event occurs and the moment a notification is received. Understanding these stages and the timing parameters that control them makes it possible to receive alerts more promptly and efficiently.
This article describes the Prometheus alert lifecycle, the relevant configuration parameters, and a worked example.
Prometheus alert lifecycle
- Prometheus scrapes metric data at a regular interval.
- Prometheus evaluates the alerting rules against the metrics at a regular interval.
- When a rule's condition is met, the alert enters the pending state. Once the condition has held for at least the duration specified by for, the alert changes to firing and is sent to Alertmanager.
- After receiving the alert, Alertmanager waits for group_wait and then sends the notification for the group. If new alerts arrive for a group that has already been notified, Alertmanager waits for group_interval before sending another notification for that group.
- If the alert persists, Alertmanager resends the notification every repeat_interval.
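The pending-to-firing transition in the steps above can be sketched in Python. This is a minimal illustration, not Prometheus source code: the AlertRule class and its evaluate method are hypothetical names, and the sketch assumes the alert becomes firing at the first evaluation where the condition has held for at least the for duration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    """Hypothetical model of one alerting rule with a `for` duration."""
    for_seconds: float
    active_since: Optional[float] = None  # time the condition first became true

    def evaluate(self, now: float, condition_true: bool) -> str:
        """Return the alert state after one rule evaluation at time `now`."""
        if not condition_true:
            # The condition cleared, so the pending timer is reset.
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


# Evaluations every 30 seconds with for = 60 seconds (illustrative values).
rule = AlertRule(for_seconds=60)
states = [rule.evaluate(t, True) for t in (30, 60, 90)]
print(states)
```

Because the state only changes when a rule is evaluated, the time spent in pending is effectively a whole number of evaluation intervals.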
The following diagram gives an overview of the whole Prometheus alerting flow.
Time-related parameters
| Parameter | Description | Default | Configured in |
|---|---|---|---|
| scrape_interval | Interval at which metric data is scraped | 1 minute | prometheus.yml |
| evaluation_interval | Interval at which rules are evaluated | 1 minute | prometheus.yml |
| for | How long the condition must hold before the alert fires | 0 | Rule configuration |
| group_wait | How long to wait before sending the first notification for a new group, so that alerts belonging to the same group can be sent together | 30 seconds | alertmanager.yml |
| group_interval | How long to wait before sending a notification about new alerts added to a group that has already been notified | 5 minutes | alertmanager.yml |
| repeat_interval | How long to wait before resending a notification that has already been sent, when no new alerts have joined the group | 4 hours | alertmanager.yml |
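To see how these parameters add up, the following Python sketch estimates a rough upper bound on the delay between an event and its first notification. The function name estimated_max_delay is hypothetical, and the estimate is a simplification: it ignores scrape duration, network latency, and delivery time in the notification channel.

```python
import math


def estimated_max_delay(scrape_interval: int, evaluation_interval: int,
                        for_duration: int, group_wait: int) -> int:
    """Rough upper bound, in seconds, from event to first notification."""
    # The event may occur just after a scrape, so up to one full
    # scrape interval can pass before the metric reflects it.
    wait_for_scrape = scrape_interval
    # The scrape may land just after a rule evaluation.
    wait_for_evaluation = evaluation_interval
    # The pending state is only re-checked at evaluation ticks, so the
    # `for` duration is rounded up to a whole number of evaluations.
    pending_time = math.ceil(for_duration / evaluation_interval) * evaluation_interval
    return wait_for_scrape + wait_for_evaluation + pending_time + group_wait


# Defaults from the table: 1 minute, 1 minute, 0, 30 seconds.
default_delay = estimated_max_delay(60, 60, 0, 30)
print(default_delay)
```

Shortening scrape_interval, evaluation_interval, for, or group_wait reduces this bound, usually at the cost of more load or noisier alerts.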
Case study
Monitor whether a Kafka node is down.
Configuration
Metric name: kakfa_up_status
1 = up, 0 = down
# prometheus.yml configuration
global:
  scrape_interval: 20s
  evaluation_interval: 30s
# Rule configuration
- alert: kakfa_down
  expr: kakfa_up_status == 0
  for: 1m
  annotations:
    summary: "Kafka is down"
# alertmanager configuration
route:
  group_by: [alertname]
  group_wait: 60s
  group_interval: 5m
  repeat_interval: 10m
Event flow
10:00:05 Kafka goes down
10:00:20 Metric scraped: kakfa_up_status=0
10:00:30 Rules evaluated; Kafka is found to be down and kakfa_down is set to pending
10:00:30~10:01:30 Metrics continue to be scraped and rules continue to be evaluated
10:01:30 kakfa_down has been pending for 1 minute; it is set to firing and sent to Alertmanager
10:01:30 Alertmanager receives the alert and starts the group_wait period
10:02:30 group_wait elapses; the notification is sent
10:12:30 The alert is still unresolved; the notification is repeated
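The timestamps above can be reproduced with a short Python sketch. The helper names (next_tick, alert_timeline, clock) are hypothetical, and the sketch assumes that scrapes and rule evaluations are aligned to 10:00:00 and that the repeat arrives exactly repeat_interval after the first notification. A real deployment does not guarantee either, so actual timestamps may differ.

```python
import math
from datetime import datetime, timedelta


def next_tick(t: int, interval: int) -> int:
    """First multiple of `interval` at or after t (ticks assumed aligned to t = 0)."""
    return math.ceil(t / interval) * interval


def alert_timeline(failure: int, scrape_interval: int, evaluation_interval: int,
                   for_duration: int, group_wait: int, repeat_interval: int) -> dict:
    """Key moments of one alert, in seconds after the reference time."""
    scraped = next_tick(failure, scrape_interval)        # first scrape that sees the failure
    pending = next_tick(scraped, evaluation_interval)    # first evaluation after that scrape
    firing = next_tick(pending + for_duration, evaluation_interval)
    notified = firing + group_wait                       # first notification
    repeated = notified + repeat_interval                # simplified repeat time
    return {"scraped": scraped, "pending": pending, "firing": firing,
            "notified": notified, "repeated": repeated}


def clock(seconds: int) -> str:
    """Format an offset as wall-clock time, counting from 10:00:00 (date is arbitrary)."""
    start = datetime(2022, 6, 27, 10, 0, 0)
    return (start + timedelta(seconds=seconds)).strftime("%H:%M:%S")


# Values from the case study: failure at 10:00:05, intervals in seconds.
timeline = alert_timeline(failure=5, scrape_interval=20, evaluation_interval=30,
                          for_duration=60, group_wait=60, repeat_interval=600)
for stage, offset in timeline.items():
    print(clock(offset), stage)
```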