Prometheus alerting flow and related time parameters
2022-06-27 09:22:00 【upupfeng】
Overview
When monitoring with Prometheus, an alert passes through several stages between the moment the underlying event occurs and the moment the notification arrives. Understanding those stages and the timing settings that control them makes it possible to receive alerts more promptly and efficiently.
This post covers the Prometheus alert lifecycle, the relevant configuration parameters, and a worked example.
Prometheus alert lifecycle / flow
- Prometheus scrapes metric data at a regular interval.
- Prometheus evaluates the alerting rules against the metrics at a regular interval.
- When a rule's condition is met, the alert enters the pending state. Once the condition has held for the duration specified by for, the alert changes to firing and is sent to Alertmanager.
- After receiving the alert, Alertmanager waits for the group wait time and then sends a notification. If new alerts arrive for the same group, it waits for the group interval and then sends another notification for the group.
- If the alert persists, Alertmanager resends the notification at the repeat interval.
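The pending-to-firing step can be sketched in a few lines of Python. This is only an illustrative model, not Prometheus code: the `AlertState` class and its `evaluate` method are hypothetical names, and time is passed in as plain seconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Hypothetical model of one alert moving inactive -> pending -> firing."""
    for_seconds: float
    active_since: Optional[float] = None  # time the expression first matched
    state: str = "inactive"

    def evaluate(self, now: float, expr_matches: bool) -> str:
        """Run one rule evaluation at time `now` (in seconds)."""
        if not expr_matches:
            # Condition cleared: the alert goes back to inactive.
            self.active_since = None
            self.state = "inactive"
        else:
            if self.active_since is None:
                self.active_since = now
            # The alert fires once the condition has held for at least `for`.
            if now - self.active_since >= self.for_seconds:
                self.state = "firing"
            else:
                self.state = "pending"
        return self.state


# Three evaluations, 30 seconds apart, with for: 1m
alert = AlertState(for_seconds=60)
states = [alert.evaluate(t, True) for t in (0, 30, 60)]
print(states)  # ['pending', 'pending', 'firing']
```

With for_seconds=0 the same model goes straight to firing on the first matching evaluation, which corresponds to the default of 0 for the for setting.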
The following diagram gives a panoramic view of the whole Prometheus alerting flow.
Time-related parameters
| Parameter | Description | Default | Configured in |
|---|---|---|---|
| scrape_interval | Interval between metric scrapes | 1 minute | prometheus.yml |
| evaluation_interval | Interval between rule evaluations | 1 minute | prometheus.yml |
| for | How long the condition must hold before the alert fires | 0 | Rule configuration |
| group_wait | Group wait time: how long to wait before sending the first notification for a new group, so that alerts in the same group go out together | 30 seconds | alertmanager.yml |
| group_interval | Interval between notifications for the same group: after the first notification, Alertmanager waits group_interval before notifying about new alerts added to the group | 5 minutes | alertmanager.yml |
| repeat_interval | Repeat interval: how long to wait before re-sending a notification that has already been sent when no new alerts have arrived | 4 hours | alertmanager.yml |
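To see how the three Alertmanager settings interact, here is a rough Python model of when notifications go out for a group whose alert keeps firing and never changes. It is a simplification under stated assumptions, not Alertmanager's implementation: the function name `notification_times` is made up, and the model assumes the group is flushed once after group_wait and then every group_interval, with an unchanged group only re-notified when repeat_interval has elapsed since the last notification.

```python
def notification_times(received_at, group_wait, group_interval, repeat_interval, until):
    """Return the times (in seconds) at which an unchanged, still-firing group is notified.

    Simplified model: the group is flushed after group_wait and then every
    group_interval; a flush without changes only sends a notification if
    repeat_interval has passed since the previous one.
    """
    times = []
    flush = received_at + group_wait
    last_sent = None
    while flush <= until:
        if last_sent is None or flush - last_sent >= repeat_interval:
            times.append(flush)
            last_sent = flush
        flush += group_interval
    return times


# Default values from the table: 30 seconds, 5 minutes, 4 hours; observed for 9 hours
defaults = notification_times(0, 30, 300, 4 * 3600, 9 * 3600)
print(defaults)  # [30, 14430, 28830]
```

In this model a repeat_interval that is not a multiple of group_interval is effectively rounded up to the next group flush, which is one reason to pick a repeat_interval that is a multiple of group_interval.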
Case study
Monitor whether a Kafka node is down.
Configuration
Metric name: kakfa_up_status
1 = up, 0 = down
# prometheus.yml configuration
global:
  scrape_interval: 20s
  evaluation_interval: 30s
# Rule configuration
- alert: kakfa_down
  expr: kakfa_up_status == 0
  for: 1m
  annotations:
    summary: "Kafka is down"
# alertmanager configuration
route:
  group_by: [alertname]
  group_wait: 60s
  group_interval: 5m
  repeat_interval: 10m
Event flow
10:00:05 Kafka goes down
10:00:20 The metric is scraped: kakfa_up_status=0
10:00:30 The rule is evaluated, Kafka is found to be down, and kakfa_down is set to pending
10:00:30~10:01:30 Prometheus keeps scraping the metric and evaluating the rule
10:01:30 kakfa_down has been active for 1 minute, so it is set to firing and sent to Alertmanager
10:01:30 Alertmanager receives the alert and starts the group wait
10:02:30 The group wait ends and the notification is sent
10:12:30 The alert is still unresolved, so the notification is repeated
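The timestamps above can be reproduced with a small Python calculation. This is a sketch under assumptions: scrapes and rule evaluations are taken to tick at fixed multiples of their intervals starting from 10:00:00 (which matches the timeline, but a real Prometheus server chooses its own offsets), the date is arbitrary, and `next_tick` is a helper invented for this example.

```python
import math
from datetime import datetime, timedelta


def next_tick(t, interval, origin):
    """Return the first tick at or after t, with ticks at origin + k * interval."""
    steps = math.ceil((t - origin) / interval)
    return origin + steps * interval


# Assumption: scrape and evaluation ticks are aligned to 10:00:00 (arbitrary date).
origin = datetime(2022, 6, 27, 10, 0, 0)
failure = origin + timedelta(seconds=5)

# Values from the case-study configuration
scrape_interval = timedelta(seconds=20)
evaluation_interval = timedelta(seconds=30)
for_duration = timedelta(minutes=1)
group_wait = timedelta(seconds=60)
repeat_interval = timedelta(minutes=10)

scraped = next_tick(failure, scrape_interval, origin)
pending = next_tick(scraped, evaluation_interval, origin)
firing = next_tick(pending + for_duration, evaluation_interval, origin)
first_notification = firing + group_wait
repeat_notification = first_notification + repeat_interval

for label, t in [
    ("metric scraped", scraped),
    ("alert pending", pending),
    ("alert firing", firing),
    ("first notification", first_notification),
    ("repeat notification", repeat_notification),
]:
    print(f"{t:%H:%M:%S}  {label}")
```

Under these assumptions, the first notification arrives about two and a half minutes after Kafka actually went down, most of it spent in the for duration and the group wait.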