Yyds dry goods inventory: The art of Prometheus alerting
2022-07-03 22:16:00 【key_ 3_ feng】
Alerting is an important part of any monitoring system. In the Prometheus monitoring system, the collection and storage of metrics are separated from alerting. Alerting rules are defined on the Prometheus server. When a rule is triggered, the server sends the alert to a separate component, Alertmanager, which processes it and finally notifies the user through a receiver such as email.
When alerting with Prometheus, the first question is: when should an alert be sent? A good alert is sent at the right time, for the right reason, and at the right pace, and it carries real, useful information. An alert should fire when an important system or service is behaving abnormally, so that end users have a poor experience or cannot use the service at all, and a human needs to investigate. Alert on what can actually hurt end users, and keep the number of alerts small so that the system or service is not over-monitored. Keeping a small number of important alerts is a key principle of alerting: every alert should be important, actionable, and real.
The second question is: what should an alert cover? For online services that serve content to end users, it usually makes sense to alert on conditions such as high latency and high error rate. For capacity alerts, detect when capacity is about to be exhausted, because running out leads to an outage. For batch jobs, alert when the job has not completed successfully recently. Of course, everyone should tailor alerts to their own application environment, especially for important online business interfaces. Alerts should link to the relevant dashboards and consoles, whose information can answer basic questions about the service being alerted on, so that the on-call administrator can quickly understand the underlying problem.
Prometheus server and Alertmanager are two separate components. Prometheus server collects the monitoring metrics, and threshold-based alerting rules (rules) for those metrics are defined in PromQL. Prometheus server evaluates the alerting rules periodically. If a rule's trigger condition is met, an alert is generated and pushed to Alertmanager. After receiving the alert, Alertmanager processes it, groups it (grouping), and routes it (routing) to the right receiver (receiver), such as email, PagerDuty, or HipChat, which finally delivers the notification of the abnormal event.
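As an illustration of a threshold rule written in PromQL, here is a minimal sketch of a Prometheus rules file. The metric `http_requests_total`, the `job="example-api"` label, the 5% threshold, and the dashboard URL are all assumptions made for this example and do not come from the original article; adjust them to your own environment.

```yaml
# rules/example-api.yml (hypothetical file name)
groups:
  - name: example-api-alerts
    rules:
      - alert: HighErrorRate
        # Share of 5xx responses over the last 5 minutes (assumed metric and labels).
        expr: |
          sum(rate(http_requests_total{job="example-api", status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{job="example-api"}[5m])) > 0.05
        # The condition must hold for 10 minutes before the alert fires.
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "More than 5% of example-api requests are failing"
          # Link to a dashboard so the on-call engineer can investigate (assumed URL).
          dashboard: "https://grafana.example.com/d/example-api"
```

The `labels` attached here are what Alertmanager later uses for grouping, routing, inhibition, and silencing, while `annotations` carry the human-readable context.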
Grouping means that Alertmanager groups alerts of the same kind and merges multiple alerts into a single notification. In practice, especially in cloud environments where business lines are tightly coupled, the failure of several devices can trigger hundreds of alerts. With grouping, these alerts are combined into one notification, so the administrator is not suddenly flooded with notifications that make it impossible to locate the problem quickly.
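A minimal sketch of how grouping might be configured in the Alertmanager `route` section. The label names `alertname` and `cluster`, the receiver name, and the webhook URL are assumptions for illustration; the three timing values shown are Alertmanager's documented defaults.

```yaml
route:
  receiver: ops-webhook
  # Alerts that share the same values for these labels end up in one notification.
  group_by: ['alertname', 'cluster']
  # Wait this long after the first alert of a new group so related alerts can join it.
  group_wait: 30s
  # Minimum wait before notifying about new alerts added to an existing group.
  group_interval: 5m
  # How long to wait before re-sending a notification for alerts that are still firing.
  repeat_interval: 4h

receivers:
  - name: ops-webhook
    webhook_configs:
      - url: 'http://alert-gateway.example.com/hook'  # hypothetical endpoint
```

With a configuration like this, a hundred `InstanceDown` alerts from one cluster would arrive as a single notification listing the affected instances.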
Inhibition in Alertmanager means that once an alert has been sent, notifications for other alerts caused by the same underlying fault are suppressed. Consider a production environment in colocated IDC racks where each rack has only a single access-layer switch. If that switch fails, every server in the rack raises a "not UP" alert, and the applications deployed on those servers raise their own alerts because they are unreachable. In this case, Alertmanager can be configured to ignore the server and application alerts that are caused by the switch failure.
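The rack scenario above could be expressed roughly as follows. The alert names `SwitchDown`, `InstanceDown`, and `AppUnreachable` and the `rack` label are hypothetical and assume that your rules attach a `rack` label to all three alerts. The `source_matchers` and `target_matchers` fields are available in Alertmanager 0.22 and later.

```yaml
inhibit_rules:
  # While the access switch alert is firing for a rack ...
  - source_matchers:
      - 'alertname="SwitchDown"'
    # ... mute server and application alerts ...
    target_matchers:
      - 'alertname=~"InstanceDown|AppUnreachable"'
    # ... but only when they carry the same rack label value as the switch alert.
    equal: ['rack']
```

The `equal` list is what keeps the rule narrow: alerts from other racks are still delivered.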
Silences provide a simple mechanism for quickly muting alerts based on their labels. Each incoming alert is checked against the active silences, and if it matches one, Alertmanager sends no notification for it. Administrators can temporarily silence specific alert notifications directly in the Alertmanager web interface.
An Alertmanager configuration file usually contains the following main sections: global (global configuration), templates (notification templates), route (alert routing), receivers (receivers), and inhibit_rules (inhibition rules).
- global: the global configuration. Options set here are shared settings that act as default values for the other sections and can be overridden by settings in those sections.
- route: alert routing. This section describes the rules for sending an alert generated by Prometheus server to the destination specified by a receiver.
- receivers: "receiver" is a general term. Every receiver needs a globally unique name and corresponds to one or more notification methods, including email, WeChat, PagerDuty, HipChat, and webhook.
- inhibit_rules: this section implements alert inhibition. It specifies which alerts are ignored under which conditions. It can be used to express priorities: for example, when several alerts in the same group fire at the same time, let certain alerts take precedence and ignore the others.
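To show how the five sections fit together, here is a skeleton `alertmanager.yml`. Every host name, address, path, receiver name, and label value in it is a placeholder chosen for this sketch, not something from the original article, and a real SMTP server will usually also need authentication settings.

```yaml
global:
  # Shared defaults; email receivers below inherit the SMTP settings.
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'

templates:
  # Custom notification templates (assumed path).
  - '/etc/alertmanager/templates/*.tmpl'

route:
  # Default receiver for anything not matched by a child route.
  receiver: default-email
  group_by: ['alertname']
  routes:
    # Alerts labelled severity="page" go to the on-call webhook instead.
    - matchers:
        - 'severity="page"'
      receiver: oncall-webhook

receivers:
  - name: default-email
    email_configs:
      - to: 'ops@example.com'
  - name: oncall-webhook
    webhook_configs:
      - url: 'http://alert-gateway.example.com/hook'

inhibit_rules:
  # A critical alert mutes the warning-level alert for the same alertname and instance.
  - source_matchers:
      - 'severity="critical"'
    target_matchers:
      - 'severity="warning"'
    equal: ['alertname', 'instance']
```

If `amtool` is installed, `amtool check-config alertmanager.yml` can validate a file like this before it is deployed.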
From collecting data about a monitored target to triggering an alert, Prometheus goes through the following stages.
1) Define the rules.
In the Prometheus configuration file, scrape_interval: 15s (the default is 1m) sets the interval at which data is scraped from monitored targets, and the corresponding alerting rules are configured alongside it. scrape_interval can be set globally or overridden for an individual scrape job.
2) Periodic evaluation.
The rule expressions are evaluated periodically. In the Prometheus configuration file, evaluation_interval: 15s (the default is 1m) sets the evaluation interval for alerting rules. evaluation_interval itself is a global setting only (an individual rule group can override it with its own interval field).
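The two intervals might be set in `prometheus.yml` as in the sketch below. The job name, target addresses, and rule file path are assumptions made for the example.

```yaml
global:
  scrape_interval: 15s      # how often targets are scraped (default 1m)
  evaluation_interval: 15s  # how often alerting rules are evaluated (default 1m)

# Files that contain the alerting rules (assumed path).
rule_files:
  - 'rules/*.yml'

# Where firing alerts are pushed.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager.example.com:9093']

scrape_configs:
  - job_name: example-api
    # Per-job override of the global scrape interval.
    scrape_interval: 30s
    static_configs:
      - targets: ['api.example.com:8080']
```

Here every job is scraped every 15 seconds except `example-api`, which is scraped every 30 seconds, while all rules are evaluated every 15 seconds.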
3) Alert state transitions.
- When the rule expression is first found to be true but has not yet been true for the duration specified in the rule's for clause, the alert state switches to PENDING.
- If the expression is still true in later evaluation cycles and has now been true for the duration specified in the for clause, the alert state changes to FIRING, that is, active, and Prometheus sends the alert to Alertmanager.
- As long as the expression remains true in subsequent evaluation cycles, Prometheus keeps sending the alert to Alertmanager.
- When the expression becomes false in some evaluation cycle, the alert state changes to inactive, and a resolved message is sent to Alertmanager to indicate that the alert has been resolved.
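The state transitions can be traced on a small rule. The sketch below assumes evaluation_interval: 15s and a rule with for: 1m; the alert name is illustrative, and the timeline in the comments is a walk-through of the behavior described above rather than captured output.

```yaml
groups:
  - name: lifecycle-demo
    rules:
      - alert: InstanceDown
        # "up" is set to 0 by Prometheus when a scrape of the target fails.
        expr: up == 0
        for: 1m

# Assumed timeline with evaluation_interval: 15s
#   t = 0s    expr first evaluates to true           -> PENDING
#   t = 15s   still true, 15s < 1m                   -> PENDING
#   t = 30s   still true, 30s < 1m                   -> PENDING
#   t = 45s   still true, 45s < 1m                   -> PENDING
#   t = 60s   still true, 1m has elapsed             -> FIRING, sent to Alertmanager
#   t = 75s   still true                             -> FIRING, sent again
#   later     expr evaluates to false                -> inactive, resolved sent to Alertmanager
```

If the expression had turned false at any point before t = 60s, the alert would have returned to inactive without any notification, which is how the for clause filters out short-lived spikes.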