The Art of Prometheus Alerting
2022-07-03 22:16:00 【key_ 3_ feng】
Alerting is an important part of any monitoring system. In the Prometheus monitoring system, the collection and storage of metrics are separate from alerting. Alerting rules are defined on the Prometheus server. When a rule is triggered, the server sends the alert to a separate component, Alertmanager, which processes it and finally notifies the user through a receiver such as email.
When alerting with Prometheus, the first question is: when should an alert be sent? A good alert is sent at the right time, for the right reason, and at the right pace, and it carries real, useful information. An alert should fire when an important system or service is behaving abnormally, end users are having a poor experience or cannot use the service at all, and a human needs to investigate. Alert on what could hurt end users, and keep the number of alerts small so that the system or service is not over-monitored. Keeping alerts few and important is a key principle of alerting: every alert should be important, actionable, and real.
The second question is what to alert on. For online services that serve content to end users, it usually makes sense to alert on conditions such as high latency and high error rate. For capacity, detect when resources are about to run out, since exhaustion causes an outage. For batch jobs, alert when no run has succeeded recently. Of course, everyone should adapt their alerts to their own environment, especially for important online business interfaces. Alerts should link to the relevant dashboards and consoles, whose information can answer basic questions about the affected service so that the on-call administrator can quickly make sense of the underlying problem.
The Prometheus server and Alertmanager are two separate components. The Prometheus server collects the monitoring metrics, and threshold-based alerting rules for those metrics are defined in PromQL. The server evaluates the rules periodically; if a rule's condition is met, it generates an alert and pushes it to Alertmanager. On receiving the alert, Alertmanager processes it, groups it (grouping), and routes it (routing) to the right receiver, such as email, PagerDuty, or HipChat, which finally delivers the notification of the abnormal event.
Grouping means that Alertmanager collects alerts of the same type and merges them into a single notification. In practice, especially in cloud environments where business lines are tightly coupled, the failure of several devices can trigger hundreds of alerts. Grouping combines these alerts into one notification, so the administrator is not suddenly flooded with notifications that make it impossible to locate the problem quickly.
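The grouping idea can be illustrated with a minimal Python sketch. It is not Alertmanager's implementation: the function `group_alerts`, the alert dictionaries, and the label names are hypothetical, chosen only to show how alerts that share the same values for a chosen set of labels end up in one bucket, and therefore in one notification.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels listed in group_by.

    Hypothetical helper for illustration; a missing label counts as "".
    """
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Made-up alerts: two instances down in the same cluster, plus one latency alert.
alerts = [
    {"labels": {"alertname": "InstanceDown", "cluster": "c1", "instance": "10.0.0.1"}},
    {"labels": {"alertname": "InstanceDown", "cluster": "c1", "instance": "10.0.0.2"}},
    {"labels": {"alertname": "HighLatency", "cluster": "c1", "instance": "10.0.0.1"}},
]

groups = group_alerts(alerts, ["alertname", "cluster"])
for key, members in groups.items():
    # One notification per group, listing how many alerts it covers.
    print(key, "->", len(members), "alert(s)")
```

Grouping on fewer labels produces fewer, larger notifications; grouping on more labels produces more, smaller ones.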
Inhibition means that once an alert has been sent, Alertmanager stops sending further alerts for the other anomalies or failures that this alert causes. Take a hosted cabinet in an IDC as an example: if each cabinet has only a single access-layer switch, a failure of that switch triggers "not UP" alerts for every server in the cabinet, and the applications deployed on those servers trigger alerts as well because they are unreachable. In this case, Alertmanager can be configured to ignore the alerts for the unreachable servers and applications in the cabinet that result from the switch failure.
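The switch scenario can be sketched in Python as follows. This is a simplified model, not Alertmanager's code: the rule fields `source_match`, `target_match`, and `equal` are borrowed loosely from the shape of an inhibition rule, while the alert names and the `rack` label are assumptions made for the example.

```python
def matches(labels, matchers):
    """True if every matcher label has exactly the required value."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target, firing_alerts, rule):
    """True if some firing source alert suppresses the target alert.

    A target is suppressed when it matches target_match, a firing alert
    matches source_match, and both agree on every label listed in equal.
    """
    if not matches(target, rule["target_match"]):
        return False
    for source in firing_alerts:
        if source is target:
            continue
        same_scope = all(source.get(l) == target.get(l) for l in rule["equal"])
        if matches(source, rule["source_match"]) and same_scope:
            return True
    return False


# Hypothetical rule: a switch failure hides server alerts in the same rack.
rule = {
    "source_match": {"alertname": "SwitchDown"},
    "target_match": {"alertname": "InstanceDown"},
    "equal": ["rack"],
}

switch_down = {"alertname": "SwitchDown", "rack": "A01"}
server_same_rack = {"alertname": "InstanceDown", "rack": "A01", "instance": "10.0.0.1"}
server_other_rack = {"alertname": "InstanceDown", "rack": "B02", "instance": "10.0.1.1"}
firing = [switch_down, server_same_rack, server_other_rack]

print(is_inhibited(server_same_rack, firing, rule))   # suppressed by the switch alert
print(is_inhibited(server_other_rack, firing, rule))  # different rack, still notified
```

The `equal` list is what keeps the suppression local: a switch failure in one rack does not hide server alerts from another rack.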
Silences provide a simple mechanism for quickly muting alerts based on their labels. Alertmanager checks every incoming alert against the configured silences; if the alert matches one, no notification is sent. Administrators can temporarily mute specific alert notifications directly in the Alertmanager web interface.
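A rough Python sketch of label-based silencing is shown here. The data layout (`matchers`, `starts_at`, `ends_at`, `is_regex`) is an assumption made for illustration and does not claim to reproduce Alertmanager's internal types; it only shows that a silence applies when it is currently active and all of its matchers agree with the alert's labels.

```python
import re
from datetime import datetime, timedelta


def silence_matches(silence, labels, now):
    """True if the silence is active at `now` and every matcher fits the labels."""
    if not (silence["starts_at"] <= now < silence["ends_at"]):
        return False
    for matcher in silence["matchers"]:
        value = labels.get(matcher["name"], "")
        if matcher["is_regex"]:
            # The regex must match the whole label value.
            if re.fullmatch(matcher["value"], value) is None:
                return False
        elif value != matcher["value"]:
            return False
    return True


# Hypothetical two-hour maintenance window for all staging database hosts.
start = datetime(2022, 7, 3, 22, 0, 0)
silence = {
    "starts_at": start,
    "ends_at": start + timedelta(hours=2),
    "matchers": [
        {"name": "env", "value": "staging", "is_regex": False},
        {"name": "instance", "value": r"db-\d+", "is_regex": True},
    ],
}

during = start + timedelta(minutes=30)
after = start + timedelta(hours=3)
staging_db = {"alertname": "InstanceDown", "env": "staging", "instance": "db-7"}
prod_db = {"alertname": "InstanceDown", "env": "prod", "instance": "db-7"}

print(silence_matches(silence, staging_db, during))  # muted
print(silence_matches(silence, prod_db, during))     # not muted: env differs
print(silence_matches(silence, staging_db, after))   # not muted: silence expired
```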
The Alertmanager configuration file usually consists of several main sections: global (global configuration), templates (alert templates), route (alert routing), receivers (receivers), and inhibit_rules (inhibition rules).
- global is the global configuration. Options set here are common settings: they serve as default values for the other sections and can be overridden by settings in those sections.
- route (alert routing) describes the rules by which alerts received from the Prometheus server are sent to the destination specified by a receiver.
- receivers: "receiver" is a general term. Each receiver needs a globally unique name and corresponds to one or more notification methods, including email, WeChat, PagerDuty, HipChat, and webhooks.
- inhibit_rules implements alert inhibition. It lets us specify alerts to ignore under particular conditions. It can be used to set priorities: for example, give some alerts priority so that when alerts in the same group fire at the same time, the others are ignored.
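How route and receivers relate can be sketched with a small routing tree in Python. This is a simplified model under stated assumptions: the dictionary layout, the `match` key, and the receiver names `team-email` and `pagerduty-oncall` are hypothetical, and the sketch stops at the first matching child route rather than covering every routing option.

```python
def pick_receiver(route, labels, inherited=None):
    """Walk the routing tree and return the name of the receiver to notify.

    A child route is chosen if all of its match labels equal the alert's
    labels; a route without its own receiver inherits its parent's.
    """
    receiver = route.get("receiver", inherited)
    for child in route.get("routes", []):
        if all(labels.get(k) == v for k, v in child.get("match", {}).items()):
            return pick_receiver(child, labels, receiver)
    return receiver


# Hypothetical tree: everything goes to email unless it is critical.
root_route = {
    "receiver": "team-email",
    "routes": [
        {"match": {"severity": "critical"}, "receiver": "pagerduty-oncall"},
        {"match": {"team": "db"}},  # no receiver of its own: inherits team-email
    ],
}

print(pick_receiver(root_route, {"alertname": "InstanceDown", "severity": "critical"}))
print(pick_receiver(root_route, {"alertname": "SlowQuery", "team": "db"}))
print(pick_receiver(root_route, {"alertname": "DiskFilling", "severity": "warning"}))
```

Each name returned by the sketch would have to correspond to an entry in the receivers section, which is why receiver names must be unique.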
From collecting information about monitoring targets to triggering an alert, Prometheus goes through the following steps.
1) Define the rules .
In the Prometheus configuration file, scrape_interval: 15s (the default is 1m) sets how often information is collected from the monitoring targets, and the corresponding alerting rules are configured alongside it. scrape_interval can be set globally, and it can also be set for an individual scrape job.
2) Periodic evaluation.
The rule expressions are evaluated on a fixed cycle. In the Prometheus configuration file, evaluation_interval: 15s (the default is 1m) sets how often alerting rules are evaluated. evaluation_interval itself is a global-only setting; an individual rule group can set its own interval instead.
3) Alert state transitions.
- When the rule's condition is found to be true for the first time, that is, the expression evaluates to true but has not yet held for the duration given in the rule's for clause, the alert state switches to PENDING.
- If the expression is still true in a later evaluation cycle and the duration given in the for clause has been met, the alert state changes to FIRING, that is, active, and Prometheus sends the alert to Alertmanager.
- In the following evaluation cycles, as long as the expression remains true and the for duration is satisfied, Prometheus keeps sending the alert to Alertmanager.
- When the expression evaluates to false in some evaluation cycle, the alert state changes to inactive, and a resolve is sent to Alertmanager to indicate that the alert has been resolved.
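The state transitions above can be simulated with a short Python sketch. It is a toy model, not Prometheus code: the class `AlertStateMachine`, its method names, and the timings are hypothetical, and time is represented as plain seconds so the example stays deterministic.

```python
class AlertStateMachine:
    """Toy model of one alert: inactive -> pending -> firing -> inactive."""

    def __init__(self, for_seconds):
        self.for_seconds = for_seconds
        self.state = "inactive"
        self.active_since = None  # time the expression first became true

    def evaluate(self, expr_true, now):
        """Run one evaluation cycle; return what would be sent, if anything."""
        if not expr_true:
            was_firing = self.state == "firing"
            self.state = "inactive"
            self.active_since = None
            return "resolved" if was_firing else None
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            self.state = "firing"
            return "firing"  # sent again on every cycle while still true
        self.state = "pending"
        return None


# Assumed settings: for clause of 30 s, evaluation every 15 s.
machine = AlertStateMachine(for_seconds=30)
samples = [(0, True), (15, True), (30, True), (45, True), (60, False)]

states, sent = [], []
for now, expr_true in samples:
    sent.append(machine.evaluate(expr_true, now))
    states.append(machine.state)

print(states)
print(sent)
```

With these assumed settings the alert stays pending for two cycles, fires at the 30-second mark, keeps firing while the expression holds, and resolves on the first false evaluation.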