The Art of Prometheus Alerting
2022-07-03 22:16:00 【key_ 3_ feng】
Alerting is an important part of any monitoring system. In the Prometheus monitoring stack, collecting and storing metrics is separated from alerting. Alerting rules are defined on the Prometheus server. When a rule is triggered, the server sends the alert to a separate component, Alertmanager, which processes it and finally notifies the user through a receiver such as email.
When alerting with Prometheus, the first question is: when should an alert be sent? A good alert is sent at the right time, for the right reason, and at the right pace, and it carries real and useful information. An alert should fire when an important system or service is behaving abnormally, end users are having a poor experience or cannot use the service, and a human needs to investigate. Alert on the potential pain points of end users. This keeps the number of alerts small and avoids over-monitoring a system or service. Keeping alerts few and important is a key principle of alerting: every alert should be important, actionable, and real.
The second question is: what should you alert on? For online services that serve content to end users, alert on conditions such as high latency and high error rates. For capacity-related alerts, detect when capacity is about to run out, because exhaustion leads to an outage. For batch jobs, alert when the job has not succeeded recently. Of course, everyone should adapt alert content to their own environment, especially for important online business interfaces. Alerts should link to the relevant dashboards and consoles. The information there can answer basic questions about the service being alerted on, so that the on-call administrator can quickly make sense of the potential problem.
Prometheus server and Alertmanager are two separate components. The Prometheus server collects the monitoring metrics, and threshold-based alerting rules are defined on those metrics in PromQL. The Prometheus server evaluates the alerting rules periodically. If a rule's condition is met, it generates an alert and pushes it to Alertmanager. After receiving the alert, Alertmanager processes it, groups it (grouping), and routes it (routing) to the right receiver, such as email, PagerDuty, or HipChat, which finally delivers the notification of the abnormal event.
Grouping means that Alertmanager groups alerts of the same type and merges multiple alerts into a single notification. In the real world, especially in cloud environments where business lines are tightly coupled, the failure of more than one device can trigger hundreds of alerts. With grouping, these alerts are combined into one notification. This avoids a sudden flood of notifications, which would make it hard for the administrator to locate the problem quickly.
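The grouping idea can be sketched in a few lines of Python. This is only an illustration of the concept, not Alertmanager's implementation: the alert dictionaries, the label names (`alertname`, `cluster`, `instance`) and the helper `group_alerts` are all assumptions made for the example.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bundle alerts that share the same values for the group_by labels.

    Hypothetical helper: each resulting group stands for ONE notification.
    """
    groups = defaultdict(list)
    for alert in alerts:
        # The group key is the tuple of values of the chosen labels.
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Made-up alerts: two instances down in the same cluster, one latency alert elsewhere.
alerts = [
    {"labels": {"alertname": "InstanceDown", "cluster": "c1", "instance": "a"}},
    {"labels": {"alertname": "InstanceDown", "cluster": "c1", "instance": "b"}},
    {"labels": {"alertname": "HighLatency", "cluster": "c2", "instance": "c"}},
]

notifications = group_alerts(alerts, ["alertname", "cluster"])
for key, members in notifications.items():
    print(key, "->", len(members), "alert(s) in one notification")
```

Grouping by `alertname` and `cluster` turns the three alerts into two notifications; grouping by fewer labels would merge more aggressively.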
Inhibition means that once an alert has been sent, Alertmanager stops sending further alerts for other anomalies or failures caused by the same problem. Take a production example: in a cabinet hosted at an IDC, if each cabinet has only a single access-layer switch, a failure of that switch puts every server in the cabinet into a non-UP state and triggers an alert for each. The applications deployed on those servers also become unreachable and trigger their own alerts. In this case, Alertmanager can be configured to ignore the alerts for unreachable servers and applications in the cabinet that result from the switch failure.
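The switch example can be modelled as a small Python sketch. It is a simplified illustration under stated assumptions: the rule shape (`source`, `target`, `equal`), the alert names `SwitchDown` and `InstanceDown`, and the `rack` label are invented for the example and are not copied from a real configuration.

```python
def matches(labels, matchers):
    """True when every matcher label has exactly the expected value."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target, firing, rule):
    """Hypothetical check: is `target` muted by another firing alert?

    A target alert is inhibited when some other alert matches the rule's
    source matchers and agrees with the target on all `equal` labels.
    """
    if not matches(target, rule["target"]):
        return False
    for source in firing:
        if source is target:
            continue
        same_scope = all(source.get(l) == target.get(l) for l in rule["equal"])
        if matches(source, rule["source"]) and same_scope:
            return True
    return False


# Assumed rule: a dead switch mutes "instance down" alerts in the same rack.
rule = {
    "source": {"alertname": "SwitchDown"},
    "target": {"alertname": "InstanceDown"},
    "equal": ["rack"],
}
firing = [
    {"alertname": "SwitchDown", "rack": "r1"},
    {"alertname": "InstanceDown", "rack": "r1", "instance": "a"},
    {"alertname": "InstanceDown", "rack": "r2", "instance": "b"},
]

to_notify = [a for a in firing if not is_inhibited(a, firing, rule)]
for alert in to_notify:
    print(alert)
```

Only the switch alert and the unrelated instance alert in rack `r2` remain; the instance alert in rack `r1` is suppressed because its likely root cause is already being reported.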
Silences provide a simple mechanism for quickly muting alerts based on their labels. Each incoming alert is checked against the active silences. If it matches a silence, Alertmanager sends no notification for it. Administrators can temporarily mute specific alert notifications directly in the Alertmanager web interface.
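A minimal sketch of label-based silencing, assuming a silence is a time window plus a list of label matchers. The field names (`starts_at`, `ends_at`, `matchers`, `is_regex`) and the sample labels are hypothetical, and regular expressions are treated here as fully anchored.

```python
import re
from datetime import datetime, timedelta


def matcher_ok(labels, matcher):
    """Check one matcher (exact or regex) against an alert's labels."""
    value = labels.get(matcher["name"], "")
    if matcher.get("is_regex"):
        return re.fullmatch(matcher["value"], value) is not None
    return value == matcher["value"]


def is_silenced(labels, silences, now):
    """An alert is muted if any active silence matches ALL of its matchers."""
    return any(
        s["starts_at"] <= now < s["ends_at"]
        and all(matcher_ok(labels, m) for m in s["matchers"])
        for s in silences
    )


# Assumed maintenance window: mute InstanceDown for every db-* host for 2 hours.
start = datetime(2022, 7, 3, 22, 0)
silences = [
    {
        "starts_at": start,
        "ends_at": start + timedelta(hours=2),
        "matchers": [
            {"name": "alertname", "value": "InstanceDown"},
            {"name": "instance", "value": "db-.*", "is_regex": True},
        ],
    }
]

now = start + timedelta(minutes=30)
db_alert = {"alertname": "InstanceDown", "instance": "db-01"}
web_alert = {"alertname": "InstanceDown", "instance": "web-01"}
print(is_silenced(db_alert, silences, now))   # muted during the window
print(is_silenced(web_alert, silences, now))  # still notifies
```

Once the window has passed, the same `db_alert` is no longer muted, which is what makes silences suitable for temporary maintenance.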
The Alertmanager configuration file usually consists of the following main sections: global (global configuration), templates (notification templates), route (alert routing), receivers, and inhibit_rules (inhibition rules).
- global is the global configuration. Options set here are shared settings in the Alertmanager configuration file. They serve as default values for the other sections and can be overridden by settings in those sections.
- route (alert routing) describes the rules by which an alert received from the Prometheus server is sent to the destination specified by a receiver.
- receivers: "receiver" is a general term. Each receiver needs a globally unique name and corresponds to one or more notification methods, including email, WeChat, PagerDuty, HipChat, and webhook.
- inhibit_rules implements alert inhibition. Here you specify which alerts should be ignored under specific conditions. You can use it to express priorities, for example giving some alerts precedence so that other alerts in the same group are ignored when they occur at the same time.
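How a route tree selects a receiver can be sketched as follows. This is a simplified model, not Alertmanager's routing code: it assumes exact label matching, takes the first matching child route, and falls back to the parent's receiver. The receiver names and labels are made up for the example.

```python
def pick_receiver(labels, route, inherited=None):
    """Walk a (hypothetical) route tree and return the receiver name.

    The first child whose `match` labels all equal the alert's labels wins;
    if no child matches, the current node's receiver is used.
    """
    receiver = route.get("receiver", inherited)
    for child in route.get("routes", []):
        wanted = child.get("match", {})
        if all(labels.get(name) == value for name, value in wanted.items()):
            return pick_receiver(labels, child, receiver)
    return receiver


# Assumed routing tree with a default receiver at the root.
route = {
    "receiver": "default-email",
    "routes": [
        {
            "match": {"severity": "critical"},
            "receiver": "oncall-pager",
            "routes": [{"match": {"team": "db"}, "receiver": "db-pager"}],
        },
        {"match": {"team": "web"}, "receiver": "web-email"},
    ],
}

print(pick_receiver({"severity": "critical", "team": "db"}, route))
print(pick_receiver({"severity": "warning", "team": "web"}, route))
print(pick_receiver({"severity": "info"}, route))
```

The root receiver acts as a catch-all, so every alert ends up somewhere even when no specific route matches.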
The path in Prometheus from collecting information about monitoring targets to triggering an alert is as follows.
1) Define the rules.
In the Prometheus configuration file, scrape_interval: 15s (the default is 1m) sets how often information about monitoring targets is collected, and the corresponding alerting rules are configured. scrape_interval can be set globally or overridden for an individual scrape job.
2) Periodic evaluation.
The rule expressions are evaluated on a fixed cycle. In the Prometheus configuration file, evaluation_interval: 15s (the default is 1m) sets how often alerting rules are evaluated. evaluation_interval is a global setting; an individual rule group can override it with its own interval.
3) Alert state transitions.
- When the rule condition is found to be true for the first time, that is, the expression is true but has not yet held for the duration given in the rule's for clause, the alert state switches to PENDING.
- If the expression is still true in later evaluation cycles and the duration in the for clause has been reached, the alert state changes to FIRING, that is, active, and Prometheus sends the alert to Alertmanager.
- If the expression is still true in the following evaluation cycles and the for duration remains satisfied, Prometheus keeps sending the alert to Alertmanager.
- When the expression evaluates to false in some evaluation cycle, the alert state changes to inactive, and a resolved message is sent to Alertmanager to indicate that the alert has been resolved.
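The state transitions above can be sketched as a small state machine. This is a minimal model for illustration, not Prometheus source code: the function `evaluate`, the 15-second evaluation step and the 30-second `for` duration are assumptions chosen for the example.

```python
def evaluate(active_since, expr_true, now, for_seconds):
    """Return (state, active_since) after one rule evaluation.

    active_since is the time the expression first became true, or None.
    """
    if not expr_true:
        # Expression is false: the alert is inactive (resolved if it was firing).
        return "inactive", None
    if active_since is None:
        active_since = now  # first evaluation where the expression is true
    if now - active_since >= for_seconds:
        return "firing", active_since
    return "pending", active_since


# Assumed setup: evaluation every 15 s, rule with a 30 s `for` duration.
samples = [True, True, True, False]  # expression result per evaluation cycle
active_since = None
history = []
for step, expr_true in enumerate(samples):
    now = step * 15
    state, active_since = evaluate(active_since, expr_true, now, for_seconds=30)
    history.append(state)

print(history)  # → ['pending', 'pending', 'firing', 'inactive']
```

With a `for` duration of zero the same function returns `firing` on the first true evaluation, which shows why the `for` clause is what introduces the PENDING phase.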