The Art of Prometheus Alerting

2022-07-03 22:16:00 【key_ 3_ feng】

Alerting is an important part of any monitoring system. In Prometheus, collecting and storing metrics is separate from alerting. Alerting rules are defined on the Prometheus server; when a rule fires, the server sends the alert to a separate component, Alertmanager, which processes it and finally notifies users through a receiver such as email.

The first question when alerting with Prometheus is when an alert should be sent. A good alert is sent at the right time, for the right reason, and at the right pace, and it carries real, useful information. An alert should fire when an important system or service is behaving abnormally in a way that degrades the end-user experience or makes the service unusable, and that requires a person to investigate. Alert on potential end-user pain points rather than on every internal signal; this keeps the number of alerts small and avoids over-monitoring a system or service. Keeping alerts few and important is a key principle: every alert should be important, actionable, and real.

The second question is what to alert on. For online services that serve content to end users, alert on conditions such as high latency and high error rates. For capacity, alert when a resource is approaching exhaustion, since running out causes an outage. For batch jobs, alert when the job has not completed successfully recently. Everyone should adapt alert content to their own environment, especially for important online business interfaces. Alerts should link to the relevant dashboards and consoles, which can answer basic questions about the service being alerted on, so that the on-call administrator can quickly interpret the underlying problem.

Prometheus server and Alertmanager are two separate components. Prometheus server collects metrics, and alerting rules (Rules) written in PromQL define threshold conditions on those metrics. Prometheus server evaluates the alerting rules periodically; when a rule's condition is met, it generates an alert and pushes it to Alertmanager. Alertmanager then processes the alerts it receives, groups them (grouping), and routes them (routing) to the right receiver (receiver), such as email, PagerDuty, or HipChat, which finally delivers the notification.

Grouping means that Alertmanager groups alerts of the same kind and merges them into a single notification. In practice, especially in cloud environments where business lines are tightly coupled, the failure of a few devices can trigger hundreds of alerts. With grouping, these alerts are combined into one notification, so the administrator is not flooded with notifications and can still locate the problem quickly.
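
The grouping idea can be sketched in a few lines of Python. This is only an illustration of the concept, not Alertmanager's implementation: the function name group_alerts, the alert dictionaries, and the label names (alertname, cluster, instance) are all assumptions made for the example.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels listed in group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        # Alerts that share the same values for every group_by label
        # land in the same bucket and would share one notification.
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts: two hosts down in the same cluster, plus one latency alert.
alerts = [
    {"labels": {"alertname": "InstanceDown", "cluster": "prod", "instance": "web-1"}},
    {"labels": {"alertname": "InstanceDown", "cluster": "prod", "instance": "web-2"}},
    {"labels": {"alertname": "HighLatency", "cluster": "prod", "instance": "api-1"}},
]

notifications = group_alerts(alerts, group_by=["alertname", "cluster"])
for key, members in notifications.items():
    print(key, "->", len(members), "alert(s) in one notification")
```

Grouping by alertname and cluster here turns three alerts into two notifications; grouping by cluster alone would merge them into one.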

Inhibition means that once an alert has been sent, Alertmanager stops sending alerts for other anomalies or faults caused by the same underlying problem. For example, in a hosted IDC cabinet where each cabinet's access layer is a single switch, a failure of that switch makes every server in the cabinet report as not UP, and the applications deployed on those servers trigger alerts as well because they are unreachable. In this case, Alertmanager can be configured to ignore the alerts for unreachable servers and applications in the cabinet that result from the switch failure.
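
The switch example can be modeled roughly as follows. This is a simplified sketch, assuming exact-match label comparison only; the rule keys (source_match, target_match, equal) are modeled on the fields of an inhibition rule, while the alert names SwitchDown and InstanceDown and the rack label are hypothetical.

```python
def matches(labels, matchers):
    """True if every matcher label has exactly the expected value."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target, firing, rule):
    """True if some other firing alert suppresses target under this rule."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # Source and target must agree on every label listed under "equal".
        if all(source.get(name) == target.get(name) for name in rule["equal"]):
            return True
    return False


# Hypothetical rule: a switch failure mutes server alerts from the same rack.
rule = {
    "source_match": {"alertname": "SwitchDown"},
    "target_match": {"alertname": "InstanceDown"},
    "equal": ["rack"],
}

firing = [
    {"alertname": "SwitchDown", "rack": "r12"},
    {"alertname": "InstanceDown", "rack": "r12", "instance": "srv-1"},
    {"alertname": "InstanceDown", "rack": "r07", "instance": "srv-9"},
]

to_notify = [alert for alert in firing if not is_inhibited(alert, firing, rule)]
for alert in to_notify:
    print(alert)
```

Only the switch alert and the server in the unaffected rack remain; the server behind the failed switch is suppressed because it shares the rack label with the source alert.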

Silences provide a simple way to mute alerts quickly based on their labels. Each incoming alert is checked against the configured silences; if it matches one, Alertmanager sends no notification for it. Administrators can temporarily silence specific alert notifications directly in the Alertmanager web interface.
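
A minimal sketch of the silence check, assuming a silence is just a set of exact-match label matchers with a start and end time. The function name is_silenced, the dictionary layout, and the db-1 maintenance window are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def is_silenced(labels, silences, now):
    """True if an active silence matches all of its labels against the alert."""
    for silence in silences:
        active = silence["starts_at"] <= now < silence["ends_at"]
        if active and all(
            labels.get(name) == value for name, value in silence["matchers"].items()
        ):
            return True
    return False


# Hypothetical two-hour maintenance window for one database host.
start = datetime(2022, 7, 3, 22, 0, tzinfo=timezone.utc)
silences = [
    {
        "matchers": {"instance": "db-1"},
        "starts_at": start,
        "ends_at": start + timedelta(hours=2),
    }
]

alert = {"alertname": "InstanceDown", "instance": "db-1"}
print(is_silenced(alert, silences, start + timedelta(minutes=30)))
print(is_silenced(alert, silences, start + timedelta(hours=3)))
```

The same alert is muted inside the window and notified again once the silence has expired, which is why silences suit planned maintenance better than permanent rule changes.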

The Alertmanager configuration file usually consists of these main sections: global (global configuration), templates (notification templates), route (alert routing), receivers (receivers), and inhibit_rules (inhibition rules).

global is the global configuration. Options set here are shared settings: they serve as defaults for the other sections and can be overridden by settings in those sections.
route (alert routing) describes the rules for sending alerts received from Prometheus server to the destination specified by a receiver.
receivers is a general term. Each receiver needs a globally unique name and corresponds to one or more notification methods, including email, WeChat, PagerDuty, HipChat, and Webhook.
inhibit_rules implements alert inhibition by specifying which alerts to ignore under particular conditions. It can be used to set priorities: for example, when alerts in the same group occur at the same time, the prioritized alert is sent and the others are ignored.
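
How these sections relate can be sketched with a plain Python dictionary whose keys loosely mirror the section names above. This is a deliberately simplified model, assuming a flat list of child routes with exact-match labels where the first match wins; the receiver names team-email and oncall-pager and the severity label are hypothetical.

```python
# Hypothetical configuration; dict keys loosely mirror the file's sections.
config = {
    "global": {"resolve_timeout": "5m"},
    "route": {
        "receiver": "team-email",  # default receiver when no child route matches
        "routes": [
            {"match": {"severity": "critical"}, "receiver": "oncall-pager"},
        ],
    },
    "receivers": [{"name": "team-email"}, {"name": "oncall-pager"}],
    "inhibit_rules": [],
}


def check_receivers(config):
    """Receiver names must be unique, and every route must name a known receiver."""
    names = [receiver["name"] for receiver in config["receivers"]]
    if len(names) != len(set(names)):
        raise ValueError("receiver names must be unique")
    used = [config["route"]["receiver"]]
    used += [route["receiver"] for route in config["route"]["routes"]]
    missing = [name for name in used if name not in names]
    if missing:
        raise ValueError(f"undefined receivers: {missing}")


def pick_receiver(labels, config):
    """Return the receiver of the first matching child route, else the default."""
    for route in config["route"]["routes"]:
        if all(labels.get(name) == value for name, value in route["match"].items()):
            return route["receiver"]
    return config["route"]["receiver"]


check_receivers(config)
print(pick_receiver({"alertname": "InstanceDown", "severity": "critical"}, config))
print(pick_receiver({"alertname": "DiskFilling", "severity": "warning"}, config))
```

Critical alerts go to the pager receiver and everything else falls back to the default receiver on the top-level route.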

From collecting information about a monitored target to triggering an alert, Prometheus goes through the following steps.

1) Define the rules.

In the Prometheus configuration file, scrape_interval: 15s (the default is 1m) sets how often metrics are scraped from the monitored targets, and the corresponding alerting rules are configured as well. scrape_interval can be set globally or overridden for an individual scrape job.

2) Periodic evaluation.

In the Prometheus configuration file, evaluation_interval: 15s (the default is 1m) sets how often the alerting rule expressions are evaluated. evaluation_interval is a global setting.

3) Alert state transitions.

When the rule's condition is first found to be true, that is, the expression evaluates to true, but the duration specified in the rule's for clause has not yet elapsed, the alert state switches to PENDING.
If the expression is still true in a later evaluation cycle and the duration specified in the for clause has elapsed, the alert state changes to FIRING, and Prometheus sends the alert to Alertmanager.
If the expression is still true in subsequent evaluation cycles, Prometheus keeps sending the alert to Alertmanager.
Once the expression evaluates to false in some evaluation cycle, the alert state changes to inactive, and a resolved notification is sent to Alertmanager to indicate that the alert has been resolved.
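
The transitions above can be traced with a small simulation. It is a sketch of the state logic only, assuming one evaluation every 15 seconds and a for duration of 30 seconds; the function name simulate and the sequence of expression results are invented for the example.

```python
def simulate(results, interval_s, for_s):
    """Return (time, state) for each evaluation of one alerting rule."""
    active_since = None  # time at which the expression first became true
    timeline = []
    for i, is_true in enumerate(results):
        now = i * interval_s
        if not is_true:
            # Expression is false: the alert goes back to inactive (resolved).
            active_since = None
            state = "inactive"
        else:
            if active_since is None:
                active_since = now
            # Pending until the expression has held for the whole `for` duration.
            state = "firing" if now - active_since >= for_s else "pending"
        timeline.append((now, state))
    return timeline


# Hypothetical expression results for six consecutive evaluation cycles.
results = [False, True, True, True, True, False]
timeline = simulate(results, interval_s=15, for_s=30)
for now, state in timeline:
    print(f"t={now:>2}s  {state}")
```

With these numbers the alert is pending at 15s and 30s, starts firing at 45s once the expression has held for 30 seconds, and returns to inactive at 75s. A shorter for duration makes alerts fire sooner but also lets brief spikes through.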

Original site

Copyright notice
This article was written by [key_ 3_ feng]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/02/202202142223339874.html
