Geek Planet | The Road to Building a Systematic Business Monitoring and Alerting System

2022-06-22 20:45:00 Mobtech mobo Technology

Background and current state of monitoring systems

The importance of a monitoring system to a business system is self-evident. But how to choose a monitoring system, and how to implement monitoring and alerting with it, have always been two hard problems. This article shares how a data-intelligence company approaches monitoring, along with some recommendations, for discussion.
Common business monitoring systems usually implement operating-system-level monitoring first, an area that is already relatively mature, and then extend to other kinds of monitoring; examples are Zabbix and Xiaomi's Open-Falcon. There are also monitoring systems that support both, such as Prometheus. If business monitoring requirements are high, developers are advised to consider Prometheus first.

As of May 2022, the usage distribution of monitoring systems according to data from Google was as follows:
[Figure: Monitoring system usage distribution]

Open-source monitoring and alerting system: Prometheus

Prometheus is an open-source monitoring and alerting system with a built-in time series database (TSDB), originally developed at SoundCloud. It is a framework for collecting and storing monitoring data (the monitoring server side). It supports many exporters for collecting metrics and supports PushGateway for reporting data, and its performance is sufficient for clusters of tens of thousands of machines. Whereas other monitoring systems have data pushed to them, Prometheus uses a pull model.

1、Basic principle:
Prometheus works by periodically scraping the state of monitored components over HTTP. The advantage is that any component can be connected to monitoring simply by providing an HTTP interface, with no SDK or other integration work. This suits virtualized environments such as VMs and Docker well. Prometheus is one of the few monitoring systems well suited to Docker, Mesos, and Kubernetes environments.

2、Exporter:
The HTTP interface through which a monitored component outputs its information is called an exporter. Most components commonly used by Internet companies already have ready-made exporters, such as Varnish, HAProxy, Nginx, MySQL, and Linux system information (disk, memory, CPU, network, and so on). The kind of data collected depends on the exporter (the monitoring client); for example, collecting MySQL data requires mysqld_exporter. After Prometheus calls mysqld_exporter and collects the MySQL metrics, it stores the collected data in data files on the Prometheus server's disk. The components are mostly written in Go, which makes them easy to compile and deploy; they have no special dependencies and basically run independently.

Common exporters include:

  • Official exporters: https://github.com/prometheus;
  • node_exporter: collects data for Linux-like operating systems;
  • jmx_exporter: collects Java process metrics;
  • mysqld_exporter: collects MySQL server data;
  • redis_exporter: collects Redis data.
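
As a rough sketch of how exporters like these plug into the server, the fragment below adds one scrape job per exporter to prometheus.yml. The host names are placeholders invented for illustration, and the ports are the exporters' usual defaults (9100 for node_exporter, 9104 for mysqld_exporter, 9121 for redis_exporter); both should be adjusted to the actual deployment.

```yaml
# prometheus.yml (fragment): an illustrative sketch, not the article's original configuration
global:
  scrape_interval: 15s          # how often Prometheus pulls each target

scrape_configs:
  - job_name: node              # Linux host metrics served by node_exporter
    static_configs:
      - targets: ['host-a.example:9100']    # hypothetical host
  - job_name: mysql             # MySQL server metrics served by mysqld_exporter
    static_configs:
      - targets: ['db-a.example:9104']      # hypothetical host
  - job_name: redis             # Redis metrics served by redis_exporter
    static_configs:
      - targets: ['cache-a.example:9121']   # hypothetical host
```

Each target only has to answer HTTP requests on its metrics path (/metrics by default); nothing is installed into the monitored application itself.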

3、Prometheus architecture:

[Figure: Prometheus architecture overview]

1)Prometheus Server
Mainly responsible for data collection and storage, and provides the PromQL query language. The server discovers the targets to scrape through configuration files, text files, ZooKeeper, Consul, DNS SRV lookup, and similar mechanisms. Based on these targets, the server periodically scrapes metrics data; each scrape target must expose an HTTP endpoint for Prometheus to scrape. This approach, in which the server calls the monitored object to fetch monitoring data, is called pull. Pull is a distinctive design choice of Prometheus and differs from most monitoring systems, which use push.

2)PushGateway
An intermediate gateway that lets short-lived jobs actively push their metrics. Some existing systems are built around push; to integrate them, Prometheus provides PushGateway. These systems push their metrics to PushGateway, and Prometheus periodically scrapes the data from the gateway.
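
From the server's point of view, PushGateway is just another scrape target. A minimal sketch, assuming the gateway runs on a hypothetical host and listens on its default port 9091:

```yaml
# prometheus.yml (fragment): scrape the gateway like any other target
scrape_configs:
  - job_name: pushgateway
    honor_labels: true          # keep the job/instance labels that clients pushed
    static_configs:
      - targets: ['pushgateway.example:9091']   # hypothetical host, default port
```

Setting honor_labels to true keeps the labels supplied by the pushing jobs instead of overwriting them with the gateway's own job and instance labels.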

3)AlertManager
AlertManager is a component independent of Prometheus. Alerting rules are written as Prometheus query expressions and configured in Prometheus; when one of these preset rules is triggered, Prometheus pushes the alert to AlertManager.

4)Exporter
Exporter is the general term for a class of Prometheus data collection components. An exporter collects data from its target and converts it into a format Prometheus supports. Unlike traditional collection components, it does not send data to a central server but waits for the central server to come and scrape it. Prometheus offers many types of exporters for collecting the state of different services, currently covering databases, hardware, message middleware, storage systems, HTTP servers, JMX, and more.

5)HTTP API
This is a way of querying data that lets you customize the output you need.

6)Prometheus supports two storage modes
The first is local storage: Prometheus's built-in time series database saves data to the local disk. For performance, SSDs are recommended. Local storage capacity is limited, however, so keeping more than about one month of data is not recommended.

The other is remote storage, suited to storing large volumes of monitoring data. Through an intermediate adapter, Prometheus currently supports back ends such as OpenTSDB, InfluxDB, and Elasticsearch. The adapter implements Prometheus's remote write and remote read interfaces, so the back end can be used as remote storage for Prometheus.
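
A minimal sketch of what wiring in such an adapter might look like in prometheus.yml. The adapter host, port, and paths below are assumptions made for illustration; the real values depend on which adapter is deployed.

```yaml
# prometheus.yml (fragment): remote storage through an adapter
remote_write:
  - url: "http://remote-adapter.example:9201/write"   # hypothetical adapter endpoint
remote_read:
  - url: "http://remote-adapter.example:9201/read"    # hypothetical adapter endpoint
```

With this in place Prometheus continues to write to its local TSDB and additionally forwards samples to the adapter, which translates them for the chosen back end.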

7)Core value of Prometheus
System monitoring: mainly tracks basic operating system items such as CPU, memory, disk, I/O, TCP connections, and inbound and outbound traffic;

Program monitoring: usually requires cooperation with developers, with the program actively reporting the data it collects or writing logs in a specific format;

Business monitoring: can include user access QPS, DAU (daily active users), access status (HTTP status codes), and business interfaces (login, registration, chat, upload, comments, SMS, search).

Open-source analytics and monitoring platform: Grafana

1、Introduction
Grafana is an open-source analytics and monitoring platform that supports data sources such as Graphite, InfluxDB, OpenTSDB, Prometheus, Elasticsearch, and CloudWatch. Its UI is attractive and highly customizable. The Prometheus + Grafana combination can meet the monitoring needs of most small and medium-sized teams.

After adding a data source, add custom dashboards and panels according to actual needs.
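
The article adds the data source through the Grafana UI. As an alternative sketch, Grafana can also load data sources from a provisioning file at startup; the file location and the Prometheus URL below are assumptions for illustration.

```yaml
# e.g. provisioning/datasources/prometheus.yml in Grafana's configuration directory (assumed path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                         # Grafana's backend queries Prometheus on the browser's behalf
    url: http://prometheus.example:9090   # hypothetical Prometheus address
    isDefault: true
```

Keeping the data source in a file makes the setup reproducible across environments, at the cost of having to restart or reload Grafana after changes.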

2、Integrating a Spring Boot project in five steps

Step 1: Create a new Spring Boot project and add the dependencies that expose its metrics to Prometheus.
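
The original dependency screenshots are not reproduced here. A plausible minimal setup, assuming the project includes spring-boot-starter-actuator and micrometer-registry-prometheus, is an application.yml along these lines; port 8081 matches the URL used in the next step.

```yaml
# application.yml: a sketch that assumes spring-boot-starter-actuator and
# micrometer-registry-prometheus are on the classpath
server:
  port: 8081
management:
  endpoints:
    web:
      exposure:
        include: prometheus,health   # expose /actuator/prometheus over HTTP
```

Only the endpoints listed under include are reachable over HTTP, so the Prometheus endpoint has to be named explicitly.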

Step 2: Start the service and open the metrics endpoint in a browser:
http://ip:8081/actuator/prometheus

The endpoint returns data in the format Prometheus collects, consisting of metric names and their values.

Step 3: Add the application endpoint above to Prometheus as a job by editing the Prometheus configuration file, prometheus.yml.
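
A sketch of what the added job might look like. The job name is hypothetical, and the target reuses the ip:8081 placeholder from the URL in step 2.

```yaml
# prometheus.yml (fragment): scrape the Spring Boot application
scrape_configs:
  - job_name: springboot-app             # hypothetical job name
    metrics_path: /actuator/prometheus   # the default path would be /metrics
    scrape_interval: 15s
    static_configs:
      - targets: ['ip:8081']             # replace ip with the service host
```

The metrics_path override is the important part: without it Prometheus would request /metrics, which the Actuator setup does not serve.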

Step 4: Restart Prometheus
Run systemctl restart prometheus. A new entry for the application then appears among the Prometheus targets.

[Figure: Prometheus targets after the restart]

These are the metrics the application exposes to Prometheus by default. We can query them with PromQL and display the query results in Grafana.

Step 5: For convenient display, add a panel in Grafana, save it, and show it on the dashboard.

We also add a JVM template in Grafana. The Grafana website provides many dashboard templates; entering a template ID is enough to complete the dashboard configuration.
For more templates, see: https://grafana.com/grafana/dashboards/

This completes the monitoring of common JVM metrics. The configuration above can also be adjusted to actual needs by changing what each panel displays.

3、Configuring alert notifications
1) Alert notification module
At this point we can collect data and view it in Grafana. But what is still missing from the system? The most important purpose of monitoring is to judge whether the monitored system is healthy and, when it is not, to notify the relevant people so that they can investigate and fix problems promptly. That is alert notification, and a module for it is still missing. The Prometheus alerting mechanism consists of two parts:

Alerting rules
Prometheus evaluates the alerting rules loaded through rule_files and sends the resulting alerts to Alertmanager.

Alert management and notification
This module is Alertmanager. It manages alerts, deduplicates them, and sends notifications. Many notification channels are available, such as email, HipChat, Slack, and webhooks.

In practice, start by creating a new rule file, alert_rules.yml.
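
The original screenshot of the rule file is missing. The sketch below reconstructs a rule consistent with the description in this section (InstanceDown, up == 0, for: 5m); the group name, the severity label, and the annotation wording are assumptions.

```yaml
# alert_rules.yml: a reconstruction, not the article's original file
groups:
  - name: instance-alerts              # hypothetical group name
    rules:
      - alert: InstanceDown
        expr: up == 0                  # the target failed its last scrape
        for: 5m                        # condition must hold for 5 minutes before firing
        labels:
          severity: critical           # hypothetical custom label
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
```

The up metric is generated by Prometheus itself for every scrape target, so this rule works for any job without extra instrumentation.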

The rule file defines a single alert rule, InstanceDown.

InstanceDown is triggered when an instance is down (up == 0), and the alert is sent once the condition has lasted 5 minutes (for: 5m).

  • alert: the name of the alert rule.
  • expr: the alert trigger condition as a PromQL expression, used to compute whether any time series satisfies the condition.
  • for: the evaluation wait time, an optional parameter. The alert is sent only after the trigger condition has held for this long; during the wait, a newly triggered alert is in the pending state.
  • labels: custom labels; lets the user specify a set of additional labels to attach to the alert.
  • annotations: a set of additional information, such as text describing the alert details. The content of the annotations is sent to Alertmanager as parameters when the alert fires.
  • summary describes the alert in brief; description describes it in detail.
  • The Alertmanager UI also displays alert information based on these two annotation values.

Next, reference the alert rule file from prometheus.yml.
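
A minimal sketch of the relevant additions, assuming alert_rules.yml sits in the same directory as prometheus.yml and Alertmanager runs locally on its default port 9093:

```yaml
# prometheus.yml (fragment): load the rules and point at Alertmanager
rule_files:
  - alert_rules.yml              # assumed to sit next to prometheus.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']   # assumed Alertmanager address
```

rule_files tells Prometheus which rules to evaluate, while the alerting section tells it where to send alerts once they fire; both are needed for notifications to go out.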

After restarting Prometheus, the defined alert rules are visible on the Alerts page.

Status description: a Prometheus alert has three states: Inactive, Pending, and Firing.

  • Inactive: the rule is being monitored, but no alert has been triggered yet;
  • Pending: the trigger condition is met but has not yet held for the evaluation wait time (for). Once it has held for the full duration, the alert moves to the Firing state;
  • Firing: the alert is sent to AlertManager, which can group, inhibit, or silence alerts and sends notifications to all receivers as configured. Once the alert condition clears, the state returns to Inactive, and the cycle repeats.

Simulate a service outage by killing the process.

At this point the condition up == 0 is satisfied and the State becomes PENDING.

Once the condition has lasted longer than the evaluation wait time of 5m (for: 5m), the State becomes FIRING. An alert is generated and sent to Alertmanager for handling.

2) Alert notification module
Above we can see alert information on the Alerts page, but how do the relevant R&D and business staff get notified? That is the job of Alertmanager.

So let's configure the Alertmanager file, alertmanager.yml.
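
The article's own alertmanager.yml is not shown. A minimal email setup might look like the sketch below; the SMTP server, addresses, password, receiver name, and timing values are all placeholders.

```yaml
# alertmanager.yml: an illustrative sketch with placeholder values
global:
  smtp_smarthost: 'smtp.example.com:587'     # hypothetical SMTP server
  smtp_from: 'alerts@example.com'
  smtp_auth_username: 'alerts@example.com'
  smtp_auth_password: 'change-me'

route:
  receiver: ops-email            # default receiver for all alerts
  group_by: ['alertname']        # alerts with the same name share one notification
  group_wait: 30s                # wait before sending the first notification of a group
  group_interval: 5m             # wait before notifying about new alerts in a group
  repeat_interval: 4h            # resend interval while an alert keeps firing

receivers:
  - name: ops-email
    email_configs:
      - to: 'oncall@example.com'     # hypothetical recipient
        send_resolved: true          # also mail when the alert is resolved
```

The route section decides which receiver gets an alert and how alerts are batched, while receivers define the delivery channels themselves.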

After Alertmanager receives an alert, it sends the alert information according to the current configuration, for example by email:

[Figure: alert email sent by Alertmanager]

When the service is restarted and the problem is fixed, you also receive an email saying so. If you do not want to receive this email, set receivers.email_configs.send_resolved: false in alertmanager.yml.

Comparative analysis of Prometheus and Grafana

1、Advantages and disadvantages of Prometheus:
Monitoring data is stored in a time series database, which makes aggregation easy. Each component has a mature high-availability solution with no single point of failure; Prometheus does not depend on distributed storage, and a single server node can work on its own. It uses PromQL, a flexible query language that can perform complex queries over multidimensional data. It also supports a variety of charts and UI displays, and is generally used together with Grafana.

Prometheus is metric-based monitoring and is not suited to logs, events, or call chains (tracing). It is better at showing trends than exact data. Prometheus is mainly meant for querying recent monitoring data: its local storage is designed for short-term data (for example, one month), so storing large amounts of historical data is not supported. If long-term historical data is needed, it is recommended to save it in a system such as InfluxDB or OpenTSDB through the remote storage mechanism.

In addition, the matching rule configuration of Alertmanager is quite complex and is a secondary customization built on the Prometheus time series database. This means that if you change the database midway, the previous configuration has to be readjusted, which is a real drawback for later configuration work. In particular, its key values also depend on the metrics information of the exporters; from a coding point of view there is no interface encapsulation, which is a big challenge for maintainability.

2、Advantages and disadvantages of Grafana:
Grafana is highly usable, with many chart plug-ins to choose from as needed, and the community has uploaded many ready-to-use dashboards. Like the other components, its strengths include being open source and working out of the box.

But community contributions also bring the same problem as with exporters. First, community dashboards often do not match the exporters well, and it takes a lot of time to adjust them after downloading. Second, the access API is not friendly: for Prometheus, for example, the access API is simply the query language of the Prometheus time series database, which takes operators unfamiliar with it a lot of time to understand. Finally, list-style information is lacking: community dashboards consist mainly of charts, which serve overall analysis but are of little help in pinpointing and solving specific problems.

Taken together, the combination of Prometheus and Grafana is essentially the standard configuration of mainstream monitoring today: Prometheus serves as the storage back end, and Grafana provides analysis and the visual interface. It is a time-saving, labor-saving, and efficient solution that can meet most companies' monitoring needs for both systems and business. Of course, industries, companies, businesses, and roles differ, and so does everyone's understanding of monitoring. There are other good open-source monitoring frameworks, such as Sensu, which combined with InfluxDB and Grafana can be used to build a monitoring platform tailored to your own company. Most importantly, monitoring should be designed from the perspective of the company's business, not for the sake of using a particular monitoring technology.
