[Serial] An In-Depth Look at O&M Monitoring Systems 01: Overview of Monitoring Systems

2022-06-26 20:48:00 【Long yuan, Qin Wu】

This course series surveys solutions across the industry, compares them side by side, and then uses the Nightingale monitoring system as a blueprint to introduce every aspect of a monitoring system. After working through it, you should have a comprehensive understanding of monitoring systems. Intended audience: DevOps engineers, SREs, and R&D engineers. The author has 10 years of DevOps development experience and 8 years of experience developing monitoring systems, so the course covers not only usage but also many of the underlying principles.

Monitoring Overview

What problem does a monitoring system solve? What we call a monitoring system is really just one of the three pillars of observability. What are those three pillars? And what are the specific characteristics of metric monitoring, as one of them? This chapter focuses on answering these questions.

Where the requirements come from

The original requirement fits in one sentence: when the system goes wrong, we should know about it in time. Over time, of course, we have come to ask more of the monitoring system, for example:

Understand trends through monitoring, so we know the system may run into trouble at some point in the future
Understand the system's capacity usage through monitoring, so we can scale out in time when resources run short
Take the system's pulse through monitoring and see where optimization is needed, for example tuning certain middleware parameters
Gain insight into the business through monitoring: know how the business is doing, and notice promptly when it behaves abnormally

The monitoring system is becoming more and more important. Besides meeting the needs above, it also accumulates knowledge: the stability knowledge captured in a monitoring system may well be richer than what many engineers carry in their heads. This, of course, comes from continuously maintaining the monitoring system, especially the ongoing work of senior engineers.

Three pillars of observability

What we usually call a monitoring system is really just metric monitoring: line charts on a dashboard. The CPU utilization of a machine, the traffic of a database instance, or the number of users online on a site each show up as a line that changes over time.

[Figure: line chart of a metric over time]

Metric monitoring handles only numbers, but historical data is cheap to store, real-time performance is good, and the ecosystem is huge. It is the most important pillar of observability. Many systems handle only metric data, such as Zabbix, Open-Falcon, Prometheus, and Nightingale, all of which are widely used in the industry.

Besides metric monitoring, another important pillar of observability is logs. Logs carry a great deal of information and are critical for understanding how the software and the business are running. Operating system logs, access-layer logs, and service logs are all important data sources. From operating system logs we can learn about many system-level events. From access-layer logs we can see which domains, IPs, and URLs were accessed, whether the requests succeeded, and how long they took. From service logs we can find exception information, call stacks, and so on, which is critical for troubleshooting.

There are also many specialized systems for log processing. The leading open source product is ELK, and commercial products include Splunk and Datadog.

[Screenshot: a log query in ELK]

With metrics and logs covered, one pillar of observability remains: distributed tracing. With the spread of microservices, what used to be a single monolithic application is split into many small services with intricate call relationships among them, so when a problem occurs it is not easy to work out which module caused it.

The idea of distributed tracing is to tie together the upstream and downstream modules of a request. A random string ID is generated for each request and passed down layer by layer in calls between services. How long each layer took and whether it succeeded can be collected and attached to that request ID, so that when you investigate a problem later, the request ID lets you pull out all the related information. There are many products in this field, such as Skywalking, Jaeger, and Zipkin, all of them leaders.

[Screenshot: Zipkin]
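The ID-passing idea can be illustrated with a minimal sketch. It assumes three hypothetical layers (gateway, order_service, database) living in one process and an in-memory dictionary as the span store. Real tracing systems propagate the ID over the network and ship spans to a backend, which is not shown here.

```python
import time
import uuid

# Hypothetical in-memory span store: trace_id -> list of span records.
SPANS = {}

def record_span(trace_id, service, started, ok):
    """Attach one layer's duration and status to the request's trace ID."""
    SPANS.setdefault(trace_id, []).append({
        "service": service,
        "duration_ms": (time.monotonic() - started) * 1000.0,
        "ok": ok,
    })

def database(trace_id):
    started = time.monotonic()
    record_span(trace_id, "database", started, ok=True)

def order_service(trace_id):
    started = time.monotonic()
    database(trace_id)  # the same ID is passed down to the next layer
    record_span(trace_id, "order_service", started, ok=True)

def gateway():
    trace_id = uuid.uuid4().hex  # random ID generated once per request
    started = time.monotonic()
    order_service(trace_id)
    record_span(trace_id, "gateway", started, ok=True)
    return trace_id

trace_id = gateway()
# Looking up the ID later returns every layer the request went through.
services = [span["service"] for span in SPANS[trace_id]]
```

Spans are recorded when each layer finishes, so the innermost call appears first in the list.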

Although we divide observability into three pillars, they are in fact closely related. For example, we often extract metrics from logs and send them to the metric monitoring system, or extract trace information from logs for analysis. There is plenty of such practice in the industry.
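As a rough illustration of extracting metrics from logs, the sketch below turns access-log lines into a few numeric values. The log format, the function name metrics_from_access_log, and the metric names are all assumptions made for this example, not the format of any particular product.

```python
import re
from collections import Counter

# Assumed access-log format: "<ip> <method> <url> <status> <latency_ms>"
LINE_RE = re.compile(r"^(\S+) (\S+) (\S+) (\d{3}) (\d+(?:\.\d+)?)$")

def metrics_from_access_log(lines):
    """Count requests per status class and compute the average latency."""
    status_counts = Counter()
    total_latency = 0.0
    parsed = 0
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        status, latency = match.group(4), float(match.group(5))
        status_counts[status[0] + "xx"] += 1
        total_latency += latency
        parsed += 1
    result = {f"http_requests_{k}": float(v) for k, v in status_counts.items()}
    result["http_latency_ms_avg"] = total_latency / parsed if parsed else 0.0
    return result

sample = [
    "10.0.0.1 GET /api/users 200 12.5",
    "10.0.0.2 POST /api/orders 500 230",
    "10.0.0.1 GET /api/users 200 7.5",
    "malformed line",
]
log_metrics = metrics_from_access_log(sample)
```

The resulting dictionary contains only numbers, so it could be handed to a metric monitoring system like any other sample.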

Characteristics of metric monitoring products

We focus on metric monitoring here and leave logs and tracing aside. To understand metric monitoring, we first need to understand what a metric is. Put plainly, a metric is a measurement of some target. Take the Linux operating system: we can measure its load from many angles. For CPU there are metrics such as cpu_usage_system (share of CPU time spent in kernel mode), cpu_usage_user (share of CPU time spent in user mode), and cpu_usage_idle (share of CPU time spent idle). For memory there are metrics such as mem_available_percent (percentage of memory available) and mem_used (memory used). For disk there are metrics such as disk_used_percent (disk usage percentage) and diskio_write_bytes (bytes written to disk).

Typically, a client program is installed in the OS and runs as a resident process. It collects data at a fixed interval (for example every 15 seconds) and sends what it collects to the server for storage and analysis.
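The collect-and-send loop can be sketched as follows. This is a simplified illustration, not the code of any real agent: read_metrics and send are hypothetical callbacks standing in for the actual OS readers and the network client, and the rounds and sleep parameters exist only so the loop can be exercised without waiting.

```python
import time

def collect_once(read_metrics, now=None):
    """Take one sample; every value in the batch shares one timestamp."""
    ts = int(now if now is not None else time.time())
    return [
        {"metric": name, "value": float(value), "timestamp": ts}
        for name, value in read_metrics().items()
    ]

def run_agent(read_metrics, send, interval_seconds=15, rounds=None, sleep=time.sleep):
    """Collect at a fixed interval and hand each batch to `send`.

    rounds=None loops forever, like a resident process.
    """
    done = 0
    while rounds is None or done < rounds:
        send(collect_once(read_metrics))
        done += 1
        if rounds is None or done < rounds:
            sleep(interval_seconds)

# Hypothetical reader returning fixed values instead of real OS readings.
def fake_read_metrics():
    return {"cpu_usage_idle": 93.5, "mem_available_percent": 41.2}

received = []  # stands in for the server side
run_agent(fake_read_metrics, received.append, rounds=2, sleep=lambda seconds: None)
```

Each batch is a list of numeric values with a name and a timestamp, which is the shape of data the rest of this chapter discusses.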

From this we can see several characteristics of metric monitoring:

Generally only numeric data is handled, not strings (a few monitoring systems can handle strings, but most do not)
Metric data is time series data: the collected values are reported once every fixed interval, indefinitely
Metric data is sampled data. With a 15-second interval, for example, you only get the values as they were at each collection moment. Raising the collection frequency (shortening the interval) gives richer and more accurate data, but at a higher cost: more storage and more computing power. In a real production environment, 30 or 60 seconds is enough; where precision requirements are high, 15 seconds is enough, and going finer than that adds little. The core purpose of monitoring data is to detect anomalies and trends. A real anomaly generally lasts for a while, and momentary blips usually do not need attention, so sampled data can normally catch anomalies. As for trends, the longer the time range you look at, the lower the collection frequency can be: for 1 hour of data, one point every 15 seconds is fine; for 1 year of data, one point per hour is sufficient, and too many data points may crash the browser
Because the data being handled is time series data, every value carries a timestamp and the data is highly regular. The industry has databases dedicated to this kind of data, called time series databases, such as InfluxDB, VictoriaMetrics, and M3DB. Relying on a time series database has become a typical architectural feature of monitoring systems
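The point about needing fewer data points over longer time ranges is often handled by downsampling, a technique the list above does not name explicitly. A minimal sketch, assuming points are (timestamp in seconds, value) pairs and that averaging within each bucket is acceptable:

```python
from collections import defaultdict

def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]

# One hour of samples at 15-second intervals: 240 points.
raw = [(t, float(t // 900)) for t in range(0, 3600, 15)]

# Reduced to one point per 15 minutes: 4 points.
per_15_min = downsample(raw, 900)
```

Averaging hides short spikes, so a real system might also keep the maximum or minimum per bucket. That choice is outside the scope of this sketch.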

Architecturally, besides relying on a time series database, a metric monitoring system also needs collectors to gather the various metrics, followed by an alerting engine and visualization. A typical system architecture is as follows:

[Figure: typical system architecture]

Collector denotes the collector. Many open source projects specialize in collection, such as Telegraf, Grafana-Agent, Datadog-Agent, and Categraf, as well as the various Exporters of the Prometheus ecosystem. The orange part is the monitoring server, which usually includes UI display and a server-side alerting engine. Finally, it all relies on a time series database (Time Series Database).
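To give a feel for what a server-side alerting engine does, here is a deliberately small sketch of a threshold rule. The function evaluate_rule and its for_points parameter are hypothetical and do not reflect the rule format of Nightingale or any other product. The rule fires only when the most recent samples all exceed the threshold, which matches the earlier observation that a real anomaly usually lasts for a while.

```python
def evaluate_rule(values, threshold, for_points):
    """Return True if the latest `for_points` samples all exceed the threshold."""
    if for_points <= 0 or len(values) < for_points:
        return False  # not enough data to decide
    return all(v > threshold for v in values[-for_points:])

# CPU busy percentage sampled at a fixed interval (made-up values).
cpu_busy = [35.0, 97.0, 40.0, 91.0, 93.0, 95.0]

# A single spike followed by recovery does not fire the rule.
spike_only = evaluate_rule(cpu_busy[:3], threshold=90.0, for_points=3)

# Three consecutive samples above the threshold do.
sustained = evaluate_rule(cpu_busy, threshold=90.0, for_points=3)
```

In a real system the values would be queried from the time series database on a schedule, and a firing rule would trigger a notification.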

That's all for this chapter. Later installments will keep being published on the author's blog; you are welcome to follow along.

Author: Long yuan, Qin Wu. Online ID: UlricQin. Follows are welcome.
Nightingale: a cloud-native monitoring system, open source and developed in China, under the Open Source Development Committee of the China Computer Federation. Project home page: https://n9e.github.io/

Copyright notice
This article was written by [Long yuan, Qin Wu]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/177/202206262031216398.html