[Series] O&M Monitoring Systems Explained 01: Overview of Monitoring Systems
2022-06-26 20:48:00 【Long yuan, Qin Wu】
This series surveys monitoring solutions across the industry, compares them side by side, and then uses the Nightingale monitoring system as a blueprint to introduce every aspect of a monitoring system. After working through the material, you should have a comprehensive understanding of monitoring systems. Intended audience: DevOps engineers, SREs, and R&D engineers. The author has 10 years of DevOps development experience and 8 years of experience building monitoring systems, so the course covers not only how to operate the tools but also many of the underlying principles.
Monitoring Overview
What problems does a monitoring system solve? What we usually call a monitoring system is really just one of the three pillars of observability. What are those three pillars? And what specific characteristics does metrics monitoring, as one of them, have? This chapter focuses on answering these questions.
Where the requirements come from
The original requirement fits in one sentence: when the system goes wrong, we should be able to notice it promptly. Over time, of course, we have come to ask more of monitoring systems, for example:
- Understand trends through monitoring, so we know the system is likely to run into trouble at some point in the future
- Understand the system's utilization level (its "water level") through monitoring, so capacity can be expanded in time when resources run short
- Take the system's pulse through monitoring and see where optimization is needed, for example tuning certain middleware parameters
- Gain insight into the business through monitoring: know how the business is developing, and notice promptly when it behaves abnormally
Monitoring systems are becoming more and more important. Besides meeting the needs above, they also accumulate knowledge: the stability know-how captured in a monitoring system may well be richer than what many engineers carry in their heads. That, of course, depends on the monitoring system being continuously maintained, especially by senior engineers.
Three pillars of observability
What we usually call a monitoring system is really just metrics monitoring: line charts on a dashboard. The CPU utilization of a machine, the traffic of a database instance, or the number of users online on a website each appears as a line that changes over time, for example:

Metrics monitoring handles only numeric data. Historical data is cheap to store, the data is close to real time, and the ecosystem is huge, which makes it the most important pillar of observability. Many systems handle only metrics data; Zabbix, Open-Falcon, Prometheus, and Nightingale are all widely used in the industry.
Besides metrics monitoring, another important pillar of observability is logs. Logs carry a great deal of information that is critical for understanding how the software and the business are running. Operating system logs, access-layer logs, and service logs are all important data sources. From the operating system log we can learn about many system-level events. From the access-layer log we can see which domain names, IPs, and URLs were accessed, whether the requests succeeded, and how long they took. From the service log we can find exception information, call stacks, and so on, all of which is critical for troubleshooting.
There are also many specialized systems for processing logs. The first choice among open-source products is ELK; commercial products include Splunk and Datadog. Below is a screenshot of a log query in ELK:

With metrics and logs covered, one pillar of observability remains: distributed tracing (link tracking). As microservices have become widespread, what used to be a single application is split into many small services with intricate call relationships between them. When a problem occurs, it is not easy to determine which module caused it.
The idea of tracing is to string together the upstream and downstream modules that handle a request. A random string ID is generated for each request, and this ID is passed down layer by layer as services call one another. How long each layer took and whether it handled the request successfully can be collected and attached to the request ID. When investigating a problem later, the request ID is enough to pull out all of the linked information. There are many products in this field as well, such as Skywalking, Jaeger, and Zipkin, all of them excellent. Below is a screenshot of Zipkin:

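The ID-propagation idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the service names (`gateway`, `order-service`, `database`), the `Tracer` class, and the in-memory span list are all hypothetical, and real systems such as Zipkin or Jaeger carry the ID across the network (for example in request headers) rather than through function arguments.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class Span:
    trace_id: str       # shared by every hop of one request
    service: str        # which module handled this hop
    duration_ms: float  # how long this layer took
    ok: bool            # whether this layer handled the request normally


@dataclass
class Tracer:
    """Hypothetical in-memory collector; a real one would ship spans to a server."""
    spans: list = field(default_factory=list)

    def record(self, trace_id, service, start, ok):
        elapsed_ms = (time.perf_counter() - start) * 1000
        self.spans.append(Span(trace_id, service, elapsed_ms, ok))

    def by_trace(self, trace_id):
        return [s for s in self.spans if s.trace_id == trace_id]


tracer = Tracer()


def database(trace_id):
    start = time.perf_counter()
    tracer.record(trace_id, "database", start, ok=True)


def order_service(trace_id):
    start = time.perf_counter()
    database(trace_id)  # pass the same ID down to the next layer
    tracer.record(trace_id, "order-service", start, ok=True)


def gateway():
    trace_id = uuid.uuid4().hex  # random string ID, generated once per request
    start = time.perf_counter()
    order_service(trace_id)
    tracer.record(trace_id, "gateway", start, ok=True)
    return trace_id


trace_id = gateway()
for span in tracer.by_trace(trace_id):
    print(span.service, round(span.duration_ms, 3), span.ok)
```

Looking up one `trace_id` returns every hop of that request, which is what makes it possible to see which layer was slow or failed.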
Although we divide observability into 3 pillars, they are in fact closely related. For example, we often extract metrics from logs and send them to the metrics monitoring system, or extract trace information from logs for analysis. There are many such practices in the industry.
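As a rough sketch of extracting metrics from logs, the snippet below turns access-log lines into a few numeric values. The log format, the regular expression, and the metric names (`requests_total`, `errors_5xx`, `latency_avg_ms`) are assumptions made for illustration, not the format of any particular access layer.

```python
import re
from collections import Counter

# Assumed line format: <ip> "<method> <url>" <status> <latency_ms>
LINE_RE = re.compile(r'^(\S+) "(\S+) (\S+)" (\d{3}) (\d+)$')


def metrics_from_log(lines):
    """Aggregate access-log lines into numeric metrics."""
    status_counts = Counter()
    latencies = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed format
        status_counts[match.group(4)] += 1
        latencies.append(int(match.group(5)))
    average = sum(latencies) / len(latencies) if latencies else 0.0
    return {
        "requests_total": sum(status_counts.values()),
        "errors_5xx": sum(n for s, n in status_counts.items() if s.startswith("5")),
        "latency_avg_ms": average,
    }


sample = [
    '10.0.0.1 "GET /api/orders" 200 12',
    '10.0.0.2 "GET /api/orders" 500 230',
    '10.0.0.3 "POST /api/pay" 200 58',
    "garbage line",
]
print(metrics_from_log(sample))
# {'requests_total': 3, 'errors_5xx': 1, 'latency_avg_ms': 100.0}
```

The resulting numbers can then be reported to a metrics system like any other collected value.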
Characteristics of metrics monitoring products
This series focuses on metrics monitoring and does not cover logs or tracing. To understand metrics monitoring, we first need to understand what a metric is. Put simply, a metric is a measurement of some target. Take the Linux operating system: we can measure its load from many angles. For CPU there are metrics such as cpu_usage_system (share of CPU time spent in kernel mode), cpu_usage_user (share of CPU time spent in user mode), and cpu_usage_idle (share of CPU time spent idle). For memory there are metrics such as mem_available_percent (percentage of memory available) and mem_used (memory used). For disk there are metrics such as disk_used_percent (disk usage percentage) and diskio_write_bytes (bytes written to disk).
Typically, a client program is installed in the OS and runs as a resident process. It collects data at a fixed interval (for example, every 15 seconds) and then sends the collected data to the server for storage and analysis.
From this we can see several characteristics of metrics monitoring:
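A minimal sketch of such a resident collector loop, assuming a hypothetical `send` callback in place of a real network client. Only the metric name `disk_used_percent` comes from the text above; the function names and the data point layout are illustrative.

```python
import shutil
import time


def collect_disk_metrics(path="/"):
    """Read disk usage for `path` using only the standard library."""
    usage = shutil.disk_usage(path)
    percent = usage.used / usage.total * 100 if usage.total else 0.0
    return {"disk_used_percent": percent}


def run_agent(send, interval_seconds=15, rounds=None, sleep=time.sleep, now=time.time):
    """Collect at a fixed interval and hand each data point to `send`.

    `rounds=None` means run forever, as a resident process would;
    a finite value is only there to make the sketch easy to try out.
    """
    done = 0
    while rounds is None or done < rounds:
        timestamp = int(now())
        for name, value in collect_disk_metrics().items():
            send({"metric": name, "timestamp": timestamp, "value": value})
        done += 1
        if rounds is None or done < rounds:
            sleep(interval_seconds)


received = []  # stands in for the monitoring server
run_agent(received.append, interval_seconds=15, rounds=2, sleep=lambda seconds: None)
print(received)
```

Production collectors such as Categraf or Telegraf add batching, retries, and many more plugins, but the collect, timestamp, and send cycle is the same.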
- Generally, only numeric data is processed, not strings (a few monitoring systems can also handle strings, but most cannot)
- Metrics data is time series data: the collected data is reported once every fixed interval, without end
- Metrics data is sampled data. With a 15-second interval, for example, only the values at each sampling moment are captured. Collecting more often (a shorter interval) gives richer and more accurate data, but at a higher cost: more storage and more computing power. In real production environments, 30 or 60 seconds is usually enough, and 15 seconds suffices where higher precision is required. Going finer than that adds little, because the core purpose of monitoring data is to perceive anomalies and trends. A real anomaly generally lasts for some time, and occasional blips usually do not need attention, so sampled data can normally catch anomalies. As for trends, the longer the time range being viewed, the lower the resolution that is needed: for 1 hour of data, one point every 15 seconds is fine; for 1 year of data, one point per hour is sufficient. Too many data points may crash the browser
- Because the data being processed is time series data, every value carries a timestamp. Such data is very regular, and the industry has databases dedicated to it, called time series databases, such as InfluxDB, VictoriaMetrics, and M3DB. Relying on a time series database has become a typical architectural feature of monitoring systems
Architecturally, besides depending on a time series database, a metrics monitoring system also needs a collector to gather the various metrics, plus an alerting engine and visualization. A typical system architecture looks like this:
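The trade-off between collection interval, time range, and number of points can be checked with simple arithmetic. The sketch below counts points for the ranges mentioned above and shows one possible way to downsample by averaging into fixed-width buckets; the helper names are hypothetical, and real time series databases use their own, more sophisticated downsampling strategies.

```python
def point_count(range_seconds, interval_seconds):
    """Number of samples one series produces over a time range."""
    return range_seconds // interval_seconds


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = {}
    for timestamp, value in points:
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets.setdefault(bucket_start, []).append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


HOUR = 3600
YEAR = 365 * 24 * HOUR

print(point_count(HOUR, 15))    # 240
print(point_count(YEAR, 15))    # 2102400
print(point_count(YEAR, HOUR))  # 8760

# Two hours of 15-second samples reduced to one point per hour.
raw = [(t, float(t // HOUR)) for t in range(0, 2 * HOUR, 15)]
print(downsample(raw, HOUR))    # [(0, 0.0), (3600, 1.0)]
```

One year of a single series at 15-second resolution is over two million points, which is why a chart of that range is better served by roughly one point per hour.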

Collector denotes the collector. Many open-source projects specialize in collection, such as Telegraf, Grafana-Agent, Datadog-Agent, and Categraf, as well as the many Exporters in the Prometheus ecosystem. The orange part is the monitoring server, which usually includes the UI display capability and the server-side alerting engine. Finally, the system depends on a time series database (Time Series Database).
That is all for this chapter. Later installments will continue to be published on the author's blog; you are welcome to follow along.
Author: Long yuan, Qin Wu. Internet ID: UlricQin, follows welcome. Nightingale: a cloud-native monitoring system, open-sourced in China, under the Open Source Development Committee of the China Computer Federation. Project home page: https://n9e.github.io/
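To make the alerting engine's role concrete, here is a deliberately simplified sketch of one rule type: fire only when the most recent samples all exceed a threshold, so that a single occasional spike is ignored. The function name and parameters are assumptions for illustration; this is not how Nightingale or any other specific product implements alerting.

```python
def should_alert(values, threshold, min_consecutive):
    """Return True if the latest `min_consecutive` samples all exceed `threshold`."""
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    if len(values) < min_consecutive:
        return False  # not enough data yet to decide
    return all(v > threshold for v in values[-min_consecutive:])


cpu_busy_percent = [35.0, 97.0, 40.0, 92.0, 95.0, 96.0]

# A single spike (97.0) earlier in the series does not matter;
# the last three samples are all above 90, so the rule fires.
print(should_alert(cpu_busy_percent, threshold=90.0, min_consecutive=3))      # True
print(should_alert(cpu_busy_percent[:3], threshold=90.0, min_consecutive=3))  # False
```

With a 15-second collection interval, `min_consecutive=3` corresponds to a condition that has lasted roughly 45 seconds.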