当前位置:网站首页>Ideas and methods of system and application monitoring
Ideas and methods of system and application monitoring
2022-07-06 20:05:00 【Impl_ Sunny】
0、 Preface
In the actual performance analysis , A very common phenomenon is , Obviously, there is a performance bottleneck , But when you log in to the server and want to check , But found that the bottleneck has disappeared . Or say , Performance problems always happen from time to time , But it's hard to find the law of occurrence , It's hard to reproduce .
And to solve this problem , We need to build a monitoring system , Monitor the health of systems and Applications , And define a series of strategies , When a problem occurs, the alarm will be notified at the first time . A good monitoring system , It can not only expose various problems of the system in real time , It can also be based on these monitored States , Automatically analyze and locate approximate bottleneck sources , So as to report the problem to the relevant team more accurately . To do a good job of monitoring , The core is comprehensive 、 Quantifiable indicators , This includes system and application .
In terms of system , The monitoring system shall cover the overall resource utilization of the system , such as CPU、 Memory 、 Disk and file system 、 Network and other system resources .
In terms of applications , The monitoring system should cover the internal operation status of the application , This includes both the... Of the process CPU、 disk I/O And so on , More needs to be included, such as time-consuming interface calls 、 Errors in execution 、 Memory usage of internal objects and other internal running conditions of the application .
One 、 System monitoring
1.1 USE Law
Before starting the monitoring system , You must want to know most , How can we use a concise method , To describe the use of system resources . You can certainly use the various performance tools learned in this column , To collect the usage of various resources . But don't forget , There are many performance indicators for each resource , Using too many indicators is time-consuming and labor-consuming. Don't say , It's not easy to establish the overall operation status of the system for you .
ad locum , I'd like to introduce you to a system for performance monitoring USE(Utilization Saturation and Errors) Law .
USE The performance index of system resources , Reduced to three categories , I.e. utilization rate 、 Saturation and number of errors :
- Usage rate : Indicates the percentage of time or capacity that the resource is used for service .100% The usage rate of , Indicates that the capacity has been exhausted or all the time is used for service .
- saturation : Indicates how busy the resource is , Usually related to the length of the waiting queue .100% Saturation of , Indicates that the resource cannot accept more requests .
- Wrong number : Indicates the number of events with errors . The more errors , It indicates that the more serious the problem of the system .
Indicators of these three categories , It covers the common performance bottlenecks of system resources , Therefore, it is often used to quickly locate the performance bottleneck of system resources . such , Both for CPU、 Memory 、 Disk and file system 、 Network and other hardware resources , Or the number of file descriptors 、 The number of connections 、 Connection tracking number and other software resources ,USE Methods can help you quickly locate , Which system resources have performance bottlenecks .
1.2 Performance indicators
The following are common performance indicators :
Be careful :USE The method only focuses on the core indicators that can reflect the bottleneck of system resource performance , But this is not to say that other indicators are not important . Such as system log 、 Process resource usage 、 Cache usage and other indicators , We also need to monitor . It's just , They are often used as an aid to performance analysis , and USE Indicators of the method , It directly indicates the resource bottleneck of the system .
1.3 The monitoring system
master USE Methods and performance indicators to be monitored , What to do next , Is to establish a monitoring system , Save these indicators ; then , Based on these monitored States , Automatically analyze and locate approximate bottleneck sources ; Last , Then through the alarm system , Report the problem to the relevant team in time . A complete monitoring system usually consists of data acquisition 、 data storage 、 Data query and processing 、 Alarm, visual display and other modules . therefore , To build a monitoring system from scratch , In fact, it is also a big system engineering .
Refer to the scheme for setting up the monitoring system :
Prometheus and Zabbix Comparison of
Prometheus+Grafana Installation tutorial practice
1.4 Summary
The core of system monitoring is the use of resources , This includes both CPU、 Memory 、 disk 、 file system 、 Network and other hardware resources , It also includes the number of file descriptors 、 The number of connections 、 Connection tracking number and other software resources . To describe these resource bottlenecks , The easiest and most effective way is USE Law .
USE The performance index of system resources , Reduced to three categories : Usage rate 、 Saturation and number of errors . When the indicators of any of these three categories are too high , All represent that the corresponding system resources may have performance bottlenecks .
be based on USE After establishing the performance index , We also need a complete monitoring system , Collect these indicators from 、 Storage 、 Inquire about 、 Handle , Then to alarm and visual display . such , It can not only quickly expose the bottleneck of system resources , You can also use the monitored historical data , To track and locate the root cause of performance problems .
Two 、 Application monitoring
2.1 Application monitoring indicators
Like system monitoring , Before building the application's monitoring system , First of all, we also need to determine , What indicators need to be monitored . Especially be clear , What metrics can be used to quickly identify application performance problems .
2.1.1 Core indicators
Core metrics for applications , It is no longer the use of resources , It's the number of requests 、 Error rate and response time .
These indicators are not only directly related to the user experience , It also reflects the overall availability and reliability of the application . With the number of requests 、 After the three golden indicators of error rate and response time , We can quickly know , Whether the application has performance problems .
Request 、 Details of response time , May refer to :【 Parameters 】 Make it clear QPS、TPS、 Number of concurrent users 、 throughput
2.1.2 Performance evaluation indicators
Although we can know that the application has problems through the core indicators , however , These indicators are obviously not enough , Because after a performance problem , We also want to be able to quickly locate “ Performance bottleneck area ”. therefore , in my opinion , The following indicators , It is also essential when monitoring applications :
Resource usage of the application process :
- describe : For example, the process takes up CPU、 Memory 、 disk I/O、 Network, etc . Using too many system resources , Cause the application to respond slowly or increase the number of errors , Is one of the most common performance problems .
- effect : The bottleneck of system resources can be associated with Applications , So as to quickly locate the performance problems caused by insufficient system resources
Calls between applications :
- describe : For example, call frequency 、 Wrong number 、 Delay, etc . Because applications are not isolated , If the other applications it depends on have performance problems , The performance of the application itself will also be affected .
- effect : It can quickly analyze the call chain of a request processing , Which component is the culprit of performance problems
The operation of the core logic inside the application :
- describe : For example, the time-consuming of key links and errors in the implementation process . Because this is the internal state of the application , Detailed performance data cannot be obtained directly from the outside .
- effect : Can go further , Go directly inside the application , Locate the function of which processing link causes the performance problem .
Based on these ideas , I believe you can build , Performance indicators that describe the running state of an application . Then these indicators are incorporated into the monitoring system ( such as Prometheus + Grafana) in , It can be the same as system monitoring , On the one hand, through the alarm system , Report the problem to the relevant team in time ; On the other hand , Through an intuitive graphical interface , Dynamically demonstrate the overall performance of the application .
2.2 Full link monitoring
Business systems usually involve a series of multiple services , Form a complex distributed call chain . In order to quickly locate such cross application performance bottlenecks , You can still use it skywalking、Zipkin、Jaeger、Pinpoint And other open source tools , To build a full link tracking system .
Full link tracking can help you quickly locate , During a request processing , Which link is the root of the problem .
Full link tracking can not only help you quickly locate cross application performance problems , It can also help you generate the call topology of the online system . These intuitive topologies , In analyzing complex systems ( For example, micro services ) Especially effective when .
2.3 Log monitoring
Monitoring of performance indicators , It allows you to quickly locate the bottleneck , However, only indicators are often not enough . such as , The same interface , When the parameters passed in by the request are different , It can lead to completely different performance problems . therefore , In addition to indicators , We also need to monitor the context information of these indicators , And logs are the best source of these contexts .
By contrast , Indicators are numerical measurement data in a specific period of time , Usually in the form of time series , Suitable for real-time monitoring .
The log is completely different , Logs are string messages at a certain point in time , It is usually necessary to index the search engine , Only then can you perform query and summary analysis .
For log monitoring , The most classic method , Is the use of ELK Technology stack , That is to use Elasticsearch、Logstash and Kibana The combination of these three components . As shown in the figure below , classical ELK Architecture diagram :
- Logstash : Responsible for collecting logs from various log sources , And then preprocess , Finally, the preliminary processed logs , Send to Elasticsearch Index .
- Elasticsearch : Be responsible for indexing the log , It also provides a complete full-text search engine , This will make it easy for you to retrieve the data you need from the log .
- Kibana : Be responsible for visual analysis of logs , Including log search 、 Processing and gorgeous dashboard display .
Here's a picture , It's just one. Kibana Example of dashboard , It intuitively shows Apache An overview of your visit .
Be careful :ELK In the technology stack Logstash Resource consumption is relatively large . therefore , In a resource constrained environment , We tend to use lower resource consumption Fluentd, To replace Logstash( It's called EFK Technology stack ).
2.4 Summary
Application monitoring , It can be divided into indicator monitoring and log monitoring :
- Indicator monitoring : It mainly measures the performance indicators in a certain period of time , Then through the way of time series , To deal with 、 Storage and alarm .
- Log monitoring : More detailed context information can be provided , Usually by ELK Technology stack to collect 、 Indexing and graphical presentation .
In complex business scenarios that span multiple different applications , You can also build a full link tracking system . This can dynamically track the performance of each component in the call chain , Generate the whole call flow diagram , So as to speed up the positioning of performance problems of complex applications .
Reference material :
1. Performance optimization and testing : System monitoring 、 Apply the idea of monitoring
边栏推荐
- [cloud native and 5g] micro services support 5g core network
- Tips for web development: skillfully use ThreadLocal to avoid layer by layer value transmission
- Leetcode 30. Concatenate substrings of all words
- Enumeration gets values based on parameters
- OceanBase社区版之OBD方式部署方式单机安装
- What happened to the kernel after malloc() was transferred? Attached malloc () and free () implementation source
- Tencent T3 Daniel will teach you hand-in-hand, the internal information of the factory
- 2022年6月语音合成(TTS)和语音识别(ASR)论文月报
- It's enough to read this article to analyze the principle in depth
- Example of applying fonts to flutter
猜你喜欢
【Yann LeCun点赞B站UP主使用Minecraft制作的红石神经网络】
Information System Project Manager - Chapter VIII project quality management
Hudi vs Delta vs Iceberg
2022年6月语音合成(TTS)和语音识别(ASR)论文月报
PowerPivot - DAX (first time)
New generation garbage collector ZGC
[calculating emotion and thought] floor sweeper, typist, information panic and Oppenheimer
Monthly report of speech synthesis (TTS) and speech recognition (ASR) papers in June 2022
redisson bug分析
【云原生与5G】微服务加持5G核心网
随机推荐
POJ3617 Best Cow Line 馋
算法面试经典100题,Android程序员最新职业规划
腾讯T4架构师,android面试基础
Learning and Exploration - function anti shake
MySQL must know and learn
腾讯安卓开发面试,android开发的基础知识
Recursive implementation of department tree
小微企业难做账?智能代账小工具快用起来
mod_ WSGI + pymssql path SQL server seat
AsyncHandler
Groovy基础语法整理
Groovy basic syntax collation
[calculating emotion and thought] floor sweeper, typist, information panic and Oppenheimer
颜色(color)转换为三刺激值(r/g/b)(干股)
Standardized QCI characteristics
Learning and Exploration - Seamless rotation map
(3) Web security | penetration testing | basic knowledge of network security construction, IIS website construction, EXE backdoor generation tool quasar, basic use of
Li Kou 101: symmetric binary tree
AsyncHandler
After solving 2961 user feedback, I made such a change