CLS monitoring alarm: ensure high availability of online services in real time
2022-07-27 12:04:00 【Log Service CLS Assistant】
Author: kingszhang
Introduction
Log Service CLS is a one-stop log service platform provided by Tencent Cloud. It offers log collection, storage, retrieval, chart analysis, data processing, log delivery, monitoring and alarms, dashboard visualization, and other services, helping users handle scenarios such as business operations and maintenance, operations, and auditing.
The significance of observability
【Service availability】
For any online service, availability is an important quality indicator. Can users complete their tasks with the product? How efficiently? How do they feel about it? This is product quality as seen from the user's perspective. It is the core of product competitiveness and a comprehensive reflection of the product's reliability, maintainability, and maintenance supportability.
99% availability means the service may be unavailable for about 3.65 days a year; 99.9% means at most about 8.8 hours a year; and 99.99% means only about 52 minutes a year.
Many online services, especially data services, require availability between five nines and six nines, that is, at most about 5 minutes of unavailability a year. How can a team meet this huge challenge, and how can it use its technical stack skillfully to do so at low cost?
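The downtime figures above follow from simple arithmetic. The sketch below works them out for a 365-day year; the function name is only illustrative.

```python
# Allowed downtime per year for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the maximum unavailable minutes per year for a target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999, 99.9999):
        minutes = downtime_minutes_per_year(target)
        print(f"{target}% -> {minutes:.2f} min/year ({minutes / 60:.2f} h/year)")
```

Five nines works out to roughly 5.26 minutes a year, and six nines to about half a minute.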
【Troubleshooting microservices】
For many older online services, when a problem suddenly appears online, we can log in to the machine and check the logs. But this approach runs into the following thorny problems:
- If traffic is even slightly high, how do you deal with the volume of logs?
- If something goes wrong online, how do you locate the cause within one minute from the logs?
- If the service call chain is long and involves many services, how do you trace the whole link?
- And the hardest question of all: how do you know whether your service is currently healthy? How do you know whether a particular business module in your service is healthy?
Online services therefore urgently need the ability to monitor system status in real time, so that developers receive warnings when exceptions occur instead of waiting for feedback from the business side. When a system error occurs, this ability should help developers locate the problem faster and more accurately. In other words, the business needs to build observability.
A microservice system should achieve 3-5-10: when the system has a problem, locate the cause within 3 minutes, fix it within 5 minutes, and go live within 10 minutes. An observability system can proactively discover more than 99% of microservice problems and supports the long-term construction of high-quality, reliable systems.
Business log monitoring system
More and more users now choose to report all of their business logs, including the full-link traceId, to Tencent Cloud Log Service CLS. Based on the log service, they then build monitoring dashboards for each business and hook up alarm notifications. Through this system, developers can get a comprehensive view of the health of the business system.
Where to print logs?
Printing logs well is difficult. Developers must understand which logs must be printed and which can be omitted. The logs described here are very important and must be printed.
Taking hexagonal architecture as an example, print logs wherever possible at the entry and exit of every outbound and inbound adapter:
Another diagram shows this more clearly:
Why print these logs?
【Why print logs at the service entry】
- First, record every request. Once the logs are reported, system load can be monitored more accurately from the number of requests.
- Second, record every response. When a response is abnormal, the log system helps developers find the problem promptly instead of waiting for feedback from the business side.
- Third, once the service has a real problem, the entry logs let you reproduce the problem scenario in any environment (development, test, or production).
【Why print logs for third-party service dependencies】
- First, no third-party service can be fully trusted. Developers must design for failure and keep complete records of the requests sent to and responses received from third-party services, rather than relying on the third party's own logs.
- Second, when your service has a problem that really cannot be located from the logs (a very extreme case), the problem scenario needs to be reproduced promptly. If the third-party service returns a different response during reproduction, the reproduction will fail. But if the third-party response was recorded in advance, it can be used as mock data when reproducing the problem in the development environment.
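One way to record both the request and the response at an adapter boundary is a small decorator. The sketch below is a minimal illustration in Python, not the author's implementation; `log_call`, `query_stock`, and the service name `inventory-service` are hypothetical.

```python
import functools
import json
import logging
import time

logger = logging.getLogger("adapter")


def log_call(service_name):
    """Log the request, response, and elapsed time of the wrapped call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            record = {"service": service_name, "method": func.__name__,
                      "request": {"args": args, "kwargs": kwargs}}
            try:
                result = func(*args, **kwargs)
                record.update(status="ok", response=result)
                return result
            except Exception as exc:
                record.update(status="error", error=repr(exc))
                raise
            finally:
                # Runs on success and on failure, so every call is logged once.
                record["elapsed_ms"] = round((time.perf_counter() - start) * 1000, 3)
                logger.info(json.dumps(record, default=str))
        return wrapper
    return decorator


@log_call("inventory-service")  # hypothetical third-party dependency
def query_stock(sku):
    return {"sku": sku, "stock": 3}


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    query_stock("A1")
```

Because the recorded response is plain JSON, it can later be replayed as mock data when reproducing a problem in a development environment.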
What values need to be printed in the log?
Service-entry logs and third-party dependency logs need to include the response time, return status code, current operator, called method name, service name, caller IP, callee IP, line number, log level, full-link ID, service environment, and similar information. For ordinary logs, the main thing is to include the full-link ID.
One point deserves special attention: the unique full-link ID must be passed through transparently in every request. Once it is lost, it causes a lot of unnecessary trouble.
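As a rough illustration of carrying the full-link ID on every log line, the sketch below uses Python's standard `logging` and `contextvars` modules. The field names, the service name `order-service`, and the `handle_request` helper are assumptions made for the example, not a format required by CLS.

```python
import contextvars
import json
import logging
import uuid

# Holds the full-link ID for the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the trace ID."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "traceId": trace_id_var.get(),
            "service": "order-service",  # hypothetical service name
            "env": "test",               # hypothetical environment label
            "line": record.lineno,
            "message": record.getMessage(),
        })


def handle_request(incoming_trace_id=None):
    """Reuse the caller's trace ID if present; otherwise start a new one."""
    trace_id_var.set(incoming_trace_id or uuid.uuid4().hex)
    return trace_id_var.get()


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("order")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    handle_request("abc123")
    log.info("order created")
```

Reusing the incoming ID rather than always generating a new one is what keeps the ID intact across services; the same value would also need to be forwarded on every outgoing call.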
The reported log is shown below:
Tencent Cloud Log Service CLS capability demonstration
To meet the heavy monitoring demands of business logs, Tencent Cloud Log Service CLS offers product capabilities such as "tens of billions of logs, analysis in seconds" and "real-time alarms within one minute", providing a one-stop log service that handles operations and maintenance, operations, and other scenarios with ease.
Let's look at a demonstration of the core CLS functions.
Log retrieval function
CLS retrieval and analysis supports both Lucene and SQL syntax, and each field can be searched. Full-text search is also available. These are the most basic functions. Once logs are formatted, every field can be retrieved individually, which makes CLS one of the most flexible, powerful, and convenient retrieval tools currently available.
It also supports a rich set of arithmetic operators:
For details, see: Overview and Syntax Rules
Log analysis function
The most powerful feature of the log service is that search results can be analyzed with SQL statements, so CLS can also be used as an OLAP analysis engine.
The figure below shows the average time taken by each method over the last 15 minutes:
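To make the aggregation itself concrete, here is the same "average time per method over the last 15 minutes" computation sketched in plain Python over a list of log entries. This only illustrates the group-by logic and is not CLS SQL; the field names `time`, `method`, and `elapsed_ms` are assumed.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def average_time_by_method(logs, now, window=timedelta(minutes=15)):
    """Average elapsed_ms per method for logs within `window` before `now`."""
    totals = defaultdict(lambda: [0.0, 0])  # method -> [sum, count]
    for entry in logs:
        if now - window <= entry["time"] <= now:
            bucket = totals[entry["method"]]
            bucket[0] += entry["elapsed_ms"]
            bucket[1] += 1
    return {method: total / count for method, (total, count) in totals.items()}


if __name__ == "__main__":
    now = datetime(2022, 7, 27, 12, 0)
    sample = [
        {"time": now - timedelta(minutes=1), "method": "getOrder", "elapsed_ms": 100},
        {"time": now - timedelta(minutes=5), "method": "getOrder", "elapsed_ms": 300},
        {"time": now - timedelta(minutes=40), "method": "getOrder", "elapsed_ms": 900},
    ]
    print(average_time_by_method(sample, now))
```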
You can also switch between chart types and save the result to a monitoring dashboard:
Many analysis functions are supported:
For details, see: Search analysis
Log monitoring
As mentioned above, SQL statements can be used to configure charts, and these charts can be assembled into a dedicated dashboard, for example a dashboard with statistics on the receipt of some business data:
Building a dashboard is simple. It only takes SQL and a few functions, so anyone who can write SQL can configure one.
For each interface, we can build dashboards for successes and failures, error codes, QPS, and so on. You can also report Tomcat logs, tRPC framework logs, and even Nginx logs for analysis and metric dashboards.
Monitoring alarm
1. Configure alarms
Based on the log analysis results above, monitoring and alarms can be configured for a specific metric.
The figure below shows ERROR-level logs aggregated over the last 15 minutes; if the aggregate result is greater than 0, the alarm is triggered.
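The trigger condition described here (count ERROR logs in the last 15 minutes, alarm if the count exceeds 0) can be sketched as a small Python function. This illustrates the rule only and does not represent how CLS evaluates alarms internally; the `level` and `time` field names are assumed.

```python
from datetime import datetime, timedelta


def should_alarm(logs, now, window=timedelta(minutes=15), threshold=0):
    """Return (triggered, error_count) for ERROR logs inside the window."""
    error_count = sum(
        1 for entry in logs
        if entry["level"] == "ERROR" and now - window <= entry["time"] <= now
    )
    return error_count > threshold, error_count


if __name__ == "__main__":
    now = datetime(2022, 7, 27, 12, 0)
    sample = [
        {"level": "ERROR", "time": now - timedelta(minutes=10)},
        {"level": "INFO", "time": now - timedelta(minutes=5)},
    ]
    print(should_alarm(sample, now))
```

Raising `threshold` or shortening `window` gives stricter or looser variants of the same rule, such as "more than 15 errors within one minute".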
The following figure shows what the alarm looks like:
Here, the alarm policy is the name of the policy, and the trigger condition is that the interface reports more than 15 errors within one minute. The current data is the current number of errors.
The multidimensional analysis below it shows the specific error cause and the full-link ID, so the error can be checked quickly. You can also click the query-log link to view the error details.
As shown in the figure, click through to view the detailed stack information and full-link logs and locate the problem quickly.
2. Examples of business scenarios
Scenario one:
A user's service suddenly raised an alarm: the number of failures within one minute exceeded 40. The developer clicked the alarm link and ran a quick analysis with a SQL statement, which showed that the errors were concentrated on a single machine IP. Checking that machine revealed a fault on its host, so the machine was quickly taken out of service and the alarm cleared.
As shown in the figure below:
It took only 1 minute to locate the cause and resolve the problem. Without a log service, checking machines one by one would be like looking for a needle in a haystack, and a whole morning might not be enough to find the specific cause.
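The analysis in scenario one amounts to grouping error logs by machine IP and looking for an outlier. A minimal Python sketch of that grouping follows; the field name `host_ip` and the sample addresses are made up for illustration.

```python
from collections import Counter


def errors_by_host(logs):
    """Count ERROR logs per host IP, most frequent first."""
    counts = Counter(e["host_ip"] for e in logs if e["level"] == "ERROR")
    return counts.most_common()


if __name__ == "__main__":
    sample = (
        [{"level": "ERROR", "host_ip": "10.0.0.7"}] * 40
        + [{"level": "ERROR", "host_ip": "10.0.0.8"}]
        + [{"level": "INFO", "host_ip": "10.0.0.9"}] * 100
    )
    print(errors_by_host(sample))
```

A single IP accounting for nearly all errors points at that machine rather than at the service code.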
Scenario two:
A user's service suddenly triggered a phone alarm. A quick analysis of the full-link logs showed that the cause was a performance bottleneck in the OLAP database, which needed to be scaled out. After scaling, the business returned to normal. Locating the cause and resolving the problem took less than 10 minutes.
Using CLS in full-link scenarios
In CLS, a single traceId is enough to query all related logs at once.
The figure below shows a full-link log. It is generated by multiple services and finally converges on the CLS log platform. CLS full-link logs are mainly used for log viewing and aggregation.
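Conceptually, a full-link view is the set of logs that share one traceId, ordered by time. The Python sketch below shows that idea over logs from several services; the field names `traceId`, `time`, and `service` are assumptions for the example.

```python
def trace_timeline(logs, trace_id):
    """Return the logs belonging to one trace, ordered by timestamp."""
    return sorted(
        (entry for entry in logs if entry["traceId"] == trace_id),
        key=lambda entry: entry["time"],
    )


if __name__ == "__main__":
    sample = [
        {"traceId": "t1", "time": 3, "service": "payment", "message": "charged"},
        {"traceId": "t2", "time": 2, "service": "gateway", "message": "received"},
        {"traceId": "t1", "time": 1, "service": "gateway", "message": "received"},
        {"traceId": "t1", "time": 2, "service": "order", "message": "created"},
    ]
    for entry in trace_timeline(sample, "t1"):
        print(entry["time"], entry["service"], entry["message"])
```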
Conclusion
Where cost allows, monitoring beyond the business level can be added later: JVM monitoring, GC monitoring, memory monitoring, and tRPC thread pool monitoring. At the container layer, memory, CPU, and network I/O monitoring can be added as well.
Business monitoring and alarms cover most scenarios. The rest are useful only in very special situations; for example, thread monitoring information may only be needed during full-link or performance stress testing.
For distributed services, many points need attention to guarantee service quality (see the figure below), but an ordinary business only needs to do a few of them well.
Many quality problems can be found and solved during the development and release stages. The key testing points for development are attached below:
After a requirement goes live, set up monitoring and alarms for the new interfaces and services. For data-reporting services, if RT rises or the failure rate increases after release, investigate promptly, roll back if necessary, and then find the cause.
Going forward, CLS will continue to polish the details of the log service and keep improving its monitoring and alarm capabilities, helping users make major progress in log operations and maintenance, operations, compliance auditing, and other areas, safeguarding the high availability of online services and benefiting more operations and development teams.
That concludes this look at applying the CLS monitoring and alarm features in practice. Thank you for reading!
Join the "Tencent Cloud Log Service CLS Technology Exchange Group" to get the latest news and more information!