当前位置:网站首页>Is there a sudden failure on the line? How to make emergency diagnosis, troubleshooting and recovery
Is there a sudden failure on the line? How to make emergency diagnosis, troubleshooting and recovery
2022-07-05 04:22:00 【Alibaba cloud native】
author : Ten sleep
summary
Stability is greater than everything , Therefore, we need to have more effective ways to avoid online failures . Under the assumption that failure is inevitable , We need to be able to fix it quickly , Reduce online impact . Based on these ideas , We proposed 1-5-10 Fast recovery goal of , So-called 1-5-10 Our goal is to enable us to solve online problems 1 Minute discovery ,5 Minute positioning ,10 Minute repair . Next, we will introduce some information about fault recovery on Alibaba cloud 、 Some best practices for diagnostics .
This article excerpts from Microservice governance technology white paper 3.9 section
Follow the Alibaba cloud native official account , Backstage reply key words 【 Technical white paper 】 You can download it. !
1 Minute discovery
monitor
The role of monitoring can be summarized as : Find problems in the application , And alarm the problem in time to the technicians for treatment . Monitoring types can be divided into system problem monitoring and business problem monitoring , System problems : Common hardware and software related problems , For example, program exception , Memory fullGC etc. , Because there are no business characteristics , The monitoring strategy can be applied to various applications . The business problem : Problems defined under specific business scenarios , For example, there are no coupons for goods , Over issuance of rights and interests , The monitoring strategy needs to be customized according to the business characteristics .
Alibaba cloud real-time application monitoring service ARMS It can automatically discover and monitor common Web The framework and RPC frame , And count the call volume of the interface 、 response time 、 Number of errors and other indicators . At the same time, we can further obtain the slow speed of the interface SQL、MQ Stack analysis report or exception classification report , right and wrong 、 Slow and other common problems for more detailed analysis .
ARMS It also provides the ability of business monitoring , In a non intrusive way , Visually define business requests , Provide rich performance indicators and diagnostic capabilities that fit the business . A new way to measure application performance and stability from a business perspective , Carry out full link monitoring of key business transactions . Business monitoring tracks and collects business information in applications , Show business level indicators in real time , For example, the response time of the business 、 Times and error rate , It solves the problem that there is no mapping relationship between applications and business performance .
There are three requirements for monitoring . real time : The discovery and early warning of problems are required to be real-time , Reduce the time delay of problem generation and discovery ; accuracy : Monitoring and early warning are required to be accurate , Including the definition of monitoring problems , For the warning threshold , Early warning level , Allocation of responsible persons , Avoid false positives ; comprehensive : Early warning information is required to be comprehensive , It can help troubleshoot and solve problems .
“ No matter what happens to the application ,ARMS Can clearly show the problem in which line of code .ARMS It's very important for us , Greatly shorten the time to repair the fault , Significantly improved user experience . Since using ARMS, We can find and fix problems in time , Never be bothered by user complaints .” —— CR Vanguard
The alarm
When the monitoring finds something wrong , It is necessary to alarm the problem to technicians in time through different levels of alarms .ARMS Alarm management can improve the operation and maintenance efficiency of the system from the following points .
Integrated post event management is more efficient .
- Alarm management supports one click integration of common Alibaba cloud monitoring tools by default , And support more monitoring tools for manual access , Convenient for unified maintenance .
- Stable event access module , Can provide 7x24 Hours of uninterrupted event processing service .
- Low latency can be guaranteed when processing massive event data .
Notify the contact person of the alarm timely and accurately .
- Configure notification rules , Send alarm notification after merging events , Reduce the notification fatigue of operation and maintenance personnel .
- Select the email according to the urgency of the alarm 、 SMS 、 Telephone 、 Different notification methods such as nailing , To remind the contact to deal with the alarm .
- The alarm that has not been processed for a long time is reminded many times through the upgrade notification , Ensure that the alarm is solved in time .
Help you manage alarms quickly and easily .
- The contact person can handle the alarm at any time by nailing .
- Use the general alarm format , Contacts can better analyze alarms .
- Multiple contacts are processed by nailing .
Statistics of alarm data , Real time analysis and processing , Improve alarm processing efficiency .
5 Minutes to locate the fault
Service instance isolation and diagnosis
In the online microservice scenario , When exceptions occur in some instances of the service provider , One side , Service consumers need to avoid accessing abnormal instances , On the other hand , It is necessary to keep the abnormal scene , Facilitate subsequent troubleshooting . For another consideration , We all know Dump Memory will affect the performance of our applications to a certain extent , It may affect our online business , Can we be in Dump Isolate the traffic from the instance before memory .MSE The service instance isolation and diagnosis function of the governance center can help us isolate the traffic of abnormal instances , On the one hand, it supports the isolation of traffic from microservices , On the other hand, support will come from K8s Service To isolate the flow , It can completely isolate the business traffic in the production environment , Then we can apply real-time monitoring services in combination with Alibaba cloud ARMS Memory snapshot generation capability provided , Generate online environment memory snapshots of abnormal instances in time , Help us with subsequent problem analysis and diagnosis . The service instance isolation and diagnosis function can help us deal with sudden online accidents ( Such as memory leakage ), Improve the overall stability of the microservice system .
- practice
We can do it in MSE The service governance console sees the online instance list .
Select a specific exception instance , Conduct service offline operation , Remove the instance from the registry , At the same time, if we configure MSE Ready to check probe provided , Will also come from K8s Service To isolate the flow , We can check whether the corresponding instance is successfully offline in the event center .
After offline operation , We can go through MSE The second level node monitoring provided can see whether there is still traffic . Wait until the flow stops completely , Alibaba cloud application monitoring service ARMS Provides the function of creating memory snapshots , Create a memory snapshot for the exception instance , For further troubleshooting .
Click the create memory snapshot button , Create memory snapshots .
Click Save to create a snapshot task .
After clicking save, we see 172.16.0.200 This instance already has a snapshot
Further according to the prompt of the console , We will respectively Core File Dump 、 analysis 、 see
Click to view , Automatically jump to Grace Analyzed pages , We can see the overview of memory analysis , Memory leak report 、 Class loader and a series of information . View the details of memory usage through detailed memory analysis data , So as to further troubleshoot memory problems such as memory leakage and memory waste .
Last mention , We can restore the isolated traffic through the service online .
Arthas The diagnosis
Arthas It's a diagnosis Java The sharp weapon of online problems in the field , Using bytecode enhancement technology , Can be restarted without JVM In the case of process , Check the operation of the program .
- JVM overview
JVM The overview supports viewing applications JVM Related information , Include JVM Memory 、 Operating system information 、 Variable information, etc , Help us understand JVM The general situation of .
1、JVM Memory :JVM Memory related information , Including heap memory usage 、 Non heap memory usage 、GC Situation, etc .
2、 Operating system information : Information about the operating system , Including average load , Operating system name 、 Operating system version 、Java Version, etc .
3、 Variable information : Information about variables , Including system variables and environment variables .
- Thread time analysis
Thread time analysis supports displaying all threads of the application and viewing the stack information of threads , Help us quickly locate time-consuming threads .
1、 The thread time analysis tab will get the current in real time JVM The thread time of the process , And aggregate similar threads . You can view the thread ID、CPU Usage and status .
2、 We can use the operation column on the right side of the target thread , Click view live stack .
- Method execution analysis
Method execution analysis supports capturing the time-consuming execution of a method 、 Enter the reference 、 Return value and other information and drill in , Help you quickly locate the root cause of slow calls , And scenarios such as problems that cannot be reproduced offline or logs are missing .
As shown in the figure below , The execution time of each internal method will be displayed in the source code in the form of comments .
- Object viewer
The object viewer is used to view the current state of some singleton objects , Used to troubleshoot abnormal application status , For example, application configuration 、 Black and white list 、 Member variables, etc .
- Real time Kanban
Real time Kanban is used to view the real-time status of key components used in the system , For example, check the usage of database connection pool 、HTTP Usage of connection pool , It is conducive to the troubleshooting of resource types .
As shown in the figure below, it is a Druid Real time status information of connection pool , Including basic configuration 、 Connection pool status 、 Execution time distribution, etc .
- Performance analysis
Performance analysis supports CPU Time consuming 、 Memory allocation and other objects are sampled for a certain time and the corresponding flame diagram is generated , Help you quickly locate the performance bottleneck of your application .
10 Minutes to recover
To remove from a crowd
In the microservices architecture , When some instances of the service provider's application are abnormal , When the service consumer cannot perceive it, it will affect the normal invocation of the service , And affect the service performance and even usability of consumers . The outlier instance removal function will detect the availability of application instances and make dynamic adjustments , To ensure that the service is successfully invoked , So as to improve the stability of the business and the quality of service .
Service breakdown and degradation
When applications encounter business peak , It is found that downstream service providers encounter performance bottlenecks , Even when it is about to affect the business . We can fuse some service consumers , Automatic fusing for persistent unstable calls , So as to improve the stability of the overall service . When the downstream services that the application depends on are unavailable , Cause loss of business flow . You can configure the service fusing capability , When the downstream service is abnormal , Service degradation enables traffic to flow on the caller “fail fast”, Effectively prevent avalanches .
At the peak of business , Some downstream service providers are experiencing performance bottlenecks , Even affect the business . We configure automatic fusing for some non critical service consumers , When the slow call ratio or error ratio within a period of time reaches a certain condition, the fuse will be automatically triggered , In the following period, the service call returns directly Mock Result , This can ensure that the caller is not dragged down by unstable services , It can also serve the unstable downstream “ Breathe ” Time for , At the same time, it can ensure the normal operation of the entire service link .
Other scenes , Service degradation can help us ensure some important services . Some non critical services are not very stable , We hope to demote these weak dependent service calls temporarily before important activities , Reserve resources for other core services , So as to ensure the smoothness of the overall business .
Outlier instance removal and service fusing 、 Service degradation is mainly reflected in two points :
1、 Done automatically : Service degradation is an operation and maintenance action , It needs to be configured through the console , And specify the corresponding service name to achieve the corresponding effect ; And outlier instances are removed 、 Service fusing capability is to actively detect the survival of upstream nodes or successful exceptions of service calls 、 Slow call, etc , Do automatic isolation or fusing operation on this link , Guarantee the quality of service .
2、 Remove grain size : The service degradation is ( service + node IP), With Dubbo For example , A process will publish with the service interface name (Interface) Microservice named for service , If the degradation of this service is triggered , This service of this node will not be called next time , But it still calls other services . But the whole node will not try to call when the outlier instance is removed .
Flow control 、 Capacity expansion 、 restart 、 Roll back
- flow control : According to the flow 、 Number of concurrent threads 、 Response time and other indicators , Adjust the random flow to a suitable shape , That is, flow shaping . Through flow control capability , Configure flow control rules for service interfaces , Let requests within the capacity range pass , Redundant requests rejected , Equivalent to the role of airbag . Layers of protection , stay Nginx/Ingress The gateway layer performs coarse-grained protection , At the microservice layer API、 Interface 、 Method 、 Parameter granularity control . Avoid applications being overwhelmed by transient traffic peaks , So as to guarantee the high availability of the application .
- Capacity expansion : Horizontal horizontal expansion improves cluster availability
- restart : Restart JVM process , So as to temporarily eliminate the problems accumulated during long-term operation, such as memory leakage
- Roll back : Eliminate problems introduced by changes
Based on the priority of the same availability zone, one click streaming
The characteristic of the same city is RT Generally, the delay is at the bottom (< 3ms within ), So by default , We can build a large LAN based on different available areas in the same city , Then distribute our application across multiple zones , In order to cope with the failure of a single availability zone, we can better control the influence surface of the failure .
MSE Service governance provides the ability of preferential routing with the computer room at the service framework level , If the target service is the same as your own zone , Then the traffic is preferentially routed to nodes in the same zone as the current one . When an availability zone is unavailable , We only need to cut the traffic at the gateway , Isolate the flow of the failed availability zone , That is, we can resume our business immediately .
tail
1-5-10 Fault fast recovery , fault 1 Minute response 、5 Minute positioning 、10 Minutes to recover ; Only by constantly failing to design 、 Drill based on fault emergency mode , Then when we really encounter an online fault, we can face the fault more calmly . We hope that the new generation of cloud native microservices can have more system self-healing capabilities , Microservice architecture can automatically sense the failure of external components , Automatically switch to the standby link , Really nip the fault in the cradle .
MSE First purchase of professional edition of registration configuration center 9 A discount ,MSE Full specification of cloud native gateway prepaid 9 A discount .
Click on here , Enjoy the discount !
边栏推荐
- How to remove installed elpa package
- 机器学习 --- 决策树
- Phpmailer reported an error: SMTP error: failed to connect to server: (0)
- [thingsboard] how to replace the homepage logo
- mysql的七种join连接查询
- Looking back on 2021, looking forward to 2022 | a year between CSDN and me
- How to solve the problem that easycvr changes the recording storage path and does not generate recording files?
- Clickpaas low code platform
- Alibaba cloud ECS uses cloudfs4oss to mount OSS
- Uni app change the default component style
猜你喜欢
小程序中实现文章的关注功能
How to solve the problem that easycvr changes the recording storage path and does not generate recording files?
电源管理总线 (PMBus)
【虚幻引擎UE】实现背景模糊下近景旋转操作物体的方法及踩坑记录
Mxnet imports various libcudarts * so、 libcuda*. So not found
Use threejs to create geometry and add materials, lights, shadows, animations, and axes
陇原战“疫“2021网络安全大赛 Web EasyJaba
How does the applet solve the rendering layer network layer error?
Sequence diagram of single sign on Certification Center
【FineBI】使用FineBI制作自定义地图过程
随机推荐
Sequence diagram of single sign on Certification Center
这是一个不确定的时代
FFmepg使用指南
Convert Boolean to integer value PHP - Convert Boolean to integer value PHP
行为感知系统
【虚幻引擎UE】实现背景模糊下近景旋转操作物体的方法及踩坑记录
mysql的七种join连接查询
如何实现实时音视频聊天功能
American 5g open ran suffered another major setback, and its attempt to counter China's 5g technology has failed
level18
A應用喚醒B應該快速方法
托管式服务网络:云原生时代的应用体系架构进化
Rome chain analysis
[popular science] basic knowledge of thermal design: heat dissipation analysis of 5g optical devices
The development of mobile IM based on TCP still needs to keep the heartbeat alive
A real day for Beijing programmers!!!!!
Introduction to RT thread kernel (5) -- memory management
美国5G Open RAN再遭重大挫败,抗衡中国5G技术的图谋已告失败
概率论与数理统计考试重点复习路线
A application wakes up B should be a fast method