当前位置:网站首页>Troubleshooting ideas that can solve 80% of faults
Troubleshooting ideas that can solve 80% of faults
2022-07-02 18:27:00 【androidstarjack】
Click on the top “ Terminal R & D department ”
Set to “ Star standard ”, Master more database knowledge with you
Explaining the event 、 Before troubleshooting , Let's start with a fault scenario ( Take the call center system as an example ):
The business personnel reported that the call center system was running slowly , Part of the phones in the self-service language link system processing timeout , Traffic transfer to manual seat , There is a wire burst in the manual seat .
The operation and maintenance personnel are busy , Check resource usage 、 Check whether the service is normal 、 Check whether the log reports an error 、 Check whether the trading volume still has …… Time is unconsciously typing on the keyboard 、 Tap the keyboard 、 Hit the keyboard , But the reason has not been located .
The manager came to know the situation :“ Has the system recovered ?”、“ What is the effect of the failure ?”、“ Is the transaction interrupted ?”……
The operation and maintenance personnel quickly hit the keyboard , Write SQL, Look at the trading volume ; Tap the keyboard , Write orders , Look at system resources 、 situation ……
Final , The problem is located because one of the functions does not control the number of returns , Causing a memory leak .
For this fault , The business hopes that the operation and maintenance can solve the fault recovery faster , The manager wants to develop and optimize the call center fault handling process , I did the following things :
Time of priority fault handling process :” Work that can be done with the mouse , Don't use the keyboard “
Find the fault in advance , Strengthen monitoring :“ Technology finds problems before business , Monitoring is more than just an alarm , Also assist in fault location ”
Improve the fault emergency plan :“ The emergency plan is up-to-date 、 accurate 、 Simple and clear ”
Long term goals : Fault self healing :“ Operation automation that can cure , Let the machine do what the machine can do ”
The following will start with the common troubleshooting methods , Then start from the preparation before the fault ( Perfect monitoring 、 Formulate emergency plans, etc ) To solve the problems raised by the manager , And put forward the idea of solving the fault in the future .
One 、 Common methods
1、 Determine the fault phenomenon and preliminarily determine the influence of the problem
Before dealing with the fault , The operation and maintenance personnel should first know the fault phenomenon , The fault phenomenon directly determines the formulation of fault emergency plan , This depends on the fact that the operation and maintenance personnel need to be familiar with the overall function of the application system .
After confirming the fault phenomenon , To guide the operation and maintenance personnel to initially judge the impact of the fault .
2、 Emergency recovery
The most basic indicator of operation and maintenance is system availability , The timeliness of emergency recovery is the key index of system availability .
With the judgment of the above fault phenomena and effects , The emergency operation plan can be formulated , There are a lot of fault emergency , such as :
The overall service performance is degraded or abnormal , Consider restarting the service ;
The application has changed , You can consider whether you need to change back ;
Insufficient resources , Emergency expansion can be considered ;
Application performance problems , You can consider adjusting the application parameters 、 Log parameters ;
Database busy , Consider database snapshot analysis , Optimize SQL;
Error in application function design , Consider the emergency shutdown function menu ;
There are still a lot of it ……
in addition , What needs to be added is , Before failure emergency , If conditions permit, you need to save the current system scenario , For example, before killing the process , You can catch one first CORE File or database snapshot file .
3、 Quickly locate the cause of the fault
1) Whether it is accidental 、 Is it reproducible
Whether the fault phenomenon can be reproduced , It's important to solve problems quickly , Can reproduce that there will always be ways or tools to help us locate the cause of the problem , Moreover, the faults that can be reproduced may often be service exceptions 、 Problems caused by changes and other work .
But if the fault is accidental , There is a very small probability , It is more difficult to check , This depends on whether the system has enough on-site information during the failure to determine whether the cause can always be located .
Whether relevant changes have been made .
Most failures are caused by changes , After determining the fault phenomenon , If there are changes that should be made , It is helpful to analyze whether it is caused by change from the perspective of change , Then quickly locate the fault and prepare emergency plans such as failback .
2) Whether relevant changes have been made
Most failures are caused by changes , After determining the fault phenomenon , If there are changes that should be made , It is helpful to analyze whether it is caused by change from the perspective of change , Then quickly locate the fault and prepare emergency plans such as failback .
3) Whether the scope can be reduced
On the one hand, the application system advocates decoupling , A trade fair flows through different application systems and modules ; On the other hand , The failure may be due to the application 、 Systems software 、 Hardware 、 Problems in network and other links . Comprehensive troubleshooting should be avoided when troubleshooting the causes of faults , It is suggested to narrow down the scope of the problem to a certain procedure before coordinating the related team for troubleshooting .
Related parties cooperate to analyze problems
With the first (3) At the same time, it is necessary to avoid that all related teams have no clue at the same time , The leader needs to be open to ask related parties to cooperate in positioning after narrowing the scope , For related parties, they need to have a positive working attitude .
4) Are there enough logs
Locating fault cause , The most common method is to analyze the application log , The operation and maintenance personnel not only need to know which service process the business function corresponds to , You also need to know which application logs correspond to this service process , It also has the ability to judge some simple application log abnormal errors .
5) Is there a core or dump Wait for the documents
The system site during failure is very important , It is suggested to leave the documents of the system site if possible before the failure emergency , such as CORE\DUMP, or TRACE Collect information, etc , Back up some logs that may be overwritten .
The above are common methods for general faults , In case of major failure or failure handled by multiple parties , Often, small-scale troubleshooting is not conducive to rapid resolution , The process of emergency treatment needs to be started , It is suggested to consider the following communication :
Gather relevant personnel
Describe the fault status
Explain the normal application logic flow
State changes
Investigation progress , Display information
Leading decision-making
Two 、 Perfect monitoring
1、 Improve the monitoring and visualization
A perfect monitoring strategy needs a unified visual operation interface , After formulating a sound monitoring strategy , Fault handling personnel need to be able to quickly see the corresponding operation data , such as : Can see the trend over time 、 Data performance during failure 、 Performance analysis and other data , And these data can formulate strategies in advance and directly launch the analysis results to the fault handling personnel , This greatly improves the fault processing efficiency , Take the call center system as an example , The following real-time transaction data should be configured in advance , For fault location :
Transaction performance data : Average transaction time 、 The transaction time of the internal module of the system is long (IVR The transaction takes 、 Interface bus transactions take time )、 Related system transactions take time ( Core transactions take time 、 Transaction time of work order system, etc )
Important transaction indicator data : Trading volume 、IVR Trading volume 、 Traffic 、 Seat call rate 、 Number of core transactions 、 Transaction volume of work order and other systems
Transaction exception data : Deal success rate 、 Failure rate 、 The maximum number of error codes is
Analyze transaction data by server : Press server Count the number of transactions processed by each service , The transaction always takes time
With the above transaction data , And make statistics according to a certain frequency through monitoring , Operation and maintenance personnel in case of failure , You can see when the fault starts by clicking the mouse , Is there a problem in the system or in the associated system , Which is the most prominent deal , Whether the transaction volume of each server is balanced .
2、 Improve from the aspect of monitoring
The most basic work of monitoring is to realize the load balancing equipment 、 Network devices 、 The server 、 The storage device 、 Safety equipment 、 database 、 Middleware and application software, etc IT Comprehensive monitoring and management of resources . In the monitoring of application software , Not just a service process 、 Port monitoring , There needs to be business 、 Monitoring at the transaction level .
Comprehensive application monitoring can give early warning of faults , And save the data affecting the application running environment , To shorten the troubleshooting time .
3、 Improve from monitoring alarm
A perfect monitoring strategy requires clear monitoring alarm prompt , The personnel on duty can make a simple problem location and emergency treatment plan according to the monitoring alarm . For example, the following monitoring SMS :
22 when ,【 Financial management application system 】 in 【 application server LC_APPsvrA 10.2.111.111】 Of 【 Front application module 】 appear 【 Application port :9080】 non-existent , This port is used for 【 Provide financial application processing ( Load balancing deployment )】, The reason may be 【SERVER1 Service stopped abnormally 】, The monitoring system has carried out the following emergency treatment 【 Auto execute port process startup 】, The urgency of the incident 【 high 】.
The administrator can see which system through the SMS content 、 Which application 、 What's wrong with which module , What might be the reason , What's the impact on the business , Whether it needs to be handled immediately ( For example, if this alert appears in the early morning, can it be delayed to the next day ) Etc .
4、 Improve from monitoring and analysis
A perfect monitoring strategy not only needs real-time data alarm , There should also be an analysis alarm for summarizing data , The importance of alarms for real-time data analysis goes without saying , For the data collected and analyzed, potential risks can be found , At the same time, it also provides help for the analysis of difficult and miscellaneous diseases .
5、 Improve the monitoring initiative
Monitoring is more than just an alarm , It can do more , As long as we find a way to give it the rules to actively solve events , It has the ability to handle faults for Administrators .
3、 ... and 、 Emergency plan
It is necessary to make a fault emergency plan in advance , But in the process of daily work, our emergency plan encounters some problems :
The emergency plan lacks continuous maintenance , Lack of drills , Information is not timely 、 inaccurate ;
The emergency plan is too big and comprehensive , It is not conducive to reading and using ;
The form of emergency plan is greater than the actual effect , The program is not targeted ;
Focus only on the content of the emergency plan , But they didn't pay attention to the understanding of the operation and maintenance personnel on the scheme ;
For the above common problems , The emergency plan needs to do the following :
1、 The content is concise
Many people may think that there are various forms of failure , Therefore, the emergency plan needs to involve all aspects . But in the actual fault handling process , We can find that our emergency measures often reuse several common steps , So I think the emergency plan should be focused , If an emergency plan can deal with ordinary faults 80% Scene , Then this emergency manual should be qualified . Too much pursuit of content that affects all aspects of the application system , This will lead to poor readability of the scheme , Finally change a document to be checked . The following is what I think the application system emergency plan should have :
1) The system level
Know the role of the current application system in the whole transaction , When there are problems in the current system or problems in the upstream and downstream , You can know how to cooperate with upstream and downstream analysis of problems , such as : How the upstream and downstream systems communicate , Whether the communication has a unique keyword, etc .
in addition , Some basic emergency operations are also involved at the system level , Such as capacity 、 System and network parameter adjustment, etc .
2) Service level
Can know what business this service affects , Logs involved in the service 、 Program 、 Where is the configuration file , How to check whether the service is normal , How to restart the service , How to adjust application level parameters, etc .
3) Transaction level
Can know how to find out that there is a problem with a certain transaction or a certain type of transaction , It's a large area 、 Local , Or an occasional problem , Data can be used to illustrate the impact of transactions , Can locate the information of transaction error report . The most commonly used method here is database query or the use of tools .
Know how to check whether the most important transaction is normal , Emergency response plan for important scheduled tasks , For example, opening 、 Change day 、 Time requirements for reconciliation and emergency measures .
4) Use of auxiliary tools
occasionally , Some tools or automated tools are needed to assist in analysis and emergency response , At this time, we need to have the method of how to use auxiliary tools .
5) Communication plan
Address book involved in communication scheme , Including upstream and downstream systems 、 Third party units 、 Business departments and other channels .
6) Other
Above 5 How complete the content is , I believe this emergency manual can solve 80% Fault recovery work .
2、 Emergency plan is an ongoing work
With an emergency plan , How to make the operation and maintenance personnel update continuously is the difficulty . I think we should solve this difficulty , The operation and maintenance personnel need to use this manual frequently first . If a manual has no scenes to use , Then managers need to create opportunities for operation and maintenance personnel to use this manual , Such as emergency drill .
3、 Pay attention to the understanding of operation and maintenance personnel on key application information
The first two points focus on the manual , Finally, I think it is necessary to pay attention to the people who use this manual . Some operation and maintenance personnel think that the application operation and maintenance personnel do not have the ability to understand the content of the application system itself very thoroughly , Therefore, the position of application operation and maintenance personnel in the process of fault handling is very embarrassing , The operation and maintenance personnel have the right to operate , But I don't know what to do .
Regarding this , I agree that application operation and maintenance personnel do not need to master the business functions of the application system , But I think for the application system itself, the application operation and maintenance personnel need to have the following basic abilities :
Know what the application system does , What is the basic business ;
Know the application architecture deployment 、 Logical relationship between upstream and downstream systems ;
Know the role of services under the application 、 port 、 Service level emergency treatment , How to find and simply locate data information such as logs ;
Know the important time points and tasks of the application system , For example, opening 、 Shut down 、 Change day 、 The time point of scheduled tasks and how to judge whether these tasks are correct ;
Know the process of the most important transactions ;
Know the common database table structure , And can use .
Four 、 Intelligent event processing
The processing method is shown in the figure below ( Detailed intellectualization involves monitoring 、 Rules engine 、 Configuration tool 、CMDB、 Application configuration library and other modules work together ).
Author programmer interview
Source: Website :https://blog.csdn.net/Dou_Hua_Hua/article/details/108829245
Write it at the end
You can leave a message below to discuss what you don't understand , You can also ask me by private message. I will reply after seeing it . Finally, if the article is helpful to you, please remember to give me a like , Pay attention and don't get lost
@ Terminal R & D department
Fresh dry goods are shared every day !
reply 【idea Activate 】 You can get idea How to activate
reply 【Java】 obtain java Relevant video tutorials and materials
reply 【SpringCloud】 obtain SpringCloud Many relevant learning materials
reply 【python】 Get the full set 0 Basics Python Knowledge Manual
reply 【2020】 obtain 2020java Related interview questions tutorial
reply 【 Add group 】 You can join the technical exchange group related to the terminal R & D department
Read more
use Spring Of BeanUtils front , I suggest you understand these pits first !
lazy-mock , A lazy tool for generating backend simulation data
In Huawei Hongmeng OS Try some fresh food , My first one “hello world”, take off !
The byte is bouncing :i++ Is it thread safe ?
One SQL Accidents caused by , Colleagues are fired directly !!
Too much ! Check Alibaba cloud ECS Of CPU Incredibly reach 100%
a vue Write powerful swagger-ui, A little show ( Open source address attached )
Believe in yourself , Nothing is impossible , Only unexpected, not only technology is obtained here !
If you like, just give me “ Looking at ”
边栏推荐
- [Oracle final review] addition, deletion and modification of tablespaces, tables, constraints, indexes and views
- A4988与42步进电机
- 如何开启IDEA的Run Dashboard功能
- Wasserstein slim gain with clipping penalty (wsgain-cp) introduction and code implementation -- missing data filling based on generating countermeasure network
- Explain kubernetes network model in detail
- Installation tutorial and simple call of Matplotlib
- 1288_FreeRTOS中vTaskResume()接口以及中断安全版本接口实现分析
- Editor编辑器扩展在Scene View添加按钮和logo
- Embedded development board ~ description
- Please, stop painting star! This has nothing to do with patriotism!
猜你喜欢
[Northwestern Polytechnic University] information sharing of the first and second postgraduate examinations
求求你们,别再刷 Star 了!这跟“爱国”没关系!
ESP32-C3入门教程 问题篇⑩——error: implicit declaration of function ‘esp_blufi_close‘;
Steamos 3.3 beta release, steam deck Chinese keyboard finally came
切换变换的时候记得使用三元表达式
再放寬!這些應届生,可直接落戶上海
Qt官方示例:Qt Quick Controls - Gallery
[golang | grpc] generate certificates using OpenSSL
Leetcode 面试题 16.15. 珠玑妙算
Wechat applet video sharing platform system graduation design completion (6) opening defense ppt
随机推荐
巴比特 | 元宇宙每日必读:一千块就能买一个虚拟主播?这是小企业的直播福音还是在“割韭菜”?...
MySQL installation and configuration
matplotlib的安装教程以及简单调用
Aptos教程-参与官方激励测试网(AIT2 激励测试网)
[games101] operation 4 B é zier curve
977. Square of ordered array
719. 找出第 K 小的数对距离
1.5.1版本官方docker镜像运行容器,能设置使用 mysql 8驱动吗?
vimium映射鍵
UE4 用spline画正圆
Is Guojin securities a state-owned enterprise? Is it safe to open an account in Guojin securities?
Esp32-c3 introductory tutorial question ⑩ - error: implicit declaration of function 'ESP_ blufi_ close‘;
好玩的免费GM游戏整理汇总
自定义一个loading指令
Paddlepaddle 28 build an automatic coder based on convolution
Customize a loading instruction
A4988 and 42 stepper motors
Architecture design - ID generator "suggestions collection"
Unity learning shader notes [82] black and white processing of enhanced single channel color rendering
Leetcode 面试题 16.15. 珠玑妙算