Operation and maintenance specification: process template for online fault handling
2022-06-21 23:15:00 【Brother Xing plays with the clouds】
This template covers the handling process and the documentation to keep when an accident occurs.
Accident handling process
Basic principle: in every measure and action taken during troubleshooting, business recovery is the highest priority.
Process mechanism
- After a fault is found, the on-call SRE or operations engineer acts as Fault commander and has the right to call in the relevant business developers or other necessary resources and quickly organize an accident handling team.
- If the problem and the recovery procedure are clear, the Fault commander remains the SRE or operations engineer, with no handover. That person directs everyone on the specific tasks, giving priority to business recovery.
- If the problem is difficult and the impact is large, the SRE can ask a higher-level supervisor to step in, such as the SRE supervisor or a director. The general principle is that whoever's business is most affected leads the organization. The SRE then hands the Fault commander responsibility over to that supervisor. If the whole site is affected, the technology VP or CTO can take on the Fault commander duty if necessary, or authorize a director to do so.
- After the problem is solved, functional verification is required.
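The handover rule in the list above can be sketched as a small decision function. This is only an illustration: the `Impact` levels, the `pick_commander` name, and the role strings are assumptions made for the example and are not defined by the template.

```python
from enum import Enum


class Impact(Enum):
    # Hypothetical impact levels. The template only distinguishes a clear
    # problem, a difficult problem with large impact, and whole-site impact.
    CLEAR = 1
    LARGE = 2
    WHOLE_SITE = 3


def pick_commander(impact: Impact) -> str:
    """Return who holds the Fault commander role for a given impact."""
    if impact is Impact.CLEAR:
        # Problem and recovery steps are clear: no handover.
        return "on-call SRE"
    if impact is Impact.LARGE:
        # Hand over to the supervisor whose business is most affected.
        return "supervisor of the most affected business"
    # Whole-site impact: VP or CTO, or a director they authorize.
    return "technology VP or CTO (or an authorized director)"


print(pick_commander(Impact.LARGE))
```

Keeping the rule in one place makes it easy to review before an accident rather than during one.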
Detailed flow chart
```sequence
On-call operations->Fault: Find the fault
On-call operations->On-call operations: Preliminary analysis of the fault cause
On-call operations->Accident handling team: Gather business developers or other necessary resources
Accident handling team->Accident handling team: Accident feedback (every 10-15 minutes)
Accident handling team->Accident handling: Accident investigation
On-call operations-->Senior executive: Escalate the accident (difficult problem, large impact)
Senior executive-->Accident handling team: Take full control and agree on the next step
Accident handling->Accident handling: Check recent releases
Accident handling->Accident handling: Check services and infrastructure
Accident handling->Accident handling: Solve the problem
Accident handling->Accident handling team: Investigation record
Fault->Accident recovery: Perform recovery verification
Accident recovery->Accident handling team: Notify recovery results
On-call operations->Post event summary: Organize the fault recovery meeting
Note right of Post event summary: Summarize the causes and solve the problem
Post event summary->Accident handling team: Output meeting summary and fault report
```
Accident business phenomenon
Record who reported which problem and at what time, in as much detail as possible, for example the device ID and user ID.
Frequency of accidents
Intermittent or always reproducible.
Accident recurrence method
Steps that let everyone reproduce the problem.
Accident time flow record
Record, as a timeline of events, the operations performed before and during the accident.
Note: times should be recorded as precisely as possible.
Recorder: (to be recorded by a designated person)
| Time | Event | Operator | Remarks |
|---|---|---|---|
| 2021/09/28 12:20:20 | Increased LB bandwidth from 10 Mb to 20 Mb | | |
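A designated recorder could keep the timeline in a small structure and render it in the same column order as the table above. The following is a minimal sketch; the `TimelineEntry` class and its `to_row` method are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimelineEntry:
    """One row of the accident timeline (hypothetical helper)."""
    time: datetime
    event: str
    operator: str = ""
    remarks: str = ""

    def to_row(self) -> str:
        """Render the entry as one pipe-separated table row."""
        stamp = self.time.strftime("%Y/%m/%d %H:%M:%S")
        return f"| {stamp} | {self.event} | {self.operator} | {self.remarks} |"


entry = TimelineEntry(
    datetime(2021, 9, 28, 12, 20, 20),
    "Increased LB bandwidth from 10 Mb to 20 Mb",
)
print(entry.to_row())
```

Formatting the timestamp in code keeps every row at second-level precision regardless of who is recording.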
Accident handling team
The accident responders form an accident group to make communication easier.
Set up a dedicated emergency group and add the key roles for the affected product to it. When a fault occurs, report it to the group as soon as possible.
Accident feedback
As a rule, each team gives feedback every 10 to 15 minutes, covering current progress and the next action. If an operation is needed in the meantime, announce it in advance, including its impact on the business and the system. It is executed only after the Fault commander makes the decision, to avoid mistakes made in haste. No progress is also progress and must be reported just as promptly.
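The 10 to 15 minute feedback rule can be checked mechanically. The sketch below assumes the commander keeps the time of each team's last update; `FEEDBACK_INTERVAL`, `overdue_teams`, and the team names are illustrative assumptions, not part of the template.

```python
from datetime import datetime, timedelta

# Upper bound of the 10-15 minute feedback window.
FEEDBACK_INTERVAL = timedelta(minutes=15)


def overdue_teams(last_feedback: dict[str, datetime], now: datetime) -> list[str]:
    """Return the teams whose last feedback is older than the interval."""
    return sorted(
        team
        for team, seen in last_feedback.items()
        if now - seen > FEEDBACK_INTERVAL
    )


now = datetime(2021, 9, 28, 12, 40)
last = {
    "backend": datetime(2021, 9, 28, 12, 35),         # 5 minutes ago
    "infrastructure": datetime(2021, 9, 28, 12, 20),  # 20 minutes ago
}
print(overdue_teams(last, now))
```

A team that reports "no progress" still updates its timestamp, which matches the rule that no progress is also progress.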
Accident investigation
Recently released information
This can include the commit ID of the most recently released system, the release time, the people involved, and so on.
Test feedback
Feedback from the testers during troubleshooting, to help developers locate the problem.
| Time | Test case | Result | Recorder | Remarks |
|---|---|---|---|---|
| 9/28 11 p.m. | Log in on the app | Success | Zhang San | |
| 9/28 11 p.m. | Log in on the device | Failure | Li Si | |
Service situation
Each service in the team should have a corresponding owner. When an online fault occurs, every owner checks the services they are responsible for. The inspection records must retain evidence.
| Time | Service name | Inspection contents and results | Current state | Examiner | Remarks |
|---|---|---|---|---|---|
| 9/28 10:30 | echo | values configuration is correct; application version: v2.1; CPU and memory year-on-year and month-on-month comparisons are normal; application configuration is correct; ERROR level increased suddenly from 7:00 on 9/28. | | Zhang San | |
Infrastructure
The infrastructure team is responsible for checking the infrastructure .
| Time | Component | Inspection content | Current state | Examiner | Remarks |
|---|---|---|---|---|---|
| 9/28 10:00 | LB | Bandwidth, packet flow rate | | | |
| 9/28 10:00 | NAT | | | | |
| 9/28 10:00 | Redis | | | | |
| 9/28 10:00 | PostgreSQL | | | | |
| 9/28 11:00 | Domain name resolution | | | | |
Accident investigation records
A "hypothesis" is an assumption made by the troubleshooting personnel about the cause of the fault.
The purpose of this table is to prevent different people from checking the same hypothesis repeatedly. It also makes it easier for others to verify the results.
| Time | Hypothesis | Investigation method | Result | Investigator | Remarks |
|---|---|---|---|---|---|
| 9/28 10:00 | There is a business logic error in the login phase | | | | |
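If the investigation records are kept in a tool rather than a shared document, the "no repeated checks" goal can be enforced when a hypothesis is registered. This is a minimal sketch under that assumption; `HypothesisLog` and its methods are hypothetical, and matching is done only on normalized text, so differently worded duplicates are not detected.

```python
from typing import Optional


class HypothesisLog:
    """Tracks who is checking which hypothesis (hypothetical helper)."""

    def __init__(self) -> None:
        self._owners: dict[str, str] = {}

    @staticmethod
    def _key(hypothesis: str) -> str:
        # Normalize case and whitespace so trivial variations match.
        return " ".join(hypothesis.lower().split())

    def claim(self, hypothesis: str, person: str) -> bool:
        """Register a hypothesis; return False if it is already taken."""
        key = self._key(hypothesis)
        if key in self._owners:
            return False
        self._owners[key] = person
        return True

    def owner(self, hypothesis: str) -> Optional[str]:
        """Return who is checking the hypothesis, or None if nobody is."""
        return self._owners.get(self._key(hypothesis))


log = HypothesisLog()
text = "There is a business logic error in the login phase"
print(log.claim(text, "Zhang San"))  # first claim is accepted
print(log.claim(text, "Li Si"))      # duplicate claim is rejected
print(log.owner(text))
```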
Accident recovery
The verification process after the accident has been repaired.
Restore validation
Testers and product staff verify whether the business functions are working normally.
| Time | Test case | Result | Recorder | Remarks |
|---|---|---|---|---|
| 9/28 11 p.m. | Log in on the app | Success | Zhang San | |
| 9/28 11 p.m. | Log in on the device | Success | Li Si | |
Post event summary
Review meeting
A meeting must be held, even if it is only a short discussion.
The golden three questions:
- First question: what caused the failure?
- Second question: what should we do, and how, to ensure that similar failures do not occur next time?
- Third question: what could we have done at the time to restore business in less time?
Preventive measures
Since a meeting has been held, it must produce a prevention plan.
Follow-up actions
Follow-up actions can be tracked in a Kanban system. Each action must be executable and precise.
| Action | Executor | Verifier | Planned completion time | Actual completion time |
|---|---|---|---|---|