Operations specification: process template for handling online faults

2022-06-21 23:15:00 【Brother Xing plays with the clouds】

The handling process and documentation template to follow when an accident occurs.

Accident handling process

Basic principle: for every measure and action taken during troubleshooting, business recovery is the highest priority.

Process mechanism

After a fault is found, the on-call SRE or operations engineer acts as Fault commander and has the authority to call in the relevant business developers or other necessary resources and quickly organize the accident handling team.
If the problem and the recovery procedure are clear, the Fault commander remains the on-call SRE or operations engineer, with no handover. That person directs everyone's specific tasks and gives priority to business recovery.
If the problem is difficult and the impact is large, the SRE can ask a higher-level supervisor to step in, such as the SRE supervisor or a director. The general principle is that whoever's business is most affected leads the organization. The SRE then hands the Fault commander role over to that higher-level supervisor. If the whole site is affected, the technical VP or CTO can take on the Fault commander role if necessary, or authorize a director to do so.
After the problem is solved, functional verification is required.
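
The handover rules above can be sketched as a small decision function. This is only an illustration: the `Fault` fields and the `choose_commander` name are hypothetical, and the real decision is a judgment call made by the people involved.

```python
from dataclasses import dataclass


@dataclass
class Fault:
    # Hypothetical fields that summarize the situation described above.
    cause_is_clear: bool      # problem and recovery procedure are well understood
    high_impact: bool         # the business impact is large
    whole_site: bool = False  # the entire site is affected


def choose_commander(fault: Fault) -> str:
    """Return the role that should act as Fault commander."""
    if fault.whole_site:
        # Whole-site impact: the role can go up to the technical VP or CTO.
        return "technical VP or CTO (or an authorized director)"
    if not fault.cause_is_clear and fault.high_impact:
        # Difficult problem with large impact: escalate and hand over.
        return "higher-level supervisor of the most affected business"
    # Clear problem and recovery path: no handover.
    return "on-call SRE or operations engineer"


print(choose_commander(Fault(cause_is_clear=True, high_impact=False)))
```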

Detailed flow chart

```sequence
OnCall operations->Fault: Discover the fault
OnCall operations->OnCall operations: Preliminary analysis of the fault cause
OnCall operations->Accident handling team: Gather business developers or other necessary resources
Accident handling team->Accident handling team: Accident feedback (every 10 to 15 minutes)
Accident handling team->Accident handling: Accident investigation
OnCall operations-->Senior supervisor: Problem is difficult and impact is large, escalate the accident
Senior supervisor-->Accident handling team: Take full control, coordinate the next steps
Accident handling->Accident handling: Check recent releases
Accident handling->Accident handling: Check services and infrastructure
Accident handling->Accident handling: Solve the problem
Accident handling->Accident handling team: Investigation records
Fault->Accident recovery: Perform recovery verification
Accident recovery->Accident handling team: Notify recovery results
OnCall operations->Post event summary: Organize the fault review meeting
Note right of Post event summary: Summarize the causes, solve the problem
Post event summary->Accident handling team: Output meeting summary and fault report
```

Business symptoms of the accident

Record who reported which problem and when. Be as detailed as possible, for example device ID and user ID.

Frequency of the accident

Intermittent or always reproducible.

How to reproduce the accident

Record the steps so that everyone can reproduce the problem.

Accident timeline record

Record the operations performed before and during the accident as a timeline of events.
Note: record times as precisely as possible.

Recorder: (to be recorded by a designated person)

Time	Event	Operator	Remarks
2021/09/28 12:20:20	Change LB bandwidth from 10Mb to 20Mb
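
To keep the timeline consistent when several people contribute, the designated recorder could collect entries in a small structure and render them in time order. The sketch below is one possible approach; `TimelineEvent` and `to_rows` are hypothetical names, and the column layout simply mirrors the table above.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimelineEvent:
    time: datetime
    event: str
    operator: str = ""
    remarks: str = ""


def to_rows(events):
    """Return the header plus tab-separated rows, sorted by time."""
    header = "Time\tEvent\tOperator\tRemarks"
    rows = [
        f"{e.time:%Y/%m/%d %H:%M:%S}\t{e.event}\t{e.operator}\t{e.remarks}"
        for e in sorted(events, key=lambda e: e.time)
    ]
    return [header] + rows


events = [
    TimelineEvent(datetime(2021, 9, 28, 12, 20, 20),
                  "Change LB bandwidth from 10Mb to 20Mb"),
]
print("\n".join(to_rows(events)))
```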

Accident handling team

The accident responders form an accident group, which makes communication easier.

Set up a dedicated emergency group that includes the key roles for the affected products. When a fault occurs, report it to the group as soon as possible.

Accident feedback

Feedback is generally given per team, once every 10 to 15 minutes, covering the current progress and the next action. If an operation needs to be performed in the meantime, announce it in advance, and include its impact on the business and the system in the announcement. It is executed only after the Fault commander has made the decision, to avoid mistakes made in haste. No progress is still progress and should also be reported on time.
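
A simple way to enforce the feedback rhythm is to track each team's last report and flag the ones that are overdue. The sketch below assumes the upper bound of the 10 to 15 minute window; `FEEDBACK_INTERVAL`, `overdue_teams`, and the team names are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: use the upper bound of the 10 to 15 minute window.
FEEDBACK_INTERVAL = timedelta(minutes=15)


def overdue_teams(last_feedback, now):
    """Return the sorted names of teams whose last feedback is older than the interval.

    last_feedback maps a team name to the datetime of its most recent report.
    """
    return sorted(
        team for team, ts in last_feedback.items()
        if now - ts > FEEDBACK_INTERVAL
    )


now = datetime(2021, 9, 28, 10, 30)
last_feedback = {
    "backend": now - timedelta(minutes=5),
    "infra": now - timedelta(minutes=20),
}
print(overdue_teams(last_feedback, now))
```

The Fault commander could run a check like this on a timer and ask the listed teams for an update, even if the update is "no progress".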

Accident investigation

Recent release information

This can include the commitId, time, and people involved in the most recent system release.

Test feedback

Testers' feedback during troubleshooting, which helps developers locate the problem.

Time	Test case	Result	Recorder
9/28 11 p.m.	Log in on the APP	Success	Zhang San
9/28 11 p.m.	Log in on the device	Failure	Li Si

Service situation

Each service in the team should have a designated owner. When an online failure occurs, every owner is responsible for checking the services they own. The inspection documents must retain evidence.

Time	Service name	Inspection contents and results	Current state	Inspector	Remarks
9/28 10:30	echo	values configuration is correct; applied version: v2.1; CPU and memory year-on-year and month-on-month comparisons are normal; application configuration is correct; ERROR level increased suddenly from 7 o'clock on 9/28.		Zhang San

Infrastructure

The infrastructure team is responsible for checking the infrastructure .

Time	Component	Inspection content
9/28 10:00	LB	Bandwidth, packet and traffic rates
9/28 10:00	NAT
9/28 10:00	Redis
9/28 10:00	PostgreSQL
9/28 11:00	Domain name resolution

Accident investigation records

A "hypothesis" is an assumption made by the troubleshooters about the cause of the fault.
The purpose of this table is to prevent different people from checking the same hypothesis repeatedly. It also makes it easier for others to verify the results.

Time	Hypothesis	Investigation method	Result	Investigator	Remarks
9/28 10:00	There is a business logic error in the login phase
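
The deduplication purpose of this table can be illustrated with a small registry in which a troubleshooter claims a hypothesis before investigating it. This is a minimal sketch; `HypothesisLog` and `claim` are hypothetical names, and matching on normalized text is a simplification of how people would actually compare hypotheses.

```python
from dataclasses import dataclass, field


@dataclass
class HypothesisLog:
    # Maps a normalized hypothesis text to the person investigating it.
    entries: dict = field(default_factory=dict)

    @staticmethod
    def _key(text: str) -> str:
        # Normalize case and whitespace so trivial variations still match.
        return " ".join(text.lower().split())

    def claim(self, hypothesis: str, person: str):
        """Register the hypothesis for person.

        Returns None if the claim succeeded, or the name of the person
        who already claimed the same hypothesis.
        """
        key = self._key(hypothesis)
        if key in self.entries:
            return self.entries[key]
        self.entries[key] = person
        return None


log = HypothesisLog()
log.claim("There is a business logic error in the login phase", "Zhang San")
# A second person trying the same hypothesis learns who is already on it.
print(log.claim("there is a business logic error in the login phase", "Li Si"))
```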

Accident recovery

The verification process after the accident has been fixed.

Recovery verification

Testers and product staff verify whether the business functions are normal.

Time	Test case	Result	Recorder
9/28 11 p.m.	Log in on the APP	Success	Zhang San
9/28 11 p.m.	Log in on the device	Success	Li Si
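
As a rough illustration of the sign-off rule, recovery can be treated as verified only when at least one test case was run and every recorded case succeeded. The function name `recovery_verified` and the tuple layout are assumptions made for this sketch.

```python
def recovery_verified(results):
    """Return True only if there is at least one result and all of them succeeded.

    results is a list of (test_case, result) tuples taken from the table.
    """
    return bool(results) and all(
        result.strip().lower() == "success" for _, result in results
    )


during_accident = [("Log in on the APP", "Success"), ("Log in on the device", "Failure")]
after_fix = [("Log in on the APP", "Success"), ("Log in on the device", "Success")]

print(recovery_verified(during_accident))
print(recovery_verified(after_fix))
```

Requiring a non-empty list avoids declaring recovery when nobody has actually run the test cases.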

Post event summary

Review meeting

A meeting must be held to talk things through.

The three golden questions:

First question: What caused the failure?
Second question: What did we do, and what can be done to ensure that similar failures do not occur again?
Third question: What could we have done to restore the business in less time?