Operations specification: a process template for handling production incidents

2022-06-21 23:15:00 Brother Xing plays with the clouds

The handling process and documentation template to follow when an incident occurs.

Incident handling process

Basic principle: in every measure and action taken during troubleshooting, restoring the business is the highest priority.

Process mechanism

  1. After a fault is found, the on-call SRE or operations engineer acts as the fault commander and has the authority to call in the relevant business developers or other necessary resources to quickly form an incident handling team.
  2. If the problem and the recovery path are clear, the on-call SRE or operations engineer remains the fault commander with no handover, directs everyone on specific tasks, and prioritizes restoring the business.
  3. If the problem is difficult or the impact is large, the SRE can ask a higher-level manager, such as an SRE supervisor or director, to step in. The general principle is that whoever's business is most affected leads the response. The SRE then hands the fault commander role over to that manager. If the whole site is affected, the VP of technology or the CTO can take on the fault commander role if necessary, or authorize a director to do so.
  4. After the problem is solved, functional verification is required.
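
The handover rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Scope` levels, the `Incident` fields, and the `pick_commander` function are hypothetical names, and the sketch simplifies "if necessary" to "whenever an executive is named for a site-wide incident".

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    """Hypothetical impact levels derived from the rules above."""
    CLEAR = "clear"          # cause and recovery path are well understood
    DIFFICULT = "difficult"  # hard problem or large impact
    SITE_WIDE = "site_wide"  # the whole site is affected


@dataclass
class Incident:
    scope: Scope
    on_call: str          # on-call SRE or operations engineer who found the fault
    supervisor: str = ""  # SRE supervisor or director, if one is available
    executive: str = ""   # VP of technology, CTO, or an authorized director


def pick_commander(incident: Incident) -> str:
    """Return who should act as fault commander for this incident."""
    # Site-wide impact: an executive (or authorized director) may take over.
    if incident.scope is Scope.SITE_WIDE and incident.executive:
        return incident.executive
    # Difficult or high-impact problems are handed to a higher-level manager.
    if incident.scope in (Scope.DIFFICULT, Scope.SITE_WIDE) and incident.supervisor:
        return incident.supervisor
    # Clear problems stay with the on-call engineer; no handover.
    return incident.on_call
```

Falling back to the on-call engineer when no manager is named keeps the commander role from ever being empty, which matches the principle that someone always directs the recovery.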

Detailed flow chart

```sequence
OnCall ops->Fault: Find the fault
OnCall ops->OnCall ops: Preliminary analysis of the cause
OnCall ops->Incident handling team: Gather business developers or other necessary resources
Incident handling team->Incident handling team: Incident feedback (every 10-15 minutes)
Incident handling team->Incident handling: Incident investigation
OnCall ops-->Senior manager: Problem is difficult or impact is large, escalate
Senior manager-->Incident handling team: Take full control, agree on the next steps
Incident handling->Incident handling: Recent releases
Incident handling->Incident handling: Services and infrastructure
Incident handling->Incident handling: Solve the problem
Incident handling->Incident handling team: Investigation record
Fault->Incident recovery: Perform recovery verification
Incident recovery->Incident handling team: Notify recovery results
OnCall ops->Postmortem summary: Organize the postmortem meeting
Note right of Postmortem summary: Summarize the causes, solve the problem
Postmortem summary->Incident handling team: Output meeting minutes and fault report
```

Incident symptoms

Record who reported which problem and when, in as much detail as possible, including identifiers such as device IDs and user IDs.

Incident frequency

Intermittent or always reproducible.

Reproduction steps

Written down so that everyone can reproduce the problem.

Incident timeline

Record what happened before the incident and the operations performed during it, as a timeline of events.

Note: record times as precisely as possible.

Recorder: (to be recorded by a designated person)

| Time | Event | Operator | Remarks |
| --- | --- | --- | --- |
| 2021/09/28 12:20:20 | Raise LB bandwidth from 10Mb to 20Mb | | |
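
If the timeline is kept in a tool rather than a document, a minimal sketch could look like the following. The `TimelineEntry` and `Timeline` names are hypothetical; the fields simply mirror the Time, Event, Operator, and Remarks columns plus the designated recorder.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEntry:
    time: datetime      # when the event or operation happened
    event: str          # what happened or what was done
    operator: str = ""  # who performed the operation
    remarks: str = ""


class Timeline:
    """Timeline kept by a single designated recorder."""

    def __init__(self, recorder: str) -> None:
        self.recorder = recorder
        self._entries: list[TimelineEntry] = []

    def record(self, entry: TimelineEntry) -> None:
        self._entries.append(entry)

    def entries(self) -> list[TimelineEntry]:
        # Entries may be written down late, so always present them in time order.
        return sorted(self._entries, key=lambda e: e.time)
```

Sorting on read instead of on write lets the recorder add an event that was reported late without disturbing the rest of the record.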

Incident handling team

The incident responders form a dedicated incident group so that communication is easy.

Set up a dedicated emergency group and bring the key roles for the affected product into it, so that any fault is reported to the group as soon as it occurs.

Incident feedback

Feedback is generally given per team, once every 10 to 15 minutes, covering current progress and the next action. If an operation needs to be performed in the meantime, announce it in advance, including its impact on the business and the system. The operation is carried out only after the fault commander decides, to avoid mistakes made in haste. No progress is also progress and should be reported on time.
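
The feedback cadence can be checked mechanically. The sketch below assumes the upper bound of the 10 to 15 minute window as the threshold; `feedback_overdue` and `overdue_teams` are hypothetical helper names, not part of any existing tool.

```python
from datetime import datetime, timedelta

# Assumption: use the upper bound of the 10-15 minute window as the limit.
FEEDBACK_INTERVAL = timedelta(minutes=15)


def feedback_overdue(last_feedback: datetime, now: datetime,
                     interval: timedelta = FEEDBACK_INTERVAL) -> bool:
    """True when a team has gone longer than `interval` without reporting."""
    return now - last_feedback > interval


def overdue_teams(last_feedback_by_team: dict[str, datetime],
                  now: datetime) -> list[str]:
    """Names of teams that owe an update, sorted for stable output."""
    return sorted(team for team, ts in last_feedback_by_team.items()
                  if feedback_overdue(ts, now))
```

A team with nothing new to say still resets its timestamp by reporting "no progress", which is exactly the behavior the rule asks for.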

Incident investigation

Recent release information

This can include the commit ID of the most recent release, its time, the people involved, and so on.

Test feedback

Feedback from testers during troubleshooting, which helps developers track down the problem.

| Time | Test case | Result | Recorder | Remarks |
| --- | --- | --- | --- | --- |
| 9/28 11 p.m. | Log in on the app | Success | Zhang San | |
| 9/28 11 p.m. | Log in on the device | Failure | Li Si | |

Service status

Every service in the team should have an owner. When a production failure occurs, each owner checks the services they are responsible for. The record of each check must retain its evidence.

| Time | Service name | Inspection contents and results | Current state | Examiner | Remarks |
| --- | --- | --- | --- | --- | --- |
| 9/28 10:30 | echo | values configuration is correct; application version: v2.1; CPU and memory are normal in year-on-year and month-on-month comparisons; application configuration is correct; ERROR-level logs increased suddenly from 7:00 on 9/28. | | Zhang San | |

Infrastructure

The infrastructure team is responsible for checking the infrastructure.

| Time | Component | Inspection content | Current state | Examiner | Remarks |
| --- | --- | --- | --- | --- | --- |
| 9/28 10:00 | LB | Bandwidth, packet flow rate | | | |
| 9/28 10:00 | NAT | | | | |
| 9/28 10:00 | Redis | | | | |
| 9/28 10:00 | PostgreSQL | | | | |
| 9/28 11:00 | Domain name resolution | | | | |

Incident investigation records

A "hypothesis" is an assumption that a troubleshooter makes about the cause of the fault.

The purpose of this table is to keep different people from checking the same hypothesis twice. It also makes it easy for others to verify the results.

| Time | Hypothesis | Investigation method | Result | Checked by | Remarks |
| --- | --- | --- | --- | --- | --- |
| 9/28 10:00 | There is a business logic error in the login phase | | | | |
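
As a rough illustration of how such a table prevents duplicate work, the sketch below lets a troubleshooter claim a hypothesis before investigating it. `HypothesisLog` and its methods are hypothetical names, and matching hypotheses by case-insensitive, whitespace-normalized text is an assumption made to keep the example short.

```python
from typing import Optional


class HypothesisLog:
    """Tracks which hypotheses are already being checked, and by whom."""

    def __init__(self) -> None:
        self._claimed: dict[str, str] = {}

    @staticmethod
    def _key(hypothesis: str) -> str:
        # Normalize case and whitespace so trivially different wordings collide.
        return " ".join(hypothesis.lower().split())

    def claim(self, hypothesis: str, checker: str) -> bool:
        """Register a hypothesis; return False if someone already took it."""
        key = self._key(hypothesis)
        if key in self._claimed:
            return False
        self._claimed[key] = checker
        return True

    def checker_of(self, hypothesis: str) -> Optional[str]:
        """Who is checking this hypothesis, or None if nobody is."""
        return self._claimed.get(self._key(hypothesis))
```

Text matching only catches identical wording; two differently phrased versions of the same idea still need a human to spot the overlap in the table.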

Incident recovery

The verification process after the incident is fixed.

Recovery verification

Testers and product staff verify that business functions are working normally.

| Time | Test case | Result | Recorder | Remarks |
| --- | --- | --- | --- | --- |
| 9/28 11 p.m. | Log in on the app | Success | Zhang San | |
| 9/28 11 p.m. | Log in on the device | Success | Li Si | |

Postmortem summary

Postmortem meeting

A meeting must be held to talk the incident through.

The three golden questions:

  • First question: what caused the failure?
  • Second question: what can we do to make sure a similar failure does not happen again?
  • Third question: what could we have done to restore the business in less time?

Preventive measures

Since a meeting is held, it must produce a prevention plan.

Follow-up actions

Follow-up actions can be tracked in a Kanban system. Each action must be executable and precise.

| Action | Executor | Verifier | Planned completion time | Actual completion time |
| --- | --- | --- | --- | --- |
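
One way to make "executable and precise" checkable is to refuse actions that lack an executor, a verifier, or a planned date. The sketch below is only an illustration: `ActionItem` and its field names are hypothetical and simply mirror the columns above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    action: str
    executor: str
    verifier: str
    planned_completion: date
    completed_on: Optional[date] = None  # filled in once the action is done

    def missing_fields(self) -> list[str]:
        """Names of required text fields that were left blank."""
        required = {
            "action": self.action,
            "executor": self.executor,
            "verifier": self.verifier,
        }
        return [name for name, value in required.items() if not value.strip()]

    def is_overdue(self, today: date) -> bool:
        """True when the action is still open after its planned date."""
        return self.completed_on is None and today > self.planned_completion
```

A Kanban card can carry the same fields, so the check can run wherever the actions are tracked.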

Original site

Copyright notice
This article was written by [Brother Xing plays with the clouds]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/172/202206212052144356.html