
Some thoughts on how to do a good job of operation and maintenance management

2022-06-11 18:14:00 wx61eaae213a986


At the end of September, the two systems our team is responsible for had two production failures within a few weeks. There was no serious loss in the end, but the leadership raised its requirements, and we had to sort out every possible optimization measure around the goal of keeping the systems safe and stable and avoiding a recurrence of the failures. I am taking this opportunity to organize my own views on how to do a good job of operation and maintenance management. Comments and suggestions are welcome.

The operation and maintenance discussed in this article is the operation and maintenance of application systems. Unlike traditional Linux or database operation and maintenance, it is mainly concerned with whether an online business system can run safely and stably: whether the system runs normally day to day, whether it can recover quickly from a production failure, and whether there are means in place to deal with emergencies. There is one overall aim, which is to ensure that, whatever the circumstances, there are measures or means to quickly restore the business. Of course, different business systems can be assigned different guarantee levels according to how time-critical and how important the business is. This article does not go into that classification.

Compared with the operation and maintenance of a single product, application system operation and maintenance has two characteristics:


  • There are many objects to operate and maintain, and they can be complex. These include the basic operating system and databases, self-developed applications, and often a good deal of open source software such as Kafka, ZooKeeper, Redis, and MongoDB.
  • There are many people to deal with in the process. These include operating system administrators, database administrators, administrators of the various monitoring tools, project developers, and business department staff.

These two characteristics mean that the people who manage application system operation and maintenance cannot be proficient in every field. Reaching the operation and maintenance goal therefore depends more on management means than on technical ability. The management means can be divided into three levels:


  • Means to find problems as early as possible .
  • Means to quickly restore business .
  • Means for emergency disposal of faults .

I think the importance of these three levels decreases in that order: if the earlier work is done well, the later measures are less likely to be needed. The specific means at each level are described below.

Means to find problems as early as possible


  • Monitoring. Its importance goes without saying. Comprehensive monitoring metrics and sensitive thresholds bring in many alarm messages, but what matters most is being able to tell which of them will affect the business.
  • Routine inspection. The inspection itself should be automated, but the inspection report must be interpreted by the operation and maintenance managers, who should try to get to the bottom of every exception. Inspection covers the operating system (for example disk space and file handles), the database (for example AWR reports and slow queries), and the business system (for example whether the business calendar is correct and whether the number of online users has hit a new high).
  • Duty system. Whether or not the duty is on-site, I think it is very necessary to keep a duty system. A duty roster assigns responsibility to specific people and avoids the "three monks have no water to drink" situation, where everyone assumes someone else will handle it.
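
As a rough illustration of automated inspection with human interpretation, the following Python sketch compares collected metrics with limits and lists only the exceptions that a person then needs to investigate. The metric names, the limits, and the helper names (`Finding`, `run_inspection`, `exceptions`) are hypothetical and not taken from the author's systems. Only the disk check reads a real value; the other metrics are assumed to come from whatever tooling is already in place.

```python
from __future__ import annotations

import shutil
from dataclasses import dataclass


@dataclass
class Finding:
    """One inspected metric together with the limit it is compared against."""
    check: str
    value: float
    limit: float

    @property
    def abnormal(self) -> bool:
        # A metric is an exception when it reaches or exceeds its limit.
        return self.value >= self.limit


def disk_used_percent(path: str = "/") -> float:
    """Return the used share of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def run_inspection(metrics: dict[str, float], limits: dict[str, float]) -> list[Finding]:
    """Pair each collected metric with its limit; metrics without a limit are skipped."""
    return [
        Finding(name, value, limits[name])
        for name, value in metrics.items()
        if name in limits
    ]


def exceptions(findings: list[Finding]) -> list[Finding]:
    """Keep only the findings a person has to follow up on."""
    return [finding for finding in findings if finding.abnormal]


if __name__ == "__main__":
    # Hypothetical limits; real values depend on the system being inspected.
    limits = {"disk_used_pct": 85.0, "open_file_handles": 50000.0, "slow_queries": 10.0}
    metrics = {
        "disk_used_pct": disk_used_percent("/"),
        "open_file_handles": 1200.0,  # assumed to come from an existing collector
        "slow_queries": 3.0,          # assumed to come from the database report
    }
    for finding in exceptions(run_inspection(metrics, limits)):
        print(f"investigate {finding.check}: {finding.value:.1f} >= {finding.limit:.1f}")
```

The script only narrows the report down. Deciding what each exception means for the business remains a manual step, as the text argues.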

Means to quickly restore business


  • Traffic isolation. I think this is the best way to restore business, but the investment required is often very large. First, the application must be deployed as a cluster, because only then can traffic be scheduled and isolated. Second, the operation and maintenance staff need tools or means to perform the isolation. If they raise this requirement during the development stage, a lot of manpower can be saved later in operation and maintenance.
  • Restart the application. The precondition is that the application is stateless or supports graceful restart. This method can often solve 80% of production problems.
  • Restart the operating system. If restarting the application does not solve the problem, restart the operating system.
  • Active/standby switchover. For applications deployed in active/standby mode, this is often a last resort, and people are generally reluctant to try it, especially when data synchronization is involved.
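
One way to read the list above is as an escalation ladder: try the least disruptive measure first and move on only while the business is still unhealthy. The Python sketch below illustrates that reading. The function `recover`, the step names, and the health check are hypothetical placeholders, and the actions would call whatever isolation or restart tooling a team actually has.

```python
from __future__ import annotations

from typing import Callable, Iterable, Optional, Tuple

# A recovery step is a label plus a function that performs the action.
Step = Tuple[str, Callable[[], None]]


def recover(steps: Iterable[Step], healthy: Callable[[], bool]) -> Optional[str]:
    """Run the steps in order and stop at the first one after which `healthy()` is true.

    Returns the name of the step that restored service, or None if every
    step was tried and the health check still fails.
    """
    for name, action in steps:
        action()
        if healthy():
            return name
    return None


if __name__ == "__main__":
    # Simulated system: the health check starts passing once the application is restarted.
    state = {"ok": False}

    def isolate_traffic() -> None:
        print("isolating the faulty node from traffic")

    def restart_application() -> None:
        print("restarting the application")
        state["ok"] = True

    def restart_operating_system() -> None:
        print("restarting the operating system")

    ladder = [
        ("traffic isolation", isolate_traffic),
        ("restart application", restart_application),
        ("restart operating system", restart_operating_system),
    ]
    print("restored by:", recover(ladder, lambda: state["ok"]))
```

Active/standby switchover is left out of the simulated ladder on purpose. Because the text treats it as a last resort, it is probably better kept as a manual decision than an automatic step.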

Before applying any of these rapid recovery measures, always remember to collect and record the on-site information from the faulty equipment. It gives developers important field data for investigating and reproducing the problem.
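
A minimal sketch of such a collection step is shown below. It assumes that a timestamp, the host name, disk usage, load average, and the tail of a few log files are a useful minimum. The function names and the choice of fields are illustrative only, and a real snapshot would likely also include thread dumps, process lists, or whatever the developers ask for.

```python
from __future__ import annotations

import json
import os
import platform
import shutil
import time
from pathlib import Path


def load_average() -> list[float] | None:
    """Return the 1, 5 and 15 minute load averages, or None where unavailable."""
    try:
        return list(os.getloadavg())
    except (AttributeError, OSError):
        return None


def collect_snapshot(log_files: list[str], tail_lines: int = 200) -> dict:
    """Gather basic host state and the last lines of each readable log file."""
    disk = shutil.disk_usage(".")
    snapshot = {
        "collected_at": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
        "host": platform.node(),
        "disk_used_pct": round(100.0 * disk.used / disk.total, 1),
        "load_average": load_average(),
        "log_tails": {},
    }
    for name in log_files:
        path = Path(name)
        if path.is_file():
            lines = path.read_text(errors="replace").splitlines()
            snapshot["log_tails"][name] = lines[-tail_lines:]
    return snapshot


def save_snapshot(snapshot: dict, directory: str) -> Path:
    """Write the snapshot as JSON into `directory` and return the file path."""
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"incident-{int(time.time())}.json"
    target.write_text(json.dumps(snapshot, indent=2))
    return target
```

Calling `save_snapshot(collect_snapshot(["/path/to/app.log"]), "/path/to/incidents")` before a restart keeps the evidence even if the restart clears the symptoms. Both paths here are placeholders.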

Means for emergency disposal of faults

If the means in the first two parts cannot solve the production problem, we reach this level, and it only works if preparations were made in advance, for example daily backups and remote backups. If there is no daily backup, there is still one ultimate solution: develop and release an emergency version.


Copyright notice
This article was written by [wx61eaae213a986]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/03/202203011841177765.html