
Service online governance

2022-07-04 22:08:00 InfoQ

Online governance uses the results of quantitative analysis to adjust the running state of online services through corresponding contingency plans, keeping those services operating normally. This section discusses the common plans for online services and how to make sure they are triggered and adjusted automatically.

The ideal way to locate a fault quickly and stop the loss is to connect fault location with plan execution: when something goes wrong, the system determines the general type of fault and the corresponding plan, then triggers that plan automatically.

In practice, fault handling involves many considerations. Not every fault can be prepared for in advance, but faults can be classified using historical incidents and prior knowledge, and a plan can be established for each class.

In addition, a plan should be easy to execute and trigger; if it is inconvenient to execute, the fault is hard to handle within a short time. The most critical questions are when a plan should be triggered and whether it should be executed in the current situation. Stability failures of online services can be broadly classified by the following causes.

  • Failures caused by change
Change is the main source of stability failures. In the broad sense, a system has many sources of change. The most common service changes are application changes, configuration changes, and data changes. Beyond service changes, changes in environment and hardware, such as changes in network bandwidth or data center links, also fall under the broad category of change.

  • Failures caused by traffic and capacity changes
This type of failure corresponds to the sudden changes in inbound traffic analyzed in the earlier discussion of stability assurance. If the service has no adequate response mechanism prepared in advance, such changes create potential stability hazards.

  • Dependency failures
A failure in a dependent service affects the upstream services that call it. Dependency failures divide into strong-dependency failures and weak-dependency failures, and each has its own handling method.

  • Data center, network, and other hardware and environment failures
Hardware and environment failures are unpredictable and highly random. Once they occur, they are often system-level problems with serious consequences.

  • Other
For example, a failure caused by an ID generator overflow.

A fault scenario refines the stability failure categories above into cases that are easy to identify and judge, such as a sudden increase in inbound traffic, an access layer failure, a strong-dependency service failure, or a weak-dependency service failure. The purpose of dividing scenarios this way is to narrow down the cause of the fault further (not necessarily the most fundamental root cause; the classification is made from the perspective of stopping the loss). Scenario classification is therefore best applied where the fault is easy to judge from observability metrics and a corresponding plan is convenient to formulate.
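The scenario classification above can be made concrete as a small registry that maps each scenario to its plan. The following is a minimal sketch, not a definitive implementation: the scenario names follow the text, while `PLAN_REGISTRY`, `plan_for`, the plan names, and the specific scenario-to-plan pairings are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Scenario(Enum):
    """Fault scenarios named in the text, classified for stop-loss purposes."""
    INBOUND_TRAFFIC_SURGE = "sudden increase of inbound traffic"
    ACCESS_LAYER_FAILURE = "access layer failure"
    STRONG_DEPENDENCY_FAILURE = "strong-dependency service failure"
    WEAK_DEPENDENCY_FAILURE = "weak-dependency service failure"


# Hypothetical mapping from scenario to plan; the pairings are assumptions.
# Scenarios without an entry have no automatic plan and need manual handling.
PLAN_REGISTRY = {
    Scenario.INBOUND_TRAFFIC_SURGE: "rate_limiting",
    Scenario.STRONG_DEPENDENCY_FAILURE: "redundancy_switchover",
    Scenario.WEAK_DEPENDENCY_FAILURE: "degradation",
}


def plan_for(scenario):
    """Return the registered plan name, or None when no automatic plan exists."""
    return PLAN_REGISTRY.get(scenario)
```

Keeping unregistered scenarios out of the mapping, rather than giving them a default plan, means a scenario that is hard to judge falls back to a human instead of triggering an automatic action.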

Scenarios handled by degradation, rate limiting, and redundancy switchover are relatively clear-cut: faults can be diagnosed from metrics and connected automatically to the plan. Take a dependent service failure as an example. The period-over-period change in the success rate metric determines whether the dependent service is abnormal. If the metrics show that an anomaly does exist, first query the change management platform to check whether the dependent service currently has any related change operations.
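As an illustration of judging a dependency from the change in its success rate between two consecutive windows, here is a minimal sketch. The function names, the 5% relative-drop threshold, and the treatment of empty windows are all assumptions made for the example; the source does not specify them.

```python
def success_rate(success_count, total_count):
    """Success rate for one window; an empty window is treated as healthy (assumption)."""
    if total_count == 0:
        return 1.0
    return success_count / total_count


def is_dependency_abnormal(previous_rate, current_rate, max_relative_drop=0.05):
    """Flag an anomaly when the success rate falls by more than
    max_relative_drop relative to the previous window."""
    if previous_rate <= 0:
        # No healthy baseline to compare against; leave the decision to a human.
        return False
    relative_drop = (previous_rate - current_rate) / previous_rate
    return relative_drop > max_relative_drop
```

For instance, a drop from `success_rate(995, 1000)` to `success_rate(900, 1000)` is flagged, while a drop to `success_rate(990, 1000)` is not. A real threshold would need tuning per service to avoid misjudgment.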

If there is a change, recommend that the owner of the dependent service interface roll it back immediately. If there is no change operation, determine whether the current call to the dependent service is a strong or a weak dependency. For a weak dependency, the automatic degradation plan can be started to degrade the dependent service. For a strong dependency, degradation will not solve the problem; instead, a redundancy switchover plan prepared in advance can be started to switch traffic at the service, cluster, or data center level.
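The decision flow for a dependent service failure can be sketched as a single function. This is a simplified illustration: the `Action` enum and `choose_action` are hypothetical names, and the three boolean inputs stand in for the metric check, the change management platform query, and the dependency classification, none of whose real interfaces are given in the source.

```python
from enum import Enum


class Action(Enum):
    NO_ACTION = "no action"
    RECOMMEND_ROLLBACK = "recommend that the dependency owner roll back the change"
    DEGRADE = "start the automatic degradation plan"
    SWITCH_TRAFFIC = "start the redundancy switchover plan"


def choose_action(is_abnormal, has_recent_change, is_strong_dependency):
    """Pick a stop-loss action for a dependent service, following the flow in the text."""
    if not is_abnormal:
        return Action.NO_ACTION
    if has_recent_change:
        # A change is the most likely cause, so rollback comes first.
        return Action.RECOMMEND_ROLLBACK
    if is_strong_dependency:
        # Degrading a strong dependency does not help; switch traffic instead.
        return Action.SWITCH_TRAFFIC
    return Action.DEGRADE
```

Returning an action instead of executing it keeps the decision separate from the side effects, which makes the flow easy to rehearse without touching production traffic.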

Connecting metrics to plans aims to move fault location toward automation and intelligence, but it has to be advanced step by step according to the actual situation. For scenarios that are hard to judge, caution is advised to avoid misjudgment. Plans should also be rehearsed regularly to make sure they trigger effectively.
