Online service governance
2022-06-30 13:39:00 【51CTO】
Online governance uses the results of quantitative analysis to adjust the running state of online services through corresponding contingency plans, keeping those services operating normally. This article discusses the common plans for online services and how to ensure that a plan is triggered and adjusted automatically.
The ideal way to locate a fault quickly and stop the loss is to connect fault location with plan execution: when something goes wrong, the system can judge the general type of fault, select the corresponding plan, and trigger that plan automatically.
Of course, real fault handling involves many considerations. Not every fault can be planned for in advance, but faults can be classified using historical incidents and prior knowledge, and a plan can be established for each class.
In addition, a plan should be easy to execute and easy to trigger. If it is inconvenient to execute, the fault is difficult to handle in a short time. The most critical questions are when the plan should be triggered and whether it should be executed right now. Stability failures in online services can be broadly classified by the following causes.
- Failures caused by change
Change is the main source of stability failures, and in the broad sense a system has many sources of change. The most common service changes are application changes, configuration changes, and data changes. Beyond service changes, changes in environment and hardware, such as network bandwidth changes or machine room link changes, also fall into the broad category of change.
- Failures caused by traffic and capacity changes
This type of failure corresponds to the sudden changes in inbound traffic analyzed in the earlier discussion of stability assurance. If the service has no adequate response mechanism prepared in advance, it carries a latent stability risk.
- Dependency failures
A failure in a dependency affects the upstream services that call it. Dependency failures can be divided into strong dependency failures and weak dependency failures, and each has its own handling method.
- Machine room, network, and other hardware and environment failures
Hardware and environment failures are unpredictable, highly random, and highly contingent. Once one happens, it is often a system-level problem with serious consequences.
- Other
For example, failures caused by ID generator overflow.
Fault scenarios take the stability failure categories above and refine them into cases that are easy to identify and judge, such as a sudden increase in ingress traffic, access layer failure, strong dependency failure, and weak dependency failure. The purpose of dividing scenarios is to narrow down the cause of the fault (not necessarily the most fundamental cause; the classification is made from the perspective of stopping the loss). Scenario classification therefore suits cases where the fault is easy to judge from observability metrics and a corresponding plan is easy to formulate.
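The scenario-to-plan relationship described above can be sketched as a small lookup table. This is only an illustration: the names `Scenario`, `Plan`, `PLANS`, and `plan_for`, the plan identifiers, and the `auto_trigger` flags are all hypothetical. Pairing a traffic surge with rate limiting is an assumption; the article only names rate limiting as one of the clear-cut plans. The access layer scenario is deliberately left unmapped to show the case where no plan has been prepared.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Scenario(Enum):
    """Fault scenarios named in the article."""
    INGRESS_TRAFFIC_SURGE = "ingress_traffic_surge"
    ACCESS_LAYER_FAILURE = "access_layer_failure"
    STRONG_DEPENDENCY_FAILURE = "strong_dependency_failure"
    WEAK_DEPENDENCY_FAILURE = "weak_dependency_failure"


@dataclass(frozen=True)
class Plan:
    name: str           # hypothetical plan identifier
    auto_trigger: bool  # assumption: whether the plan may run without a human


# Hypothetical registry: one prepared plan per scenario that has one.
PLANS = {
    Scenario.INGRESS_TRAFFIC_SURGE: Plan("rate_limit", auto_trigger=True),
    Scenario.WEAK_DEPENDENCY_FAILURE: Plan("degrade_dependency", auto_trigger=True),
    Scenario.STRONG_DEPENDENCY_FAILURE: Plan("redundancy_traffic_switch", auto_trigger=False),
}


def plan_for(scenario: Scenario) -> Optional[Plan]:
    """Return the prepared plan, or None when the scenario has no plan yet."""
    return PLANS.get(scenario)
```

A caller would treat a `None` result as a signal to fall back to manual handling rather than guess at a plan.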
The degradation, rate limiting, and redundancy traffic-switching scenarios are relatively clear-cut: faults can be diagnosed from metrics, and the diagnosis can be linked automatically to a plan. Take a dependency failure as an example. The period-over-period change in the success-rate metric indicates whether the dependency is abnormal. If the metric confirms an anomaly, first query the change management platform to check whether the dependency currently has any related change operations.
If there are changes, advise the contact person for the dependency to roll back the change immediately. If there are no change operations, determine whether the current call is a strong or a weak dependency. For a weak dependency, start the automatic degradation plan to degrade the dependency. For a strong dependency, degradation will not solve the problem; instead, use a redundancy switching plan prepared in advance and start traffic switching at the service, cluster, or machine room level.
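The dependency-failure flow above can be sketched as a short decision function. This is a minimal sketch under stated assumptions, not a reference implementation: the names `Action`, `success_rate_dropped`, and `decide` are hypothetical, the 5% relative-drop threshold is an arbitrary illustrative value, and the change-platform lookup and dependency strength are passed in as plain booleans instead of being queried from real systems.

```python
from enum import Enum


class Action(Enum):
    NONE = "no_action"
    ROLLBACK = "advise_rollback"
    DEGRADE = "auto_degrade"
    SWITCH = "redundancy_traffic_switch"


def success_rate_dropped(current: float, previous: float,
                         max_relative_drop: float = 0.05) -> bool:
    """Period-over-period check: has the success rate fallen by more than
    max_relative_drop relative to the previous period?
    The 0.05 default is an assumed threshold, not a value from the article."""
    if previous <= 0:
        return False  # no usable baseline, so do not flag an anomaly
    return (previous - current) / previous > max_relative_drop


def decide(current: float, previous: float,
           has_recent_change: bool, is_strong_dependency: bool) -> Action:
    """Follow the order in the text: metric check, change check, then
    strong versus weak dependency."""
    if not success_rate_dropped(current, previous):
        return Action.NONE
    if has_recent_change:
        return Action.ROLLBACK  # a change is the likeliest cause: roll it back
    if is_strong_dependency:
        return Action.SWITCH    # degradation cannot help a strong dependency
    return Action.DEGRADE       # a weak dependency can be degraded automatically
```

For example, `decide(0.80, 0.99, False, False)` selects automatic degradation, while the same drop with `has_recent_change=True` selects the rollback advice regardless of dependency strength.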
Linking metrics to plans aims to move fault location toward automation and intelligence, but it has to be advanced step by step according to the actual situation. For scenarios that are hard to judge, caution is recommended to avoid misjudgment. Plans should also be rehearsed regularly to ensure that their triggers remain effective.