Working ideas of stability and high availability guarantee
2022-07-26 03:59:00 【Young】
01 A deeper understanding of stability and high availability
Stability and high availability are familiar topics. Experience and intuition tell us that improving these two properties makes a system healthier and gives the product a better user experience. But how should stability and high availability actually be defined, and what are the differences and connections between them? I think we need to answer these two questions first in order to set clear goals and systematically work out a complete, feasible plan.
Wikipedia defines stability as follows:
Stability is a term from mathematics and engineering, used to determine whether a system produces a bounded output for a bounded input. If it does, the system is called stable; if not, the system is called unstable.
Now let's look at high availability:
High availability (abbreviated HA) is an IT term for the ability of a system to perform its functions without interruption; it represents the degree of availability of the system and is one of the criteria for system design. A high availability system can run longer than the individual components that make it up.
First, extract the key words from the definition of stability: system, input, output. In our current technical architecture, an application can be regarded as a system, service requests between applications as input, and service responses as output. When the service response meets expectations, the application system is considered stable. When applications combine into a larger system that is presented to users as a business product, the user's request is the input and the product's behavior is the output. When the product functions work normally, the product system can be considered stable. In summary, stability can be defined as follows: when the system receives input and produces correct, expected output, the system is stable; otherwise, it is unstable.
Back to the topic: why do we say stability "guarantee"? Could we say "improving stability" instead? From the definition above we can conclude that stability describes the behavior of a system. Judging whether a system is stable is like assessing a person's health: it is hard to describe fully in positive terms or to quantify, but it can be judged quickly by negation, that is, by the absence of problems. People stay healthy by reducing the incidence of illness through good diet and living habits. The same holds for guaranteeing or improving system stability: we need a variety of methods to avoid unstable situations. Objectively there is no such thing as "more stable"; the phrase expresses a subjective wish to avoid or reduce the occurrence of instability.
Unlike stability, availability is a quantifiable indicator. Wikipedia describes how it is calculated:
It is based on the time the system is damaged and unusable, plus the time it takes to go from inoperable to operable, compared with the total operating time of the system.
The "three nines" (99.9%) and "four nines" (99.99%) we often hear about are measures of system availability, and high availability means keeping this indicator at a high level. In the definition and the description of the formula, the running time of the system is divided into three parts:
- The time the system runs normally, that is, the time the system is in a stable state.
- The time the system is damaged and unusable, that is, the time the system is in an unstable state.
- The time the system takes to recover from an inoperable to an operable state, that is, the time it takes to return from an unstable state to a stable state.
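To make the "nines" concrete, the following minimal sketch converts a target availability into a downtime budget. The function name and the 365-day year are assumptions made for illustration; they do not come from the original article.

```python
def downtime_budget_hours(availability: float, period_hours: float = 365 * 24) -> float:
    """Return how many hours of downtime a target availability allows.

    `availability` is a fraction such as 0.999 for "three nines".
    `period_hours` defaults to a 365-day year (an assumption of this sketch).
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_hours


if __name__ == "__main__":
    for target in (0.999, 0.9999):
        hours = downtime_budget_hours(target)
        print(f"{target:.2%} availability allows {hours:.2f} h ({hours * 60:.1f} min) per year")
```

Under these assumptions, three nines leaves roughly 8.76 hours of unstable plus recovery time per year, and four nines leaves roughly 52.6 minutes.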
The availability of a system is positively correlated with its stability. In reality, however, a system cannot always be in a stable state. Thinking in reverse and rewriting the formula in terms of the time the system is not stable is more useful for our analysis.
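One way to write the converted formula, consistent with the three-part split of running time described above, is sketched below. The symbol names are introduced here for illustration and are an assumption, not the author's notation: $T_{\text{unstable}}$ is the time the system is damaged and unusable, $T_{\text{recovery}}$ is the time spent recovering, and $T_{\text{total}}$ is the total operating time.

```latex
\text{Availability} = 1 - \frac{T_{\text{unstable}} + T_{\text{recovery}}}{T_{\text{total}}}
```

Read this way, availability rises when instability occurs less often (a smaller $T_{\text{unstable}}$) and when recovery is faster (a smaller $T_{\text{recovery}}$).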

With this, the goal and the KPI of this topic are clear. The goal of guaranteeing stability and high availability is to keep the system in a stable working state, with no negative impact on users, and to avoid production problems and P1-level faults. The core KPI is the availability of the system. To improve availability, we should first ensure the stability of the system and reduce the occurrence of unstable conditions. Second, when a component failure puts the system into an unstable state, we must be able to detect it quickly and restore the system to a stable, available state.
02 The core idea of stability and high availability guarantee

From the reasoning above, and with the goal of improving system availability, we get two basic lines of attack. First, to solve a problem we must identify and define it. So, to improve the stability of the system, let's first list the common kinds of instability in application systems, and then prescribe a remedy for each:
- Function: the function performed by the application produces an error and does not meet expectations.
- Capacity: when the number of requests received by the system increases, the application cannot handle them, and exceptions or timeouts cause the service to fail.
- Security: when the system receives unauthorized or malicious requests, the application throws exceptions or the service even fails.
- Fault tolerance: the application cannot correctly handle incorrect use by users.
When any of these occurs, the system is in an unstable state, and we need to be able to detect and handle it in time. In a software system, the causes of these problems fall into three categories:
- Human failure: problems caused by insufficient thought at any stage of software development, or by careless execution.
- Hardware failure: network outages, insufficient disk space, memory failures, and so on.
- Software failure: thread pool exceptions, JVM exceptions, and exceptions in middleware or other dependent application services.
For a dynamically evolving system, there is no way to reduce the probability of failure to 0. We can only establish process specifications and mechanisms in the software production process to minimize failures. Second, for a running system, we need to establish and improve monitoring and early-warning mechanisms to detect faults in time, and to restore the system quickly by executing contingency plans. Based on this conclusion, improving the availability of the system requires work in three areas: fault prevention, fault discovery, and fault recovery.

People are far more likely to make mistakes than machines. The most important part of fault prevention is therefore to establish a set of mechanisms, reach consensus within the team, and consistently carry out R&D work according to this process, so as to reduce the impact of individual factors (thinking, execution, personal state, and so on) on system stability. Fault discovery and fault recovery, in turn, rely on system monitoring and emergency plans to detect and recover from system anomalies quickly, minimizing the impact of a fault. Taking our daily product development process as an example, and starting from the four core elements of function, capacity, security, and fault tolerance, here is one possible scheme, for reference only.

01 R&D specifications
- Design phase
  - Team breakdown document template
  - High availability design specification
- Coding phase
  - Code specifications
    - General code specification
    - Engineering structure specification
  - Unit tests
    - Unit test pass rate
    - Code coverage
  - Log specification
  - Security vulnerability repair specification
- Release phase
  - Change specification: the "three axes" (SanBanFu)
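The unit test pass rate and code coverage items above can be enforced mechanically before a release. The following is a minimal sketch of such a gate. The class name, field names, and both thresholds (100% pass rate, 80% line coverage) are hypothetical values chosen for illustration, not figures from the original article.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UnitTestReport:
    """Hypothetical summary of one unit test run."""
    total: int          # number of unit tests executed
    passed: int         # number of unit tests that passed
    lines_total: int    # executable lines in the code under test
    lines_covered: int  # lines exercised by the tests


def release_gate(report: UnitTestReport,
                 min_pass_rate: float = 1.0,
                 min_coverage: float = 0.8) -> List[str]:
    """Return the violated rules; an empty list means the change may proceed."""
    violations = []
    # Treat "no tests" or "no measured lines" as a rate of 0 rather than dividing by zero.
    pass_rate = report.passed / report.total if report.total else 0.0
    coverage = report.lines_covered / report.lines_total if report.lines_total else 0.0
    if pass_rate < min_pass_rate:
        violations.append(f"unit test pass rate {pass_rate:.1%} is below {min_pass_rate:.1%}")
    if coverage < min_coverage:
        violations.append(f"code coverage {coverage:.1%} is below {min_coverage:.1%}")
    return violations
```

A team would normally agree on the thresholds as part of its code specification; the point of the gate is that the check no longer depends on any individual remembering to run it.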
02 Capacity guarantee
- Capacity assessment
  - Machine capacity
  - DB capacity
  - Cache capacity
- Stress testing
- Rate limiting scheme
- Degradation plan
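Rate limiting and degradation usually work together: when requests exceed the assessed capacity, the excess is served by a cheaper fallback instead of overloading the service. The sketch below uses a token bucket, which is one common rate limiting technique and is swapped in here as an assumption; the original article does not name a specific algorithm. All class, function, and parameter names are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TokenBucket:
    """Token-bucket limiter: refills `rate` tokens per second, up to `capacity`.

    Not thread-safe; a real service would guard `allow` with a lock.
    """

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock          # injectable so tests can control time
        self._tokens = capacity      # start with a full bucket
        self._last = clock()

    def allow(self) -> bool:
        """Consume one token if available and report whether the request may proceed."""
        now = self._clock()
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


def handle(limiter: TokenBucket, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Serve the full response while under the limit; otherwise degrade to the fallback."""
    if limiter.allow():
        return primary()
    return fallback()
```

The `rate` and `capacity` values would come from the capacity assessment and stress test results, and the fallback (for example a cached or simplified response) is what the degradation plan defines.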
03 Monitoring and alerting
- Log specification
- Monitoring review
  - Basic application monitoring
  - Gateway monitoring
  - Service monitoring
  - Business monitoring
  - Rate limiting monitoring
- Alert specification
- Data verification
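An alert specification ultimately boils down to rules such as "fire when the error rate over recent requests exceeds a threshold". The following minimal sketch shows one such rule over a sliding window of request outcomes. The class name, window size, threshold, and minimum sample count are all hypothetical and would be tuned per service.

```python
from collections import deque


class ErrorRateAlarm:
    """Fire when the error rate over the last `window` requests exceeds `threshold`."""

    def __init__(self, window: int = 100, threshold: float = 0.05,
                 min_samples: int = 20) -> None:
        self.threshold = threshold
        self.min_samples = min_samples
        # Only the most recent `window` outcomes are kept; True means the request failed.
        self._outcomes = deque(maxlen=window)

    def record(self, failed: bool) -> bool:
        """Record one request outcome and return True if the alarm should fire."""
        self._outcomes.append(failed)
        if len(self._outcomes) < self.min_samples:
            return False  # too few samples to judge; avoids noisy alerts at startup
        error_rate = sum(self._outcomes) / len(self._outcomes)
        return error_rate > self.threshold
```

The minimum sample count is a design choice: without it, a single failed request right after a restart would produce a 100% error rate and a false alarm.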
04 Rapid emergency response
- Routine contingency plans
  - Hardware exception plan
  - Middleware exception plan
  - Business exception plan
- Major promotion plan
- Plan execution specification
03 Summary
How to guarantee stability and high availability is a huge topic, and a large number of articles can be found on the intranet for any small part of it. The purpose of this article is to summarize my understanding of stability and high availability guarantee and to share a systematic framework of ideas. I hope that after reading it you will have a more comprehensive understanding of production safety; the details are not covered here.
