当前位置：网站首页>Working ideas of stability and high availability guarantee

Working ideas of stability and high availability guarantee

2022-07-26 03:59:00 【Young】

01 Deep understanding of stability and high availability

Stability and high availability are two old words . With experience and feeling, we know , Improve these two indicators of the system , The system will be healthier , The product will also have a better user experience . But if you want to define stability and high availability, how to express it ？ What is the difference and connection between stability and high availability ？ I think we should first understand these two problems , To set clear goals , Systematically formulate a complete and feasible scheme .

Search Wikipedia for stability , The definition is as follows ：

Stability is a mathematical or engineering term , To determine whether a system produces a bounded output at a bounded input . if , Call the system stable ; If no , The system is called unstable .

Let's look at high availability ：

High availability （ English ：high availability, Abbreviation for HA）,IT The term , The ability of a system to perform its functions without interruption , Represents the availability of the system . It's one of the criteria for system design . High availability systems can run longer than the components that make up the system .

First, extract the key words from the definition of stability – System 、 Input 、 Output . In our current technical framework , You can think of an application as a system , Service requests between applications are input , The service response is output , When the service response meets the expectation, the application system is considered to be stable . When they combine with each other to form a larger system , When expressed to users as business products , User's request as input , Product expression as output , When the product function runs normally, it can be considered that the product system is stable . Sum up , The definition of stability can be summarized as – When the system receives input , Can produce the right 、 The expected output , Call the system stable ; otherwise , Call the system unstable .

Back to the proposition , Why stability guarantee ？ Can you put it another way to improve stability ？ From the above definition, we can conclude that , Stability describes the behavior of a system . Whether a system is stable , Just as we evaluate a person's health , It is difficult to describe completely in a declarative way , To quantify . But it can be judged quickly by negative way . People reduce the incidence of diseases through good diet and living habits , Keep fit . The same is true for ensuring the stability of the system or improving the stability of the system , We need various methods to avoid those unstable situations . The so-called more stable , Objectively, there is no , It is a subjective desire to avoid or reduce the occurrence of instability .

Unlike stability , Usability is a quantifiable indicator , The formula of calculation is described in Wikipedia ：

According to the system damage 、 Time that can't be used , And the time from non operational to operational , Compared with the total operation time of the system .

We often hear 3 individual 9（99.9%）,4 individual 9（99.99%） The measure is the availability of the system , High availability is to ensure that this index of the system is maintained at a high level . In the definition and description of the formula , The running time of the system is divided into three parts

The normal operation time of the system , That is, the time when the system is in a stable state .
System damage 、 Time that can't be used , That is, the time when the system is in an unstable state .
The time when the system recovers from inoperable to operable state , That is, the time for the system to recover from unstable state to stable state .

The availability of the system is positively correlated with the stability of the system . But in real life , The system cannot always be in a stable state . Reverse thinking , Convert the above formula , More conducive to our analysis ：

thus , The goal of this proposition ,KPI It's clear . The goal of ensuring the stability and high availability of the system is to keep the system in a stable working state , No negative impact on users , Avoid online problems and P The occurrence of level 1 fault . The core kpi Is the availability of the system . In order to improve the availability of the system , We should first ensure the stability of the system , Reduce the occurrence of unstable conditions , Secondly, when the system fails due to various components , When an unstable state occurs , Be able to quickly discover and restore it to a stable and available state .

02 The core idea of stability and high availability guarantee

Through the deduction above , Aiming at the goal of improving system availability , We can get two basic ideas for solving problems . Take a , To solve the problem , The first task is to identify and define problems . Therefore, in order to improve the stability of the system , Let's first list the common unstable situations in application systems , Another remedy to the case ：

function ： An error occurred in the function performed by the application , Fall short of expectations .
Capacity ： When the number of requests received by the system increases , The application cannot handle , An exception or timeout occurred , Cause service failure .
Security ： When the system receives an unauthorized or malicious attack request , Application exceptions or even service failures .
Fault tolerance ： For the wrong use of users , The application cannot properly handle .

When this happens , It means that the system is in an unstable state , We need to be able to find and deal with it in time . And the reasons for these problems , In software system, it can be divided into the following three categories ：

Human failure ： Inadequate thinking in all aspects of software development , Or various problems caused by careless execution .
Hardware failure ： The Internet is not working , There's not enough space on the hard disk , Memory crash, etc .
Software failure ： Thread pool exception ,JVM abnormal , Middleware or other dependent application services are abnormal .

For a dynamically evolving system , There is no way to reduce the probability of failure to 0, Only in the process of software production , Establish process specifications and mechanisms to minimize their occurrence . Secondly, for a running system , We need to establish and improve the monitoring and early warning mechanism to find the faults in the system in time , And make the system recover quickly through the implementation of the plan . Based on the above conclusion , In order to improve the availability of the system , We need to start from the following three aspects ： Failure prevention , Fault discovery and recovery .

People are far more likely to make mistakes than machines , Therefore, the most important thing for fault prevention is to establish a set of mechanism , Reach a consensus within the team and continue to carry out R & D work in accordance with this process , So as to reduce personal factors （ reflection 、 perform 、 Status, etc ） Impact on system stability . And fault discovery and fault recovery , It is necessary to quickly find and recover system abnormalities through system monitoring and emergency plan , So as to minimize the impact of the fault . Let's take our daily product development process as an example , Slave function 、 Capacity 、 Security 、 Fault tolerant this 4 Starting from three core elements , A set of scheme is given for reference only .

01 R & D specifications

design phase
- Team breakdown document template
- High availability design specification
Encoding phase
- Code specification
  - General code specification
  - Specification for engineering structures
- The coverage rate of the test sheet is
  - Single test pass rate
  - Code coverage
- Log specification
  - Security vulnerability repair specification
Release stage
- Change specifications ： SanBanFu

02 Capacity guarantee

Capacity assessment
- Machine capacity
- DB Capacity
- Cache capacity
Pressure measurement
Current limiting scheme
Downgrade plan

03 Monitoring alarm

Log specification
Monitoring and combing
- Application foundation monitoring
- Gateway monitoring
- Service monitoring
- Business monitoring
- Current limiting monitoring
Alarm specification
Data check

04 Emergency quick reaction

Daily plan
- Hardware exception plan
- Middleware exception plan
- Business exception plan
Big promotion plan
Implementation specification of the plan

03 summary

How to ensure stability and high availability is a huge proposition , A large number of articles can be found on the intranet for any small part of the content . The purpose of writing this article is to summarize my understanding of stability and high availability guarantee , Let's share a set of framework ideas of the system . I hope you can have a more comprehensive understanding of production safety after reading , No details .