Designing Data-Intensive Applications - Application System Overview

2022-07-25 23:34:00 【Adong lazy】

《Designing Data-Intensive Applications》 - Application System Overview

Preface

The application system overview is the purely theoretical part of the book. Although it is simple, reading it made me realize that my own understanding of many terms was quite narrow. The author uses more rigorous explanations to discuss some common problems in software and system design.

Some of the experiments and real-world cases mentioned in the book are quite interesting. And because this is the first chapter, the content is neither difficult nor dry, which makes it an enjoyable read.

Introduction

Modern applications tend to be built from modular, single-purpose components, and the volume of data in modern information systems is growing rapidly. To cope with complex data and frequently changing modules, an application system usually needs the following building blocks.

Database: stores the data.
Cache: remembers the result of expensive operations so they cost less next time, for example CPU caches and disk caches.
Index: enables fast search and filtering of data.
Stream processing: communicates asynchronously with another process.
Batch processing: processes large amounts of accumulated data.

Rethinking data systems

When designing the architecture of a data system, we usually judge an application system by three properties: reliability, scalability, and maintainability.

Reliability

Reliability does not only mean that the system keeps working when something goes wrong. It actually covers more:

The application performs the functions the user expects.
It tolerates bad data and incorrect operation by users.
Its performance is good enough under the expected load and data volume.
It prevents unauthorized access (permission management).

In short, reliability means that the program and the system architecture can be depended on, and that the data is kept safe.

Some common terms closely related to reliability also need stricter definitions:

Fault tolerance: fault tolerance does not mean tolerating every possible error; it means tolerating certain specific kinds of faults that have been anticipated in advance.
Faults and failures: a fault is a component deviating from its specification, from which the system may still recover. A failure means the system as a whole stops working and can no longer provide service.
- Faults are hard to eliminate entirely. In many cases the problem is not unsolvable; rather, the fix itself introduces a new problem, that is, bugs stacked on top of bugs.
- One way to check whether the fault-handling machinery itself works is to trigger faults deliberately. Netflix's Chaos Monkey (literally, a monkey that causes chaos) does exactly this: it detects weaknesses in a system by simulating common faults. It is a relatively niche tool, but an interesting one.

Further reading: for many small teams and projects the Simian Army may not be very relevant, but the idea behind Chaos Monkey is worth learning from and applying. Chaos Monkey mainly covers the following kinds of attack:
1. Exception Assault (throwing exceptions)
2. Kill Assault (killing processes)
3. Latency Assault (injecting delays)
4. Memory Assault (exhausting memory)
The Simian Army project is available at Netflix/SimianArmy · GitHub. If you can access SlideShare, you can also look at these slides: RE: invent: Chaos Monkey.
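The idea of deliberately injecting faults can be tried at a very small scale. The following is a minimal sketch, not Chaos Monkey's actual implementation: `with_chaos`, `ChaosError`, `call_with_retry`, and the rate parameters are all hypothetical names invented for illustration. It wraps a function so that some calls throw an exception or are delayed, then checks that a retrying caller still copes.

```python
import random
import time


class ChaosError(RuntimeError):
    """Raised when the wrapper injects a simulated fault."""


def with_chaos(func, rng, error_rate=0.1, latency_rate=0.1, delay_s=0.05):
    """Wrap func so that calls occasionally fail or slow down.

    error_rate, latency_rate and delay_s are illustrative knobs,
    not settings taken from Chaos Monkey itself.
    """
    def wrapper(*args, **kwargs):
        roll = rng.random()
        if roll < error_rate:
            # Exception assault: fail before calling the real function.
            raise ChaosError("injected fault")
        if roll < error_rate + latency_rate:
            # Latency assault: stall, then continue normally.
            time.sleep(delay_s)
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts=3):
    """A caller that is expected to tolerate the injected faults."""
    for attempt in range(attempts):
        try:
            return func()
        except ChaosError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt


def lookup_price():
    return 42


rng = random.Random(0)  # seeded so that a run is repeatable
flaky_lookup = with_chaos(lookup_price, rng, error_rate=0.3, delay_s=0.001)

succeeded = 0
for _ in range(100):
    try:
        call_with_retry(flaky_lookup)
        succeeded += 1
    except ChaosError:
        pass
print(f"{succeeded}/100 calls succeeded despite injected faults")
```

Passing the random generator in explicitly keeps the experiment repeatable, which matters when a fault-injection run uncovers a problem that needs to be reproduced.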

Hardware faults: hardware faults are usually handled by adding redundant components that can take over when needed, such as RAID disks, backup batteries, and asynchronous standby nodes, in order to stay reliable and highly available. In recent years, software fault tolerance has increasingly become an additional technique: for example, in a multi-node deployment, patches can be applied one node at a time (a rolling upgrade), so the upgrade happens without taking the cluster down.
Software errors: software errors are mostly bugs that stay hidden for a long time without being noticed. They are triggered relatively rarely, but once one surfaces, troubleshooting is very complicated. Guarantees of software reliability can never be fully trusted: even the places where it seems things can "never" go wrong still need monitoring and defensive checks, so that a problem can be detected as soon as it occurs. (This sentence is very important.)
Human error: the more complex the process, the easier it is to make mistakes. A release process handled entirely by one person basically only exists at poorly run companies; every company with a formal process has a release procedure of some degree of strictness. Even so, configuration changes are the step where human error is most likely when a system is updated in production.

The degree of reliability you guarantee determines the cost of development and operations, which is why it deserves the most attention.

Scalability

How to describe performance

The book illustrates scalability with Twitter's solution to the huge fan-out problem. Fan-out arises when a user has a very large number of followers: every time that user publishes new content, the post fans out into a huge number of requests in order to reach every follower, and the system has to support that volume of message delivery.

This is a typical scenario in which a single publisher broadcasts to a huge number of subscribers. For accounts on Twitter with millions of followers, there are two ways to deliver new tweets:

Relational database approach: a new tweet is simply inserted into a global collection of tweets. When a user requests their home timeline, the system looks up everyone they follow, finds those users' tweets, and merges them in chronological order.
Cache approach: a timeline cache is maintained for each user. When someone posts, the new tweet is pushed into the cache of every follower, so reading a timeline is just a cheap read of the cache.

Both have problems. The first puts heavy pressure on reads, because every timeline request repeats the lookup and merge. The second clearly solves that problem, but it does a large amount of work on every post, and much of it is wasted when the poster has a huge number of followers. The final design combines the two: tweets from users with few followers are pushed into the timeline caches when posted (the second approach), while tweets from the small number of users with very many followers are fetched and merged when a timeline is read (the first approach).

Scalability can therefore be understood as finding a balance between different solutions.

To find that balance point, two questions need to be considered:

1. When load increases, how many machines must be added to keep performance unchanged?
2. When load increases but system resources stay the same, how is performance affected?
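The combined fan-out design for Twitter timelines can be sketched in a few lines. This is only an illustration under stated assumptions: in-memory dictionaries stand in for the database and the cache, `FANOUT_LIMIT` is a made-up threshold, all function names are hypothetical, and follower counts are assumed not to change after tweets are posted.

```python
from collections import defaultdict

FANOUT_LIMIT = 2  # hypothetical threshold; a real system would use a far larger value

followers = defaultdict(set)          # author -> users who follow them
following = defaultdict(set)          # user -> authors they follow
tweets_by_author = defaultdict(list)  # author -> [(timestamp, text)]
timeline_cache = defaultdict(list)    # user -> [(timestamp, author, text)]


def follow(user, author):
    followers[author].add(user)
    following[user].add(author)


def post(author, timestamp, text):
    """Store the tweet; push it to follower caches unless the author is too popular."""
    tweets_by_author[author].append((timestamp, text))
    if len(followers[author]) <= FANOUT_LIMIT:
        for user in followers[author]:
            timeline_cache[user].append((timestamp, author, text))


def home_timeline(user):
    """Read the cached entries and merge in tweets from high-follower authors."""
    entries = list(timeline_cache[user])
    for author in following[user]:
        if len(followers[author]) > FANOUT_LIMIT:
            entries.extend((ts, author, text) for ts, text in tweets_by_author[author])
    return sorted(entries, reverse=True)  # newest first


for fan in ("ann", "bob", "cat"):
    follow(fan, "star")   # three followers: above the limit, merged at read time
follow("ann", "dave")     # one follower: below the limit, pushed at write time

post("dave", 1, "hello from dave")
post("star", 2, "hello from star")

print(home_timeline("ann"))
```

Posting as `star` touches no caches at all, while posting as `dave` writes one cache entry; the read path pays a small merge cost only for users who follow a high-follower account.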

What is the difference between latency and response time? Response time is what the client sees: the time from the moment the request is sent to the moment the response comes back, so it includes network overhead and any time spent queueing in addition to the processing itself. Latency is the time a request spends waiting to be handled. For example, the total time from clicking the upload button to seeing the result is the response time, while the latency is how long the upload request waits before the server starts working on it.

So how should performance be measured? The average response time is commonly used as a reference, but an average does not actually tell you what users experienced.

The conclusion is to use percentiles: sort the response times, then look at the median (p50) together with the higher percentiles such as p95 and p99, so that performance is judged by the distribution of response times that users actually see.
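To make the difference between the average and the percentiles concrete, here is a small sketch. The response times are made-up sample data, and `percentile` is a hand-written nearest-rank helper rather than a library function.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up response times in milliseconds: mostly fast, two slow outliers.
response_ms = [20, 22, 25, 21, 23, 24, 26, 22, 900, 1500]

print("mean:", statistics.mean(response_ms))
print("p50 :", percentile(response_ms, 50))
print("p95 :", percentile(response_ms, 95))
print("p99 :", percentile(response_ms, 99))
```

With this sample the mean is 258.3 ms, which describes none of the requests well: eight of the ten finished in 26 ms or less, and the median is 23 ms. The p95 and p99 values (both 1500 ms here) expose the slow outliers that the median hides.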

The book also cites Amazon as an example of how response time affects sales: when shoppers visit the site, slower responses measurably reduce sales.

When optimizing a request, cutting the completion time from 2-3 s down to 1 s early on is very effective. Below 1 s, however, the situation resembles having 99% satisfied requests and 1% unsatisfied ones: the cost of optimizing that last 1% far exceeds the actual benefit, so at that point we need to change our approach instead of sticking to the old methods.

The goal of scalability tuning is therefore not to optimize to the extreme. The payoff from optimization follows a roughly logarithmic curve, with diminishing returns. If a target cannot be reached, weigh the cost and decide whether it is worth continuing.

When generating load to observe how a system behaves, the load-generating client must keep sending requests concurrently, independent of the response time, rather than blocking on each response; otherwise the test results will be skewed. More generally, before running any test, make sure the test itself is sound and reliable.
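Why a blocking load generator gives misleading numbers can be shown with a tiny simulation. This is a simplified sketch under stated assumptions: a single server handles one request at a time with a fixed service time, times are integers in milliseconds, and the function names `open_loop` and `closed_loop` are chosen for illustration.

```python
def open_loop(n_requests, interval_ms, service_ms):
    """Requests are sent every interval_ms, regardless of earlier responses."""
    server_free_at = 0
    response_times = []
    for i in range(n_requests):
        sent_at = i * interval_ms
        # The request may have to queue behind earlier ones.
        started_at = max(sent_at, server_free_at)
        server_free_at = started_at + service_ms
        response_times.append(server_free_at - sent_at)
    return response_times


def closed_loop(n_requests, interval_ms, service_ms):
    """The client blocks: it sends the next request only after the previous
    response has arrived, and no sooner than interval_ms after the last send."""
    response_times = []
    sent_at = 0
    for _ in range(n_requests):
        finished_at = sent_at + service_ms  # the server is always idle on arrival
        response_times.append(finished_at - sent_at)
        sent_at = max(sent_at + interval_ms, finished_at)
    return response_times


# The server needs 150 ms per request, but requests are meant to arrive every 100 ms.
print("open loop  :", open_loop(5, 100, 150))
print("closed loop:", closed_loop(5, 100, 150))
```

In this overloaded setup the open-loop run reports response times of 150, 200, 250, 300, and 350 ms as the queue builds up, while the blocking client reports a flat 150 ms every time, because it silently slows its own request rate and never creates the queue that real users would.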

Coping with increased load

How do we cope with growing load? The two options discussed most often are vertical scaling and horizontal scaling. Vertical scaling means upgrading the existing machine to a more powerful configuration; horizontal scaling means deploying more machines to share the load.

It is commonly assumed that many machines of average performance beat a few powerful ones. In fact, with a sound architecture, a few high-performance servers can match the effect of many ordinary ones, and horizontal scaling itself has limits beyond a certain point.

Both vertical and horizontal scaling need to distinguish between scaling stateful nodes and scaling stateless nodes.

For stateful nodes, the more common approach is to serve requests from a single high-performance machine (note that "service" here refers only to the application service), and to consider horizontal scaling only once a single node can no longer cope.

Stateless nodes lend themselves to horizontal scaling, so they are usually deployed across multiple machines, or with primary and standby instances for disaster recovery.

Future applications are likely to be built on distributed architectures, and modern distributed programming interfaces and frameworks keep improving.

The last point is that systems with the same throughput can have completely different architectures depending on the business scenario. A scalable architecture usually means components that are independent of each other and individually scalable, much like the layers of the TCP/IP model.

But once the business architecture has been established, the cost of changing it later keeps rising.

Maintainability

Maintainability covers operability, simplicity, and evolvability. Operability means the operations team can keep the system stable in day-to-day work. Simplicity means meeting requirements with the simplest possible logic, while also making the system easy for operators to use, that is, complete in function and coherent as a whole. Evolvability means the system stays easy to change as requirements change.

Summary

All three properties ultimately point in one direction: making the system easier for operations staff to maintain. No matter how heavily a system is customized, in the end it is the operations staff who carry out the maintenance.

In addition, to keep a system simple we have to introduce abstraction. High-level languages are themselves abstractions that hide the complexity of CPU registers, assembly code, and system calls.

Working with larger systems demands more abstract thinking. Modern practice offers agile development, test-driven development, and refactoring for this purpose. In the domestic environment, the two development methodologies tend to be misused, so refactoring deserves more of our attention.

Details about refactoring can be found in the book 《Refactoring》.

Closing thoughts

The first chapter discusses the theoretical concepts of reliability, scalability, and maintainability, and surveys the challenges that application systems face today. Looking ahead, as the code and technology behind distributed architectures mature, even small projects will be able to go distributed, and that may happen in the near future.

Related topic

System speedup: Amdahl's law

Core idea: a program is assumed to consist of two parts, one that cannot be parallelized and one that can.

Explanation:

Suppose a program processes files on disk: it scans a directory, builds a list of the files in memory, and then processes each file. Scanning the directory and building the file list cannot be parallelized, but processing the files can be done in parallel.

Based on this description, we can define the following variables:

T = total time of serial execution
B = total time of the part that cannot be parallelized
T - B = total time of the part that can be parallelized

Of these, T - B is the only portion that can actually be shortened by adding CPUs or threads. When the parallel part is executed by N CPUs or threads, the total time becomes:

T(N) = B + (T - B) / N (where N is the number of CPUs or threads)

There is no need to memorize complicated formulas here. The point of Amdahl's law is that when we optimize a program, the improvement is often smaller than we expect. Hardware, software, device I/O, and memory can each affect performance, and optimizing a single one of them may have little effect, so optimization has to look at several aspects together according to the actual situation.
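Plugging numbers into T(N) = B + (T - B) / N shows how quickly the gains flatten out. The values below are made up for illustration (a 100-second serial run of which 20 seconds cannot be parallelized), and the function names are hypothetical.

```python
def total_time(t, b, n):
    """T(N) = B + (T - B) / N"""
    return b + (t - b) / n


def speedup(t, b, n):
    """How many times faster the run is with n CPUs than with one."""
    return t / total_time(t, b, n)


T, B = 100.0, 20.0  # made-up numbers: 100 s serial run, 20 s of it not parallelizable

for n in (1, 2, 4, 8, 1000):
    print(n, total_time(T, B, n), round(speedup(T, B, n), 2))
```

Going from 1 to 4 CPUs cuts the run from 100 s to 40 s, a speedup of 2.5. Even with 1000 CPUs the run still takes just over 20 s, because the speedup can never exceed T / B = 5: the non-parallelizable part sets the ceiling.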
