
In-depth interpretation: OpenChaos, the ballast stone of resilient architecture for distributed systems

2022-06-11 14:48:00 【Deep learning and python】

Authors | Siying, Jiahao, Mahai

Key Takeaways
1. This article first introduces the concept of chaos engineering against the backdrop of the complexity and stability demands of today's distributed systems, and explains how OpenChaos optimizes and innovates on traditional chaos engineering.
2. The second part describes the architecture of OpenChaos, explains in detail how its reliability model and elasticity model work, and uses two practical cases to show what OpenChaos can achieve in real application scenarios.
3. The last part looks ahead and outlines the future development direction of OpenChaos.

Background

With the emergence of Serverless, microservices (including service mesh), and a growing number of container-based applications, the way we build, deliver, and operate systems has become increasingly complex. This complexity makes the state of a system harder to observe. In existing production environments we have several ways of obtaining information to improve observability. At the simplest level, a specific condition produces a specific metric output. Beyond that, we use structured and correlated logs or distributed tracing, and introduce event buses such as Apache EventMesh and EventGrid. As applications combined with codeless development grow rapidly, the Serverless design philosophy is also gradually being accepted by distributed system designers. Zero operations, pay-as-you-go pricing, extreme elasticity, and multi-tenant pooled sharing are all forcing us to re-examine whether old architectures still make sense, and are driving the continuous evolution of new ones. "Converged architecture" has been one of the most frequently mentioned terms in recent years: the complex architectures that used to support online and offline systems separately are being merged, and adapt to different business scenarios through designs and deployments that can be split apart or combined. Against this background, we began to ask whether there is a more modern tool that can help discover and address the tough architectural challenges, such as reliability, security, and elasticity, that a distributed cloud foundation poses for architecture design and for the applications built on top of it.

The ideas of chaos engineering gave us some inspiration. Netflix originally created Chaos Monkey while migrating its infrastructure to the cloud, which raised the curtain on chaos engineering. Later, the CNCF set up a dedicated interest group in the hope of promoting standards in this field. The OpenChaos founding team also held several rounds of discussion with these community pioneers early on. Unfortunately, after the interest group was merged into the App Delivery SIG in 2019, there was little further activity. In recent years, under strong guidance from national policy, the digital upgrading of enterprises has been accelerating, and more and more CIOs, CTOs, and even CEOs have begun to pay attention to and invest in chaos engineering practice. The Chaos Engineering Laboratory led by the China Academy of Information and Communications Technology is also actively promoting standards in this field, from full-link stress testing and chaos fault injection to the multi-cloud, multi-active reference architectures expected to drive future architectural change. All of this points to rapid growth in the industry. According to research and statistics from domestic and international technology media, 80% of organizations will have adopted chaos engineering practices by 2025. By combining full-link stress testing, chaos fault injection, and strategies such as multi-cloud and multi-active architecture, unplanned downtime can reportedly be reduced by 50% and second-level RTO/RPO achieved, so that teams can focus more on application and business innovation.

A good remedy, but one with limitations

The basic process of chaos engineering is to run automated experiments at small scale in the production environment, injecting faults into the system at random in order to observe the "system boundary". It focuses mainly on fault tolerance and reliability. Most chaos engineering tools on the market today tend to construct fault types through black-box random injection, with little understanding of or insight into the underlying infrastructure (hardware, operating systems, databases, and middleware). They lack a unified framework standard and mature, specific metrics. Their analysis and feedback are also weak: they cannot give comprehensive, thorough diagnostic recommendations. In particular, capabilities such as reinforcement learning and generative AI could go further in addressing the shortcomings of today's random fault injection, and support self-healing risk analysis and optimization suggestions.

For distributed systems with more complex characteristics, merely observing how the system copes with faults is limiting. Relying on observation alone is highly subjective, makes it hard to form a unified evaluation standard, and makes it hard to trace observed behavior back to system defects. Observing a system well requires not only models with full coverage but also a complete monitoring system and a comprehensive result report, or even intelligent prediction, to guide the architecture toward greater resilience. Feng Jia, a senior technical expert in distributed systems and the original founder of the top open source projects Apache RocketMQ and OpenMessaging, once said: "The evolution of cloud-native distributed architecture is moving further toward composable architecture and resilient architecture." Against this background, he proposed the emerging OpenChaos project and led the team that created it.

The essential problem OpenChaos aims to solve

A resilient architecture covers properties such as high reliability, security, elasticity, and immutable infrastructure. Achieving a truly resilient architecture is without doubt the direction in which modern distributed systems are evolving. OpenChaos borrows the ideas of chaos engineering and extends their definition to the resilience of distributed systems. It sets up dedicated detection models for attributes that are specific to particular kinds of distributed system, for example: the delivery semantics and push efficiency of Pub/Sub systems; the scheduling accuracy, elastic scaling, and cold-start efficiency of scheduling systems; the stream and batch real-time performance and back-pressure efficiency of streaming systems; the recall and precision of retrieval systems; and the consistency of consensus components in distributed systems. OpenChaos has built-in support for extensible models, so that resilience can be verified under large-scale request loads and a variety of fault shocks, and suggestions can be provided for further optimizing the architecture.

Architecture and case analysis

Figure 1

The overall architecture

OpenChaos works as follows. A control node drives the whole process: it is responsible for forming the cluster nodes into the distributed cluster under test, finding and loading the Driver component that matches the distributed infrastructure under test, and creating clients according to the configured number of concurrent clients. The control node then directs the clients to operate on the cluster, following the execution flow defined by the Model component. During the drill, the Detection Model introduces events on the cluster nodes according to the characteristic being observed, while the Metrics module monitors how the cluster under test performs. After the drill, the Checker component automatically analyzes the business and non-business data from the experiment, produces the test results, and outputs visual charts.

As shown in Figure 1, the overall architecture of OpenChaos can be divided into a management layer, an execution layer, and a layer of components under test.

The top layer is the management layer, which contains the user interface and the controller (Control). The controller is responsible for scheduling the components of the execution layer. The bottom layer is the component under test, which can be a self-deployed distributed system cluster or a cloud-hosted distributed system.

The middle layer is the execution layer, and it is where the power of OpenChaos comes from. The model (Model) is the basic unit of the executed flow and defines the basic form of the operations performed on the distributed system. Within the model, the controller loads the driver (Driver) for the distributed system under test, creates clients (Client) according to the configured concurrency, and finally uses those clients to perform operations on the distributed system. The detection model (Detection Model) introduces events that correspond to the observation characteristics the user cares about, such as injecting faults or scaling the system out and in. The Metrics module monitors the performance of the cluster under test during the experiment. After the drill, the measurement model (Measurement Model) component automatically analyzes the business and non-business data from the experiment, produces the test results, and outputs visual charts.
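The flow described above (load a driver, create clients, run operations while events are injected, record the outcome) can be sketched roughly as follows. This is only an illustration in Python: every class and function name here (`InMemoryDriver`, `InMemoryClient`, `Controller`, `inject_fault`, `recover`) is hypothetical and is not taken from the OpenChaos code base, and the "cluster" is a plain in-memory dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


class InMemoryDriver:
    """Hypothetical driver standing in for a real system under test."""

    def __init__(self) -> None:
        self.store: Dict[str, str] = {}
        self.available = True

    def create_client(self) -> "InMemoryClient":
        return InMemoryClient(self)


class InMemoryClient:
    """Hypothetical client that performs operations through the driver."""

    def __init__(self, driver: InMemoryDriver) -> None:
        self.driver = driver

    def put(self, key: str, value: str) -> bool:
        # Returns False while the simulated cluster is unavailable.
        if not self.driver.available:
            return False
        self.driver.store[key] = value
        return True


@dataclass
class Controller:
    driver: InMemoryDriver
    concurrency: int
    # (step, client index, succeeded) tuples play the role of the metrics log.
    records: List[Tuple[int, int, bool]] = field(default_factory=list)

    def run(self, steps: int,
            events: Dict[int, Callable[[InMemoryDriver], None]]) -> None:
        clients = [self.driver.create_client() for _ in range(self.concurrency)]
        for step in range(steps):
            # Detection model: apply the event scheduled for this step, if any.
            if step in events:
                events[step](self.driver)
            # Model: every client performs one operation per step.
            for idx, client in enumerate(clients):
                ok = client.put(f"k-{step}-{idx}", "v")
                self.records.append((step, idx, ok))


def inject_fault(driver: InMemoryDriver) -> None:
    driver.available = False


def recover(driver: InMemoryDriver) -> None:
    driver.available = True


controller = Controller(driver=InMemoryDriver(), concurrency=2)
controller.run(steps=6, events={2: inject_fault, 4: recover})

# Measurement step: which steps saw failed operations?
failed_steps = sorted({step for step, _, ok in controller.records if not ok})
print(failed_steps)  # [2, 3]
```

Keeping the event schedule separate from the operation loop mirrors the split between the model and the detection model: the same workload can be replayed under a different set of events without touching the client code.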

Detection model and measurement model

Detection model

Traditional chaos engineering focuses mainly on system stability, and its usual implementation is to simulate common, generic faults through black-box fault injection. The detection model in OpenChaos focuses on a higher-dimensional attribute: resilience. This includes reliability, and also detection models for characteristics such as elasticity and security. Compared with traditional chaos engineering, OpenChaos not only supports generic black-box fault injection but can also run customized, deliberately adversarial detection against distributed infrastructure software such as messaging or caching systems, targeting problems such as primary/standby switchover and the split-brain caused by network partitions, to observe how these systems behave in such situations.

Measurement model

Because distributed systems are complex, observing their resilience calls for a simple, intuitive analysis report that makes it easier to spot possible defects and weaknesses. The measurement model analyzes how the system performed and outputs results and charts computed in a unified, standardized way, so that users can compare them directly.

Take the stability evaluation of a messaging system as an example. Based on the faults injected and the system's behavior during the experiment, the measurement model calculates the system's RPO (Recovery Point Objective) and RTO (Recovery Time Objective). It also reports the delivery semantics of the cluster, that is, whether they conform to at least once or exactly once; the fault recovery behavior, that is, whether the system became unavailable during a fault and how long it took to recover; whether the expected partition ordering held under faults; and the system's response time over the whole experiment.
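As a rough illustration of the kind of calculation described above, the sketch below derives unavailability windows, an observed recovery time, and a list of lost messages from a client-side operation log. The record layout and the function names are assumptions made for this example, not the OpenChaos implementation. Counting acknowledged-but-never-consumed messages is used here as a simple proxy for RPO, and the sample log is made up.

```python
from typing import List, Optional, Tuple

# Each record: (timestamp in seconds, message id, acknowledged by the cluster).
Record = Tuple[float, int, bool]


def unavailable_windows(records: List[Record]) -> List[Tuple[float, float]]:
    """Return (start, end) spans from the first failed request to the next success."""
    windows: List[Tuple[float, float]] = []
    start: Optional[float] = None
    for ts, _, ok in sorted(records):
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            windows.append((start, ts))
            start = None
    if start is not None:
        windows.append((start, float("inf")))  # the run ended without recovery
    return windows


def observed_rto(records: List[Record]) -> float:
    """Longest observed unavailability window, in seconds."""
    windows = unavailable_windows(records)
    return max((end - start for start, end in windows), default=0.0)


def lost_messages(records: List[Record], consumed_ids: List[int]) -> List[int]:
    """Messages the cluster acknowledged but that were never consumed."""
    acked = {mid for _, mid, ok in records if ok}
    return sorted(acked - set(consumed_ids))


# Illustrative log: requests fail from t=2 s until the first success at t=9 s.
log = [(0.0, 1, True), (1.0, 2, True), (2.0, 3, False),
       (3.0, 4, False), (9.0, 5, True), (10.0, 6, True)]

windows = unavailable_windows(log)
rto = observed_rto(log)
lost = lost_messages(log, consumed_ids=[1, 2, 5, 6])
print(windows, rto, lost)  # [(2.0, 9.0)] 7.0 []
```

A window that never closes is reported with an infinite end time, which keeps "did not recover" distinguishable from "recovered slowly" in the result.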

Reliability case analysis

We used OpenChaos to test the reliability of an ETCD cluster and found that when the primary node loses its network connection and ends up in a partition by itself, the cluster, as seen from the client, lacks the ability to recover automatically.

Figure 2

The following is an example of experimental results obtained with OpenChaos. In a 3-node ETCD cluster, the primary node is disconnected from the network of the secondary nodes and becomes a partition on its own. The simulated traffic rate is 1000 tps.

Figure 3

The figure shows that the experiment lasted 10 minutes, during which ten network partition faults were injected on the primary node at 30-second intervals, and that the cluster behaved inconsistently from one fault to the next. The following figure shows the experimental results in more detail.

During the 1st, 3rd, 6th, and 8th faults, the cluster could not recover on its own. During the other faults, it took 7 seconds for the cluster to return to an available state. No data was lost at any point in the experiment.

Figure 4
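The injection pattern in this experiment (ten primary-node partitions over a 10-minute run, 30 seconds apart) can be expressed as a simple schedule. The article does not state how long each fault lasts, so the sketch below assumes 30-second faults separated by 30-second healthy intervals, which happens to fit ten faults into 600 seconds. The function name and parameters are hypothetical.

```python
from typing import List, Tuple


def fault_schedule(total_s: int, fault_s: int, interval_s: int) -> List[Tuple[int, int]]:
    """Return (inject_at, recover_at) pairs in seconds from the start of the run."""
    schedule: List[Tuple[int, int]] = []
    t = interval_s  # assumption: the run starts with one healthy interval
    while t + fault_s <= total_s:
        schedule.append((t, t + fault_s))
        t += fault_s + interval_s
    return schedule


# Assumed values: 30 s faults, 30 s apart, over a 600 s run.
schedule = fault_schedule(total_s=600, fault_s=30, interval_s=30)
print(len(schedule), schedule[0], schedule[-1])  # 10 (30, 60) (570, 600)
```

A fixed schedule like this makes runs repeatable, which matters when the results of individual faults differ as much as they do here.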

The experiment's process log shows that every time the primary node was partitioned, the cluster did manage to transfer the primary role during the fault. Analysis of the source code shows that when the ETCD client encounters an ETCD internal error, it does not retry by connecting to other nodes. As a result, when the node the client prefers to connect to is the primary node and that node becomes unavailable, the business remains unrecovered even though the primary role has been transferred successfully and the cluster as a whole is available again. We have reported this issue to the ETCD community and are waiting for a fix.
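The missing behavior described above, retrying against other nodes when the preferred one fails, can be sketched in a few lines. This is a simulation only: `FakeCluster`, `FailoverClient`, and `EndpointUnavailable` are hypothetical names invented for this example, and nothing here reflects the real ETCD client API or the eventual fix.

```python
from typing import Dict, List, Optional


class EndpointUnavailable(Exception):
    """Raised by the simulated transport when a node cannot serve requests."""


class FakeCluster:
    """Hypothetical three-node cluster; only reachable nodes answer."""

    def __init__(self, nodes: List[str]) -> None:
        self.reachable: Dict[str, bool] = {n: True for n in nodes}
        self.data: Dict[str, str] = {}

    def put(self, node: str, key: str, value: str) -> None:
        if not self.reachable[node]:
            raise EndpointUnavailable(node)
        self.data[key] = value


class FailoverClient:
    """Client that falls back to the next endpoint instead of giving up."""

    def __init__(self, cluster: FakeCluster, endpoints: List[str]) -> None:
        self.cluster = cluster
        self.endpoints = list(endpoints)

    def put(self, key: str, value: str) -> str:
        """Try each endpoint in order and return the one that served the request."""
        last_error: Optional[Exception] = None
        for i, node in enumerate(self.endpoints):
            try:
                self.cluster.put(node, key, value)
            except EndpointUnavailable as err:
                last_error = err
                continue
            # Prefer the node that just worked for the next request.
            self.endpoints.insert(0, self.endpoints.pop(i))
            return node
        raise RuntimeError("no endpoint available") from last_error


cluster = FakeCluster(["n1", "n2", "n3"])
client = FailoverClient(cluster, ["n1", "n2", "n3"])

cluster.reachable["n1"] = False  # the preferred node is partitioned away
served_by = client.put("k", "v")
print(served_by, client.endpoints)  # n2 ['n2', 'n1', 'n3']
```

Without the fallback loop, the first `EndpointUnavailable` would surface to the caller on every request for as long as the preferred node stayed partitioned, which matches the "cluster recovered, business did not" symptom seen in the experiment.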

Elasticity case analysis

Elasticity is another key capability of distributed systems. In addition to reliability, OpenChaos supports measuring and evaluating a system's ability to scale out and in. Unlike reliability, the elasticity of a distributed system cannot be probed by scheduling events at a fixed frequency. Instead, OpenChaos triggers scaling according to thresholds that the user sets on operating system metrics or business metrics. For example, you can specify that the expected average CPU utilization of the cluster is 40%, or that the expected response time of the system is 100 ms. Based on the specified expected value and the current system performance, the elasticity detection model uses an algorithm built into OpenChaos to calculate the target size to scale to, and then triggers the scaling action. After the experiment, the measurement model calculates the cluster's "speedup efficiency" and "scaling cost", along with the performance of the cluster at each scale.
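The article does not describe the built-in algorithm. As a stand-in, the sketch below uses the proportional rule familiar from the Kubernetes Horizontal Pod Autoscaler, where the desired size is the current size multiplied by the ratio of the observed metric to the expected metric, rounded up. The function name, the size bounds, and the sample values are assumptions for illustration only.

```python
import math


def target_size(current_size: int, observed: float, expected: float,
                min_size: int = 1, max_size: int = 100) -> int:
    """Proportional scaling rule: grow or shrink with the observed/expected ratio."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    desired = math.ceil(current_size * observed / expected)
    # Clamp to the allowed range so a noisy metric cannot scale without bound.
    return max(min_size, min(max_size, desired))


# CPU example: 3 nodes at 80% average utilization, expected value 40%.
print(target_size(3, observed=80.0, expected=40.0))   # 6

# Response-time example: 4 nodes at 50 ms, expected value 100 ms.
print(target_size(4, observed=50.0, expected=100.0))  # 2
```

Rounding up rather than to the nearest integer errs on the side of extra capacity, which is the safer default when the metric is above its expected value.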

Note: "speedup efficiency" and "scaling cost" are metrics that OpenChaos uses to measure the elasticity of distributed systems. The former represents how well the distributed system performs when parallelized, and the latter represents the rate at which the system scales.
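The exact formulas behind these two metrics are not given here. One plausible reading, sketched below, takes speedup efficiency in the classical parallel-computing sense (throughput gain divided by the growth in scale) and scaling cost as the time spent per unit of scale change. Both definitions, the function names, and the sample numbers are assumptions for illustration.

```python
def speedup_efficiency(base_throughput: float, base_size: int,
                       throughput: float, size: int) -> float:
    """Throughput gain relative to the growth in scale; 1.0 means linear scaling."""
    if base_throughput <= 0 or base_size <= 0 or size <= 0:
        raise ValueError("throughput and sizes must be positive")
    speedup = throughput / base_throughput
    scale = size / base_size
    return speedup / scale


def change_cost(elapsed_ms: float, from_size: int, to_size: int) -> float:
    """Average time in milliseconds spent per unit added or removed."""
    changed = abs(to_size - from_size)
    if changed == 0:
        raise ValueError("size did not change")
    return elapsed_ms / changed


# Made-up numbers: throughput triples when the cluster grows from 1 to 4 units.
print(speedup_efficiency(1000.0, 1, 3000.0, 4))  # 0.75

# Made-up numbers: growing from 1 to 26 units takes 500 ms in total.
print(change_cost(500.0, 1, 26))                 # 20.0
```

Under this reading, an efficiency below 1.0 means each added unit contributes less than the first one did, and a lower change cost means the system reaches its new size faster.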

Elasticity covers not only the scalability of instance nodes but also the ability of specific business (application) units to scale. To explore best practices for Kafka partitioning, we designed experiments that scale out the partitions of a single topic. During the experiment we also recorded the message send and receive throughput at different partition counts, in order to understand how the number of partitions affects message throughput and which partition count gives the maximum throughput.

Figure 5 shows the tps and latency as the partitions of one topic on a three-node cluster are expanded from 1 to 9000.

Figure 5

Figure 6 shows how each metric changes over the course of the experiment.

Figure 6

Figure 7 is a screenshot of the detailed elasticity evaluation results. It shows the performance of the system at different scales, together with the cost and efficiency of each elastic change. Here, changeCost and resilienceEfficienty are the scaling cost and speedup efficiency results described above.

Figure 7

These results show that, for the Kafka cluster with this experimental specification, adding one partition takes about 20 ms on average. Performance is best when the number of partitions reaches 26: throughput reaches 1.3 million and overall CPU utilization reaches 93%. Once the number of partitions exceeds 450, performance drops significantly. When the number of partitions reaches 1992, throughput falls to 38,000 and overall CPU utilization reaches 97%.

Future plans

OpenChaos currently supports access to most distributed systems, such as Apache Kafka, Apache RocketMQ, DLedger, Redis, and ETCD. With the launch of the Open Source Summer 2022 event [1], we have opened up more work on integrating distributed systems for university students to choose from and take part in.

Meanwhile, Huawei Cloud works closely with the Chaos Engineering Laboratory and helped the China Academy of Information and Communications Technology release China's first "Distributed Message Queue Stability Evaluation Standard", of which it is the main contributor. In addition, the Huawei Cloud middleware messaging product family is the only application service to have fully passed the standard's acceptance criteria.

Looking ahead, OpenChaos will introduce more general resilience standards and intelligent prediction capabilities, so that it can not only evaluate the existing capabilities of an architecture but also make predictions based on observations of the system, and so prevent anomalies that exceed the system's own resilience. Going a step further, we will keep polishing the project and integrate more distributed systems through ecosystem cooperation, striving to make OpenChaos the ballast stone for building resilient distributed system architectures and to promote the continuous evolution of cloud-native architecture. Only then, at critical moments, can one "sit steady in the fishing boat, however the wind and waves may rise".

[1] Open Source Summer 2022 event:

https://summer-ospp.ac.cn/#/org/prodetail/221bf0008

About the authors:

Siying, senior R&D engineer, with deep understanding of and research experience in consistency algorithms for distributed systems, resilient architecture, and pattern recognition.

Jiahao, senior middleware R&D engineer, responsible for the design and development of Huawei Cloud distributed middleware. Specializes in middleware performance optimization and favors a simple, minimalist design philosophy.

Mahai, Huawei Cloud middleware reliability technology expert, specializing in chaos engineering, performance testing, and event-driven architecture design.

