In-depth interpretation: OpenChaos, the ballast stone of resilient distributed-system architecture
2022-06-11 14:48:00 【Deep learning and python】
Authors | Siying, Jiahao, Mahai
Key Takeaways
1. The article first introduces chaos engineering in light of the complexity and stability demands of today's distributed systems, and explains how OpenChaos optimizes and innovates on traditional chaos engineering.
2. The second part introduces the architecture of OpenChaos, explains in detail how its reliability model and elasticity model work, and uses two practical cases to show what OpenChaos can do in real application scenarios.
3. The last part looks ahead and sets out the direction of OpenChaos's future development.
Background
With the rise of Serverless, microservices (including service mesh), and a growing number of container-based applications, the way we build, deliver, and operate systems has become increasingly complex. This complexity makes the state of a system harder to observe. In existing production environments we obtain information in several ways to improve observability. At first this may be as simple as producing a specific metric under a specific condition. Later come structured and correlated logs, distributed tracing, and event buses such as Apache EventMesh and EventGrid. As codeless, composable applications develop rapidly, the Serverless design philosophy has also been gradually accepted by designers of distributed systems. Zero operations, pay-as-you-go pricing, extreme elasticity, multi-tenancy, and resource pooling all force us to re-examine the soundness of older architectures and drive the continuous evolution of new ones. "Converged architecture" has been one of the most frequently mentioned terms in recent years: the complex architectures that used to support online and offline systems separately are being merged, and they adapt to different business scenarios through designs and deployments that can be split apart or combined. Against this background, we began to ask whether a more modern tool could help find and address the tough architectural challenges, such as reliability, security, and elasticity, that a distributed cloud foundation poses for architecture design and for the applications built on it.
The idea of chaos engineering gave us some inspiration. Netflix originally created Chaos Monkey while migrating its infrastructure to the cloud, which opened the prelude to chaos engineering. Later, CNCF set up a dedicated interest group in the hope of promoting standards in this field. In the early days the OpenChaos founding team also held many rounds of discussion with the pioneers of these communities. Unfortunately, after the interest group was merged into the App Delivery SIG in 2019, there was little further activity. In recent years, under strong national policy guidance, the digital upgrading of enterprises has accelerated, and more and more CIOs, CTOs, and even CEOs have begun to pay attention to and invest in chaos engineering practice. The Chaos Engineering Laboratory led by the China Academy of Information and Communications Technology (CAICT) is also actively promoting standards in this field, from full-link stress testing and chaos fault injection to the multi-cloud, multi-active reference architectures that will shape future architectural change. All of this points to the rapid development of the industry. According to research and statistics from technology media in China and abroad, by 2025, 80% of organizations will have adopted chaos engineering practices. By introducing full-link stress testing, chaos fault injection, and multi-cloud, multi-active architectures together, unplanned downtime can be reduced by 50% and true second-level RTO/RPO can be achieved, letting teams focus more on application and business innovation.
A good remedy, but with limitations
The most basic process of chaos engineering is to run automated experiments on a small scale in the production environment, injecting faults into the system at random in order to observe the "system boundary". It focuses mainly on the fault tolerance and reliability of the system. Most chaos engineering tools on the market today tend to construct fault types through black-box random injection, with little understanding of or insight into the underlying infrastructure (hardware, operating systems, databases, and middleware). They lack a unified framework standard and mature, specific metrics. Their analysis and feedback are also weak: they cannot give comprehensive, thorough diagnostic recommendations. Capabilities such as reinforcement learning and generative AI, in particular, could go further by addressing the limitations of today's random fault injection and providing self-healing risk analysis and optimization suggestions.
For distributed systems with more complex characteristics, merely observing how the system copes with faults is limiting. Relying on observation alone is highly subjective, makes it hard to form a unified evaluation standard, and makes it hard to trace observed behavior back to system defects. Observability requires not only full coverage by the models but also a complete monitoring system and a comprehensive result report, or even intelligent prediction, to guide the architecture toward greater resilience. Feng Jia, a senior technical expert in distributed systems and original founder of the top-level open source projects Apache RocketMQ and OpenMessaging, once said: "The evolution of cloud-native distributed architecture is moving further toward composable architecture and resilient architecture." Against this background, he proposed OpenChaos and led the team that created this emerging project.
The essential problem OpenChaos aims to solve
A resilient architecture covers properties such as high reliability, security, elasticity, and immutable infrastructure. Achieving a truly resilient architecture is undoubtedly the direction in which modern distributed systems are evolving. OpenChaos builds on chaos engineering thinking and extends its definition to address the resilience of distributed systems. It sets up dedicated detection models for attributes that are specific to distributed systems, such as the delivery semantics and push efficiency of Pub/Sub systems; the scheduling accuracy, elastic scaling, and cold-start efficiency of scheduling systems; the real-time behavior of unified stream and batch processing and the backpressure efficiency of streaming systems; the recall and precision of retrieval systems; and the consistency of consensus components. With built-in, extensible model support, OpenChaos verifies resilience under large-scale data requests and various fault shocks, and provides suggestions for further optimizing the architecture.
Architecture and case analysis
Figure 1
Overall architecture
OpenChaos works as follows. The control node manages the whole process: it forms the cluster nodes into the distributed cluster under test, finds and loads the Driver component that corresponds to the distributed infrastructure under test, and creates clients according to the configured number of concurrent clients. Following the execution flow defined by the Model component, the control node then directs the clients to operate on the cluster. During the drill, the Detection Model introduces events to the cluster nodes according to the characteristic being observed, and the Metrics module monitors the performance of the cluster under test. After the drill, the Checker component automatically analyzes the business and non-business data from the experiment, derives the test results, and outputs visual charts.
As shown in Figure 1, the overall architecture of OpenChaos can be divided into a management layer, an execution layer, and a tested-component layer.
The top layer is the management layer, which contains the user interface and the controller (Control). The controller is responsible for scheduling the components of the execution (engine) layer. The bottom layer is the component under test, which can be a self-deployed distributed system cluster or a cloud-hosted distributed system.
The middle layer is the execution layer, and it is where OpenChaos gets its power. The model (Model) is the basic unit of an executed process and defines the basic form of operations on a distributed system. Within the model, the controller loads the driver (Driver) of the distributed system under test, creates the configured number of concurrent clients (Client), and finally uses those clients to perform operations on the distributed system. The detection model (Detection Model) introduces events according to the characteristic the user wants to observe, such as injecting faults or scaling the system out and in. The Metrics module monitors the performance of the cluster under test during the experiment. After the drill, the measurement model (Measurement Model) automatically analyzes the business and non-business data from the experiment, derives the test results, and outputs visual charts.
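To make the division of labor between controller, driver, and client more concrete, here is a minimal Python sketch of that flow. It is only an illustration under our own assumptions: the class names `Controller`, `InMemoryDriver`, and `InMemoryClient`, the `run` method, and the `"demo"` system are all hypothetical and are not the OpenChaos API. The driver here records operations in memory instead of talking to a real cluster.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


class InMemoryClient:
    """Hypothetical client: records operations instead of contacting a cluster."""

    def __init__(self, log: List[str]) -> None:
        self.log = log

    def invoke(self, op: str) -> bool:
        self.log.append(op)
        return True  # a real client would report success or failure of the call


class InMemoryDriver:
    """Hypothetical driver: knows how to create clients for one system under test."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def create_client(self) -> InMemoryClient:
        return InMemoryClient(self.log)


@dataclass
class Controller:
    # Registry mapping a system name to a factory that builds its driver.
    drivers: Dict[str, Callable[[], InMemoryDriver]]

    def run(self, system: str, concurrency: int, ops: List[str]) -> Dict[str, int]:
        driver = self.drivers[system]()  # load the driver for the system under test
        clients = [driver.create_client() for _ in range(concurrency)]
        # The "model" is reduced here to a fixed list of operations per client.
        results = [client.invoke(op) for client in clients for op in ops]
        return {"clients": len(clients), "ok": sum(results), "total": len(results)}


controller = Controller(drivers={"demo": InMemoryDriver})
report = controller.run("demo", concurrency=3, ops=["put", "get"])
print(report)
```

In this sketch the driver registry is what lets one controller test different systems: supporting a new system would mean registering another driver factory, while the control flow stays unchanged.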
Detection model and measurement model
Detection model
Traditional chaos engineering focuses mainly on system stability, and is commonly implemented by simulating common, generic faults through black-box fault injection. The detection model in OpenChaos targets a higher-dimensional attribute, resilience, which includes reliability as well as characteristics such as elasticity and security. Compared with traditional chaos engineering, OpenChaos supports not only generic black-box fault injection but also customized, deliberately adversarial detection for distributed infrastructure software such as messaging or caching systems, for example primary/standby failover or the split-brain problems caused by network partitions, in order to observe how the systems behave in these situations.
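For reliability tests, fault events are typically scheduled at a fixed frequency. The sketch below shows one way such a schedule could be generated. It is an assumption-laden illustration, not OpenChaos code: `FaultWindow` and `fixed_interval_schedule` are hypothetical names, and we assume that each fault lasts for a fixed period and is followed by a recovery period of fixed length.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FaultWindow:
    start_s: int  # second at which the fault is injected
    end_s: int    # second at which the fault is healed


def fixed_interval_schedule(total_s: int, fault_s: int, recover_s: int) -> List[FaultWindow]:
    """Alternate healthy and faulty periods until the experiment time runs out."""
    windows = []
    t = recover_s  # begin with a healthy period to establish a baseline
    while t + fault_s <= total_s:
        windows.append(FaultWindow(t, t + fault_s))
        t += fault_s + recover_s
    return windows


# Assumed parameters: a 600 s run with 30 s faults separated by 30 s of recovery.
schedule = fixed_interval_schedule(total_s=600, fault_s=30, recover_s=30)
print(len(schedule), schedule[0], schedule[-1])
```

With these assumed parameters the schedule contains ten fault windows. A real detection model would attach a fault type (for example a network partition of the primary node) and a target node to each window.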
Measurement model
Because distributed systems are complex, observing their resilience requires a simple, intuitive analysis report that makes it easier to find possible defects and weaknesses. The measurement model analyzes the system's behavior and outputs results and charts computed in a unified, standardized way, so that users can compare them.
Take the stability evaluation of a messaging system as an example. Based on the faults injected and the system's behavior during the experiment, the measurement model calculates the system's RPO (Recovery Point Objective) and RTO (Recovery Time Objective). It also reports the delivery semantics of the cluster (whether it conforms to at-least-once or exactly-once), the fault recovery behavior (whether the system became unavailable during a fault, and how long it took to recover), whether the expected partition ordering holds under faults, and the system's response time throughout the experiment.
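The sketch below illustrates how recovery time and data loss could be derived from a recorded operation history. It is a simplified sketch under our own assumptions, not the measurement model's actual algorithm: we assume the history is a list of (timestamp, success) samples, measure recovery time as the span from the first failed operation after the fault to the next successful one, and treat data loss as acknowledged writes that cannot be read back. The function names and sample values are hypothetical.

```python
from typing import List, Optional, Set, Tuple

Sample = Tuple[float, bool]  # (timestamp in seconds, operation succeeded?)


def recovery_time(samples: List[Sample], fault_start: float) -> Optional[float]:
    """Seconds between the first failure after fault_start and the next success.

    Returns 0.0 if the system never became unavailable and None if it never recovered.
    """
    first_failure = None
    for ts, ok in samples:
        if ts < fault_start:
            continue
        if not ok and first_failure is None:
            first_failure = ts
        elif ok and first_failure is not None:
            return ts - first_failure
    return None if first_failure is not None else 0.0


def lost_writes(acked: Set[int], read_back: Set[int]) -> Set[int]:
    """Writes the cluster acknowledged but that are missing after the experiment."""
    return acked - read_back


# Invented sample history for illustration only.
history = [(0.0, True), (10.0, True), (11.0, False), (14.0, False), (18.0, True)]
print(recovery_time(history, fault_start=10.0))
print(lost_writes({1, 2, 3}, {1, 2, 3}))
```

Distinguishing "never unavailable" (0.0) from "never recovered" (None) matters for reporting: the second case signals that manual intervention was needed.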
Reliability case analysis
We used OpenChaos to test the reliability of an ETCD cluster and found that, when the primary node is disconnected from the network and isolated in its own partition, the cluster lacks automatic recovery capability from the client's point of view.
Figure 2
The following is an example of OpenChaos experimental results for a 3-node ETCD cluster in which the primary node is disconnected from the secondary nodes and ends up in a partition of its own. The simulated traffic rate is 1000 tps.
Figure 3
The figure shows that the experiment lasted 10 minutes, during which ten primary-node network partition faults were injected at 30-second intervals. The cluster behaved inconsistently across the faults. The following figure shows more detailed results.
During the 1st, 3rd, 6th, and 8th faults, the cluster could not recover by itself. During the other faults, it took 7 seconds for the cluster to return to an available state. No data was lost at any point in the experiment.
Figure 4
The experiment logs show that the cluster was able to transfer the primary role every time the primary node was partitioned. Analysis of the source code showed that when the ETCD client encounters an ETCD internal error, it does not retry by connecting to other nodes. As a result, when the node the client connects to first is the primary and that node becomes unavailable, the business remains unrecovered even though the primary role has been transferred successfully and the cluster as a whole is available again. We have reported this to the ETCD community and are waiting for a fix.
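The behavior that was missing can be illustrated with a generic failover wrapper. This is a sketch of the general technique only: it does not use or reproduce the ETCD client API, and it is not the fix adopted by the ETCD community. The names `call_with_failover`, `AllEndpointsFailed`, `fake_put`, and the node names are hypothetical.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class AllEndpointsFailed(Exception):
    """Raised when no endpoint in the list could serve the request."""


def call_with_failover(endpoints: Sequence[str], op: Callable[[str], T]) -> T:
    """Try the operation on each endpoint in order, moving on after a connection error."""
    last_error: Optional[Exception] = None
    for endpoint in endpoints:
        try:
            return op(endpoint)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next node
    raise AllEndpointsFailed(f"none of {len(endpoints)} endpoints answered") from last_error


def fake_put(endpoint: str) -> str:
    # Stand-in for a real request: the first node is assumed to be partitioned.
    if endpoint == "node-a":
        raise ConnectionError("node-a is partitioned")
    return f"ok via {endpoint}"


result = call_with_failover(["node-a", "node-b", "node-c"], fake_put)
print(result)
```

With a wrapper like this, a client whose preferred node is isolated would reach one of the remaining nodes once the primary role has been transferred, instead of staying in a failed state.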
Elasticity case analysis
Elasticity is another key capability of distributed systems. In addition to reliability, OpenChaos supports measuring and evaluating a system's ability to scale out and in. Unlike reliability, elasticity cannot be probed by scheduling events at a fixed frequency. Instead, OpenChaos triggers scaling according to thresholds that the user sets on operating-system or business metrics. For example, you can specify that the expected average CPU utilization of the cluster is 40%, or that the expected response time of the system is 100 ms. The elasticity detection model uses the specified expected value, the current system performance, and OpenChaos's built-in algorithm to calculate the target size, and then triggers the scaling action. After the experiment, the measurement model calculates the cluster's "speedup efficiency" and "scaling cost", along with the cluster's performance at each scale.
Note: "speedup efficiency" and "scaling cost" are the metrics OpenChaos uses to measure the elasticity of distributed systems. The former represents how well the distributed system parallelizes, and the latter represents the rate at which the system scales.
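The article does not describe OpenChaos's built-in sizing algorithm, so the sketch below swaps in a well-known stand-in: the proportional rule used by the Kubernetes Horizontal Pod Autoscaler, where the desired size is the current size multiplied by the ratio of the observed metric to the expected metric, rounded up. The efficiency formula is likewise the classic parallel-efficiency definition (speedup divided by the scale factor) and is our assumption about what "speedup efficiency" means, not a confirmed OpenChaos formula. All function names and numbers are hypothetical.

```python
import math


def target_size(current_nodes: int, observed: float, expected: float,
                min_nodes: int = 1, max_nodes: int = 100) -> int:
    """Proportional sizing: scale node count by observed/expected, rounded up."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    desired = math.ceil(current_nodes * observed / expected)
    return max(min_nodes, min(max_nodes, desired))  # clamp to the allowed range


def speedup_efficiency(base_throughput: float, scaled_throughput: float,
                       base_nodes: int, scaled_nodes: int) -> float:
    """Classic parallel efficiency: throughput gain divided by the growth in size."""
    speedup = scaled_throughput / base_throughput
    return speedup / (scaled_nodes / base_nodes)


# Assumed example: 3 nodes at 80% average CPU with an expected value of 40%.
print(target_size(current_nodes=3, observed=80.0, expected=40.0))
# Assumed example: doubling the cluster raised throughput from 1000 to 1800.
print(speedup_efficiency(1000.0, 1800.0, base_nodes=3, scaled_nodes=6))
```

An efficiency close to 1.0 would mean throughput grows almost linearly with cluster size, while lower values indicate that added nodes contribute less and less.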
Elasticity covers not only the scaling of instance nodes but also the ability to scale specific business (application) units. To explore best practices for Kafka partitioning, we designed an experiment that scales out the partitions of a single topic. During the experiment we also recorded message send and receive throughput at different partition counts, to understand how the number of partitions affects throughput and which partition count maximizes it.
Figure 5 shows the tps and latency as the partitions of one topic on a three-node cluster are expanded from 1 to 9000.
Figure 5
Figure 6 shows how each metric changes over the course of the experiment.
Figure 6
Figure 7 is a screenshot of the elasticity evaluation results. It shows the system's performance at different scales and the cost and efficiency of each elastic change. changeCost and resilienceEfficienty are the scaling cost and speedup efficiency results described above.
Figure 7
These results show that, for the Kafka cluster of this experimental specification, adding one partition takes about 20 ms on average. Performance is best at 26 partitions, where throughput reaches 1.3 million and overall CPU utilization is 93%. Performance drops significantly once the number of partitions exceeds 450. At 1992 partitions, throughput falls to 38,000 and overall CPU utilization reaches 97%.
Future plans
OpenChaos currently supports integration with many distributed systems, such as Apache Kafka, Apache RocketMQ, DLedger, Redis, and ETCD. With the launch of the Summer of Open Source 2022 program [1], we have opened up more work on integrating distributed systems for university students to choose from and take part in.
In addition, Huawei Cloud works closely with the Chaos Engineering Laboratory and helped the China Academy of Information and Communications Technology release China's first "Distributed Message Queue Stability Evaluation Standard", to which it is the main contributor. The Huawei Cloud middleware messaging product family is also the only application service to have fully passed the standard's acceptance criteria.
Looking ahead, OpenChaos will introduce more general resilience standards and intelligent prediction, so that it can not only evaluate an architecture's existing capabilities but also make predictions from system observations and help avoid anomalies that exceed the system's own resilience. Going a step further, we will continue to polish the project, integrate more distributed systems through ecosystem cooperation, and work to make OpenChaos the ballast stone for building resilient distributed-system architectures. This will promote the continuous evolution of cloud-native architecture, so that at critical moments one can "sit steady in the fishing boat, however the wind and waves rise".
[1] Summer of Open Source 2022 program: https://summer-ospp.ac.cn/#/org/prodetail/221bf0008
About the authors:
Siying, senior R&D engineer, with a deep understanding of and research experience in distributed consensus algorithms, resilient architecture, and pattern recognition.
Jiahao, senior middleware R&D engineer, responsible for the design and development of Huawei Cloud distributed middleware. He specializes in middleware performance optimization and favors simplicity in design.
Mahai, Huawei Cloud middleware reliability technology expert, specializing in chaos engineering, performance testing, and event-driven architecture design.