Chaos engineering practice for Bilibili's distributed KV storage
2022-06-12 12:55:00 【From big data to artificial intelligence】
The author of this issue
Pengliangyou
BiliBili Senior Test Development Engineer
Responsible for quality assurance of Bilibili's infrastructure storage and microservices. He has long worked on quality engineering for middleware, focusing on the design, application, and promotion of test schemes for distributed systems.
01 Background
In a previous article we introduced the exploration and practice of distributed KV storage at Bilibili. This article describes how we ensure the reliability of this highly reliable, highly available, high-performance, and highly scalable storage system, and how we put chaos engineering into practice.
02 The difficulty of large-scale distributed storage
It is well known that large distributed systems are difficult to design, develop, and test [1]. The challenges are many: unexpected node failures, asynchronous interaction between modules, data loss caused by network failures, multithreaded code on multi-core CPUs, and varied client logic. Any of these can lead to serious production incidents. Such corner cases are very hard to find, diagnose, and fix: the root cause may be a single, simple line of code in a concurrent transaction that is triggered only with very small probability, yet the outcome can be disastrous.
For this reason, traditional reliability testing of large distributed systems relies on a very complete set of test cases that tries to cover every possible factor. It usually follows a layered approach: a set of reliability scenarios is built from development, testing, and operations experience and from past production incidents, combined orthogonally with concurrency and boundary-value cases, and then verified one by one in a test environment to check whether the reliability design meets expectations. Even so, some unexpected or multi-factor failures are inevitably missed.
This is especially true of large distributed storage systems, which must additionally guarantee data consistency and persistence: for core businesses, any data corruption or loss is unacceptable. On top of the usual reliability scenarios, we therefore also need scenarios, tests, and verification schemes for consistency and persistence. The distributed storage industry has developed for decades, and commercial storage teams have produced well-known testing frameworks, some of them open source, such as P# [2] and Jepsen [3]. But these frameworks are expensive to adopt: a non-commercial storage team can hardly spare extra people within its existing iteration process, and the frameworks demand a lot from their users, often including the cost of learning a special programming language and framework.
03 The significance of chaos engineering
Chaos theory [4] is applied in many research fields. Chaos here does not mean that a system becomes unpredictable once its complexity passes some critical point. It is a way of thinking and of quantitative analysis: the behavior of a dynamic system must be explained and predicted through its holistic, continuous relationships. The same approach can be used to reason about the reliability of a distributed system as a whole and over time, and that is what chaos engineering does.
Chaos engineering is especially well suited to distributed environments. A distributed system is a group of nodes that are connected by a network and share resources. In large distributed systems, components often have complex and unpredictable dependencies, so it is hard to rule out errors or to predict when they will occur.
Distributed systems can fail in many ways, and the larger and more complex the system, the more unpredictable and chaotic its behavior.
Chaos engineering experiments deliberately inject faults into a distributed system in order to test it and find where its reliability can be improved. Examples of what they may uncover include:
- Blind spots: places where monitoring is missing or hard to reach.
- Hidden errors: failures that appear only under low-probability or extreme conditions, for example when several scenarios are superimposed.
- Performance bottlenecks: efficiency and performance problems, and unreasonable implementations.
In August 2008, a database storage failure at Netflix caused a three-day outage. The company then developed corresponding testing tools and, in 2015, published the Principles of Chaos Engineering [5]. By applying chaos to a complex system, the practice builds the ability and confidence of storage products' reliability solutions to cope with turbulent conditions. The experiments follow four steps:
- Define the "steady state" as some measurable output of the system that indicates normal behavior.
- Hypothesize that this steady state will continue in both the control group and the experimental group.
- Inject into the experimental group events that could actually occur in production.
- Verify the impact of these events by the difference in steady state between the control group and the experimental group.
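The four steps can be sketched as a small comparison between the two groups. This is a minimal illustration, not the tooling described in this article: the `GroupStats` type, the choice of error rate as the steady-state metric, and the `tolerance` parameter are all assumptions made for the example.

```go
package main

// GroupStats holds a hypothetical steady-state measurement for one group:
// request and error counts over the observation window.
type GroupStats struct {
	Requests int
	Errors   int
}

// ErrorRate returns the fraction of failed requests, or 0 when no
// requests were observed.
func (g GroupStats) ErrorRate() float64 {
	if g.Requests == 0 {
		return 0
	}
	return float64(g.Errors) / float64(g.Requests)
}

// SteadyStateHolds reports whether the experimental group's error rate
// exceeds the control group's by no more than tolerance. A lower error
// rate in the experimental group always passes.
func SteadyStateHolds(control, experiment GroupStats, tolerance float64) bool {
	return experiment.ErrorRate()-control.ErrorRate() <= tolerance
}
```

For example, with a control group at 2 errors in 1000 requests and a tolerance of 0.005, an experimental group at 4 errors in 1000 still satisfies the hypothesis, while one at 50 errors in 1000 refutes it. A real check would likely compare several business metrics, such as latency and task success, rather than one.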
The Principles also put forward five advanced principles. In practice, applying the chaos engineering method has to take into account each system's actual environment and organizational structure, so we add some notes from our own practice:
- Build a hypothesis around steady-state behavior. Steady state is defined by whether the business is stable, not by system resource metrics. The metrics here are business metrics that directly measure service quality, such as task success and execution time, request latency, and error rate. Smoke regression cases can even be designed to probe them.
- Simulate real or plausible production fault scenarios, such as network latency, service exceptions, and, for storage systems, replica exceptions.
- Run experiments in production. You do not have to run in the live production environment, but the experimental environment should be as realistic as possible. Responsibilities are divided differently from team to team, and not every system is reliable enough to run experiments in production the way Netflix does. The more realistic the test environment, the more valuable the chaos engineering practice. For a large distributed storage system, this means the test environment must replicate the online environment proportionally, including the different kinds of business and load.
- Automate experiments to run continuously. There are two reasons. First, a single run means little for a real system that keeps changing: the code is iterated and user scenarios change, and only continuous runs can catch regressions that reintroduce a fault. Second, many faults occur only with some probability, sometimes only after hundreds of runs, and only continuous automated runs can cover them.
- Minimize the blast radius. The fault injection tools must offer fine-grained configuration and control, because the experimental environment may serve several purposes, or even carry production traffic, and the experiment must not interfere with other tests and services.
04 Chaos engineering practice
4.1 Establish steady state hypothesis
Bilibili's distributed KV storage is developed iteratively along with business requirements. Its functionality has gradually come to cover all of the company's business lines, and its performance and reliability are likewise improved step by step, so in practice the steady-state criteria need to be updated and refined continually.
4.2 Real user scenarios
For better test results, the test environment must simulate the real environment as closely as possible, replicating the online environment proportionally in both logical deployment and architecture:
- Hardware of the same specification, with the access layer deployed in containers and the data layer on physical machines.
- Two KV storage clusters, simulating physically isolated deployment across multiple data centers.
- Multiple region partitions deployed within a single cluster.
- Raft groups of the same scale within each region partition.
- A variety of storage engines, to suit different business applications such as community, live streaming, games, and accounts.
- Different load models and data models.
4.3 Design and continuously run experiments based on real online scenarios
We use the data migration scenario to illustrate the main parts of a chaos experiment design. The real user scenarios and the various faults (Monkeys) identified during functional and exception testing are combined at random to form a chaos experiment for the scenario, which then runs continuously in the test environment to uncover more reliability weaknesses and to strengthen the R&D team's confidence in the system. The rest of this section uses the storage data migration scenario as the example:
4.3.1 Steady-state metrics
Business monitoring collects the latency and success rate of the PUT/GET/DEL data operations and detects data loss, and resource monitoring runs alongside it. A threshold is agreed for each system metric in the experiment scenario. For a storage system, data persistence and consistency must also be verified, and internal tasks and node status can be checked through monitoring and scripts.
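One way to detect data loss is to remember every acknowledged write and compare it with what the store returns afterwards. The sketch below assumes a hypothetical `KV` interface and uses an in-memory `MemStore` as a stand-in for the cluster. Neither is the real client API, which this article does not show.

```go
package main

import "fmt"

// KV is a hypothetical minimal client interface, not the real client API.
type KV interface {
	Put(key, value string) error
	Get(key string) (value string, found bool, err error)
	Del(key string) error
}

// MemStore is an in-memory stand-in for the cluster, for demonstration only.
type MemStore struct{ Data map[string]string }

func (m *MemStore) Put(key, value string) error {
	m.Data[key] = value
	return nil
}

func (m *MemStore) Get(key string) (string, bool, error) {
	v, ok := m.Data[key]
	return v, ok, nil
}

func (m *MemStore) Del(key string) error {
	delete(m.Data, key)
	return nil
}

// Checker remembers every acknowledged write and delete.
type Checker struct {
	store    KV
	expected map[string]string // key -> last acknowledged value
	deleted  map[string]bool   // keys whose delete was acknowledged
}

func NewChecker(store KV) *Checker {
	return &Checker{store: store, expected: map[string]string{}, deleted: map[string]bool{}}
}

// Put writes through to the store and records the value only if the
// write was acknowledged.
func (c *Checker) Put(key, value string) error {
	if err := c.store.Put(key, value); err != nil {
		return err
	}
	c.expected[key] = value
	delete(c.deleted, key)
	return nil
}

// Del deletes through to the store and records the delete only if it
// was acknowledged.
func (c *Checker) Del(key string) error {
	if err := c.store.Del(key); err != nil {
		return err
	}
	delete(c.expected, key)
	c.deleted[key] = true
	return nil
}

// Verify returns one message per key whose stored state differs from
// what was acknowledged. The order of messages is not defined.
func (c *Checker) Verify() []string {
	var problems []string
	for key, want := range c.expected {
		got, found, err := c.store.Get(key)
		switch {
		case err != nil:
			problems = append(problems, fmt.Sprintf("%s: read failed: %v", key, err))
		case !found:
			problems = append(problems, fmt.Sprintf("%s: acknowledged write is missing", key))
		case got != want:
			problems = append(problems, fmt.Sprintf("%s: got %q, want %q", key, got, want))
		}
	}
	for key := range c.deleted {
		if _, found, err := c.store.Get(key); err == nil && found {
			problems = append(problems, fmt.Sprintf("%s: deleted key is still readable", key))
		}
	}
	return problems
}
```

Calling `Verify` after the faults have been cleared gives a list of lost, stale, or resurrected keys, and an empty list means the acknowledged data survived. The sketch is single-threaded and treats a failed request as not applied. A request that times out under fault injection may in fact have succeeded, so a real checker would need to treat such keys as indeterminate.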
4.3.2 User scenarios
Building the user scenario has two sides. One is simulating online traffic, either by recording and replaying online requests or by generating test traffic from custom rules. The other is simulating tasks and operations: in the migration scenario, for example, cluster expansion and shrinking, cross-region migration, and task cancellation all need to be included in the experiment. With both in place, the realistic "steady-state" user scenario that a chaos experiment needs is complete.
- Encapsulate the background traffic and check the return value of each request
  - Single business requests: PUT/GET/DEL
  - Batch requests: PUT/GET/DEL
- Encapsulate the data migration scenarios and check the task status
  - Scale-out within a region
  - Scale-in within a region
  - Cross-region migration
  - Cancellation of a cross-region migration
  - …
```go
// Wrap single-key business requests (PUT/GET/DEL) and continuously check the data state.
go func() {
	common.PutGetDelLoop(t, true, b.Client, 1000000, 300)
	close(done1)
}()

// Wrap batch requests (PUT/GET/DEL) and continuously check the data state.
go func() {
	common.PutGetDelBatch(t, true, b.Client)
	close(done2)
}()

// Wrap the user scenario: start a data migration and check the task status.
resp := common.RebalanceTable(base.RemoteServer, common.REBALANCE_TABLE, Table, "0:50%,9:50")
assert.Contains(t, resp, "OK")
log.Info("Rebalance plan: %s", "0:50%,9:50")
```
4.3.3 Fault injection encapsulation
System fault injection has two sides. First, an agent must be installed on each target node to operate on the storage nodes and on the table objects of the various storage engines, for example machine nodes and table information. Second, each fault type is encapsulated as a Monkey, such as CPUMonkey, MemMonkey, or a replica Monkey. The ChaosTest experiment object is then constructed and combined with the Monkeys. The experiment can run continuously in a fixed order or in random combinations, and it can be replayed and reused.
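Replaying a random combination requires that the random choices be reproducible. A common way to get this, shown here as a sketch and not as the implementation used in this article, is to derive the whole schedule from a recorded seed. The `Monkey` type, `Schedule`, and `Run` are hypothetical names introduced for the example.

```go
package main

import "math/rand"

// Monkey is a named fault injector. Inject stands in for a call to the
// agent on a target node.
type Monkey struct {
	Name   string
	Inject func() error
}

// Schedule picks rounds monkeys at random using a generator seeded with
// seed, so the same seed always yields the same sequence.
func Schedule(monkeys []Monkey, seed int64, rounds int) []Monkey {
	if len(monkeys) == 0 {
		return nil
	}
	rng := rand.New(rand.NewSource(seed))
	plan := make([]Monkey, 0, rounds)
	for i := 0; i < rounds; i++ {
		plan = append(plan, monkeys[rng.Intn(len(monkeys))])
	}
	return plan
}

// Run injects the faults in plan order and returns the names of the
// faults that were injected. It stops at the first injection error.
func Run(plan []Monkey) ([]string, error) {
	injected := make([]string, 0, len(plan))
	for _, m := range plan {
		if err := m.Inject(); err != nil {
			return injected, err
		}
		injected = append(injected, m.Name)
	}
	return injected, nil
}
```

Logging the seed with each run is enough to replay it later with `Run(Schedule(monkeys, seed, rounds))`. For the fixed-order mode, the monkey slice itself can be passed to `Run` unchanged.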
- Encapsulate each fault type and bind it to the experiment targets

```cpp
#include <cstdlib>
#include <ctime>
#include <string>

class Monkeys {
public:
    Monkeys() {
        // Define the target nodes.
        m_hosts.push_back("172.22.12.25:8000");
        m_hosts.push_back("172.22.12.31:8000");
        m_hosts.push_back("172.22.12.37:8000");
        srand(time(0));
        // Define the target tables.
        m_tables.push_back("test_granite");
        m_tables.push_back("test_quartz");
        m_tables.push_back("test_pebble");
        m_tables.push_back("test_marble_k16");
        // ...
    }

    // CPU monkey: inject a CPU fault on a randomly chosen node.
    void cpu_monkey() {
        std::string host = m_hosts[rand() % m_hosts.size()];
        cpu_load(host);
        LOG_INFO << "CPU MONKEY:" << host;  // stream-style logging assumed
    }

    // Mem monkey: inject a memory fault on a randomly chosen node.
    void mem_monkey() {
        std::string host = m_hosts[rand() % m_hosts.size()];
        mem_load(host);
        LOG_INFO << "MEM MONKEY:" << host;
    }

    // Replica monkey: drop a replica of a randomly chosen table.
    void replica_monkey() {
        std::string table = m_tables[rand() % m_tables.size()];
        drop_replica(table);
        LOG_INFO << "Replica MONKEY:" << table;
    }

    // Encapsulate the other monkey types
    // ...
};
```

- Define the chaos experiment, bind each monkey, and keep it running

```cpp
#include <functional>

class ChaosTest : public ::testing::Test {
protected:
    ChaosTest() {
        m_monkeyVec.push_back(std::bind(&Monkeys::cpu_monkey, &m_monkeys));
        m_monkeyVec.push_back(std::bind(&Monkeys::mem_monkey, &m_monkeys));
        m_monkeyVec.push_back(std::bind(&Monkeys::disk_monkey, &m_monkeys));
        m_monkeyVec.push_back(std::bind(&Monkeys::network_monkey, &m_monkeys));
        m_monkeyVec.push_back(std::bind(&Monkeys::kill_node_monkey, &m_monkeys));
        m_monkeyVec.push_back(std::bind(&Monkeys::stop_node_monkey, &m_monkeys));
        m_monkeyVec.push_back(std::bind(&Monkeys::stop_meta_monkey, &m_monkeys));
        m_monkeyVec.push_back(std::bind(&Monkeys::replica_monkey, &m_monkeys));
        ...
```
4.4 Recording and analysis of experimental results
Metrics and data are collected before, during, and after the experiment for the final analysis and for the stability improvements that follow. Because the chaos experiments run unattended and continuously, they keep uncovering probabilistic problems and improving system stability.
- Monitoring of system logs while the experiment runs
- Regression verification of business functional cases after the experiment
- Collection and persistence of experiment metrics and monitoring data through Prometheus
- Grafana dashboards for visualization and alerting
05 Results and benefits
Over one and a half years of continuous operation, the practice intercepted several serious system problems that arise only when scenarios are superimposed. For example, when replica loss was superimposed on a heavy data load, contention between internal asynchronous threads could put a raft node into an abnormal state. Problems like this are hard to find with traditional reliability fault scenarios. Compared with traditional distributed testing frameworks, the approach also has a better input-output ratio in several respects:
- Lower maintenance cost than traditional storage testing frameworks.
- Coverage of more real business scenarios, so complex scenarios are not missed.
- Gradual development and evolution in step with product maturity and iteration progress.
- Low cost of incremental development and maintenance. It follows the open-closed principle, so new scenarios do not interfere with existing experiments.
06 Standardization and service
6.1 Standardization
In 2021, the China Academy of Information and Communications Technology (CAICT) published a chaos engineering practice guide [6]. It can be used to assess an organization's practical capability and its chaos engineering practice, reflecting the feasibility, effectiveness, and safety of that practice. As the system architecture capability keeps improving, chaos engineering experiments need to be standardized.
6.2 As a service
There are many chaos engineering tools, and most fault injection tools are open source, such as Chaos Blade and Chaos Mesh. But system architectures differ from company to company, so in practice the fault injection tools need to be integrated and adapted further to form a service platform of one's own.
References
[1] https://www.microsoft.com/en-us/research/wp-content/uploads/2016/04/paper-1.pdf
[2] https://github.com/p-org/PSharp
[3] https://github.com/jepsen-io/jepsen
[4] https://baike.baidu.com/item/混沌理论
[5] https://principlesofchaos.org/
[6] http://www.caict.ac.cn/kxyj/qwfb/ztbg/202112/P020211223588643401747.pdf
This article is an original work by 「xiaozhch5」 of the blog From Big Data to Artificial Intelligence, published under the CC 4.0 BY-SA license. When reprinting, please include the original source link and this statement.
Original link: https://lrting.top/backend/5948/