
Chaos engineering practice for Bilibili's distributed KV storage

2022-06-12 12:55:00 From big data to artificial intelligence

About the author

Peng Liangyou

Bilibili Senior Test Development Engineer

Responsible for quality assurance of Bilibili's infrastructure storage and microservices. He has long worked on quality engineering for middleware, focusing on the design, application, and promotion of test schemes for distributed systems.

01 Background

In a previous article we introduced the exploration and practice of Bilibili's distributed KV storage. This article focuses on how the reliability of this highly reliable, highly available, high-performance, and highly scalable distributed KV storage system is guaranteed, and how chaos engineering is implemented for it.

02 The difficulty of large-scale distributed storage

It is well known that large distributed systems are difficult to design, develop, and test [1]. The challenges are multifaceted: unexpected node failures, asynchronous interaction logic between modules, data loss caused by network communication failures, multithreaded code on multi-core CPUs, and all kinds of client logic can easily lead to very serious online incidents. These extreme cases are very hard to find, reproduce, and fix; the root cause may be a single line of code in a concurrent transaction that triggers with very small probability, yet the final result can be disastrous.

Because of this, reliability testing of traditional large-scale distributed systems requires very complete test cases to cover all possible influencing factors. Usually, based on a layered approach and on development, testing, and operations experience together with historical online cases, a set of reliability scenarios is established, various concurrency and boundary-value cases are covered orthogonally, and each is verified one by one in a test environment to check whether the reliability design meets expectations and is complete. Even so, it is inevitable that some unexpected or multi-factor faults will be missed.

This is especially true for large distributed storage systems, where data consistency and durability deserve extra attention: any data corruption or loss is unacceptable to core businesses. Beyond the usual distributed-system reliability scenarios, we also need to design scenarios and test and verification schemes for data consistency and durability. The distributed storage industry has been evolving for decades, and every commercial storage team has its own testing framework, some of them open source, such as P# [2] and Jepsen [3]. However, these frameworks are expensive to apply: for a non-commercial storage team it is hard to fit the extra manpower into the existing iterative development process, and they place high demands on operators, often requiring learning a special programming language and framework.

03 The significance of chaos engineering

Chaos theory [4] has been applied in many research fields. Chaos does not simply mean that beyond a certain critical point the complexity of a system becomes unpredictable; rather, it is a way of thinking and of quantitative analysis: when discussing a dynamic system, its behavior must be explained and predicted in terms of the whole and of continuous relationships. For the reliability of a distributed system, we can use the same method to think about and analyze its wholeness and continuity; the corresponding practice is chaos engineering.

Chaos engineering is especially suited to distributed environments. A distributed system is a group of nodes that connect and share resources over a network. In large distributed systems, components often have complex and unpredictable dependencies, and it is difficult to rule out errors or predict when they will occur.

There are many ways in which a distributed system can fail, and the larger and more complex the system, the more unpredictable and chaotic its behavior.

Chaos engineering experiments deliberately inject various faults and anomalies into a distributed system to test it and to find points where reliability can be improved. Examples of what may be found include:

  • Blind spots: places where monitoring is missing or hard to reach.
  • Hidden errors: failures in low-probability and extreme scenarios, for example when multiple scenarios are superimposed.
  • Performance bottlenecks: efficiency and performance problems, and unreasonable implementations.

In August 2008, a database storage failure at Netflix caused a three-day outage. They subsequently developed corresponding testing tools and in 2015 published the Principles of Chaos Engineering [5]: by deliberately applying chaos to a complex system, the team strengthens the ability of its storage products' reliability design to cope with chaos, and its confidence in that design. These experiments follow four steps (a minimal sketch of one iteration follows the list below):

  1. Define a "steady state" in terms of some measurable output of the system under normal behavior.
  2. Hypothesize that this steady state will hold in both the control group and the experimental group.
  3. Inject into the experimental group the kinds of event factors that can actually occur in the real production environment.
  4. Verify the impact of these factors by comparing the state difference between the control group and the experimental group.
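To make the four steps concrete, below is a minimal Go sketch of one experiment iteration. Everything in it is illustrative: the Group type, the Measure and InjectFault hooks, and the tolerance values are assumptions made for this sketch, not part of any particular framework.

package chaos

import (
    "fmt"
    "math"
    "time"
)

// SteadyState captures the measurable business outputs that define "normal" behavior.
type SteadyState struct {
    SuccessRate float64       // fraction of requests that succeeded
    P99Latency  time.Duration // 99th-percentile request latency
}

// Group is a stand-in for one side of the experiment (control or experimental).
type Group struct {
    Name        string
    Measure     func() SteadyState // collect current business metrics
    InjectFault func() error       // apply one real-world fault
}

// withinTolerance checks whether the experimental group stayed close to the control group.
func withinTolerance(ctrl, exp SteadyState) bool {
    return math.Abs(ctrl.SuccessRate-exp.SuccessRate) < 0.01 &&
        exp.P99Latency < 2*ctrl.P99Latency
}

// RunIteration executes one iteration of the four steps described above.
func RunIteration(control, experiment Group) (bool, error) {
    // Step 1: define the steady state from measurable outputs under normal behavior.
    baseline := control.Measure()
    fmt.Printf("baseline steady state of %s: %+v\n", control.Name, baseline)

    // Step 2: hypothesis - both groups will keep this steady state.
    // Step 3: inject an event that could really happen in production, into the experimental group only.
    if err := experiment.InjectFault(); err != nil {
        return false, err
    }

    // Step 4: verify the hypothesis by comparing the two groups after injection.
    return withinTolerance(control.Measure(), experiment.Measure()), nil
}

In a real run, Measure would be backed by the business monitoring described in section 4.3.1, and InjectFault by one of the Monkeys described in section 4.3.3.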

The Principles also put forward five advanced principles. In practice, the chaos engineering method must be adapted to each system's actual environment and organizational form; here we add some of the adjustments made in our own practice:

  • Build a hypothesis around steady-state behavior. What defines whether the business is steady is not a system resource metric but business metrics that directly measure the system's service quality, such as whether tasks execute successfully and how long they take, request latency, error rate, and so on; smoke or regression cases can even be designed to detect this.
  • Simulate real or plausible fault scenarios from the production environment, such as network latency, service exceptions, and replica exceptions in the storage system.
  • Run experiments in the production environment. This does not necessarily mean running online in production, but the experimental environment should be as real as possible. In practice, team members' responsibilities differ, or the system's reliability is not yet sufficient, to run in production the way Netflix does. The more realistic the test environment, the more valuable the chaos engineering practice, so for a large distributed storage system we need to ensure the test environment replicates the real online environment in equal proportion, including the various types of businesses and workloads.
  • Run experiments continuously and automatically. There are two reasons. On the one hand, a single run has limited significance for a real system that keeps changing: the system code is iterating and user scenarios are evolving, and only continuous runs can reduce regressions where old faults recur. On the other hand, many fault scenarios only occur with some probability, or only trigger after hundreds of runs; only continuous automatic operation can cover them.
  • Minimize the blast radius. This requires the fault injection tools to offer fine-grained configuration and control, because the experimental environment may serve multiple purposes or even actual production, and interference with other tests and services must be prevented (a small scoping sketch follows this list).
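As a concrete illustration of minimizing the blast radius, the following Go sketch forces every injection to go through an explicit scope. The FaultScope type and its fields are hypothetical and only show the idea of refusing to touch anything outside the agreed targets; they are not tied to any particular injection tool.

package chaos

import "fmt"

// FaultScope restricts fault injection to an explicit set of targets, so an
// experiment cannot spill over into other tests or services sharing the environment.
type FaultScope struct {
    Nodes  []string // e.g. "172.22.12.25:8000"
    Tables []string // e.g. "test_granite"
}

// allowed reports whether a target is inside the scope.
func (s FaultScope) allowed(target string) bool {
    for _, n := range s.Nodes {
        if n == target {
            return true
        }
    }
    for _, t := range s.Tables {
        if t == target {
            return true
        }
    }
    return false
}

// Inject runs the fault only if the target is within the agreed blast radius.
func (s FaultScope) Inject(target string, fault func(string) error) error {
    if !s.allowed(target) {
        return fmt.Errorf("target %q is outside the experiment scope, refusing to inject", target)
    }
    return fault(target)
}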

04 Chaos engineering practice

4.1 Establish the steady-state hypothesis

Bilibili's distributed KV storage is developed iteratively alongside business requirements; its functionality gradually covers all of the company's business lines, and its performance and reliability are also improved and optimized step by step. In practice, the steady-state criteria therefore need to be continuously updated and refined.

4.2 Real user scenarios

For better test results, the test environment must simulate the real environment as closely as possible, replicating the real online environment proportionally in terms of logical deployment and architecture:

  • Hardware of the same specification: the access layer deployed in containers, the data layer deployed on physical machines.
  • Two KV storage clusters, simulating physically isolated deployment across multiple data centers.
  • A single cluster deployed with multiple region partitions.
  • A single region partition containing raft groups of the same size.
  • Multiple different storage engines, adapted to different business applications such as community, live streaming, games, and accounts.
  • Different load models and data models.

4.3 Design and continuously run experiments based on real online scenarios

The data migration scenario is used to illustrate the main components of chaos experiment scenario design. By randomly combining, through a Monkey, the real user-scenario use cases with the different faults identified in functional and exception testing, we obtain the chaos experiment for the scenario and keep it running continuously in the test environment, uncovering the fault points that affect reliability the most and strengthening the team's confidence in the system. The stored data migration scenario is described below as an example.

4.3.1 Steady-state metrics

Through business monitoring we collect the latency of the various data operations (PUT/GET/DEL), the request success rate, and whether data is lost, together with resource monitoring, and agree on thresholds for each metric of the system under the experimental scenario. For a storage system we also need to verify the durability and consistency of data; internal tasks and node states can be checked through monitoring collection and scripts.
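One possible way to automate this check is sketched below in Go. The threshold values and the Sample fields are illustrative assumptions for the sketch; the real thresholds are agreed per scenario as described above.

package chaos

import (
    "sort"
    "time"
)

// Thresholds agreed for the experiment scenario (illustrative values only).
const (
    maxP99Latency  = 50 * time.Millisecond
    minSuccessRate = 0.999
)

// Sample is one observed PUT/GET/DEL request.
type Sample struct {
    Latency time.Duration
    OK      bool
    Lost    bool // data missing or inconsistent on read-back
}

// SteadyStateOK reports whether the collected samples stay within the agreed thresholds.
func SteadyStateOK(samples []Sample) bool {
    if len(samples) == 0 {
        return false
    }
    latencies := make([]time.Duration, 0, len(samples))
    ok := 0
    for _, s := range samples {
        if s.Lost {
            return false // any data loss immediately breaks the steady state
        }
        if s.OK {
            ok++
        }
        latencies = append(latencies, s.Latency)
    }
    sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
    p99 := latencies[len(latencies)*99/100]
    return float64(ok)/float64(len(samples)) >= minSuccessRate && p99 <= maxP99Latency
}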

4.3.2 User scenarios

Building user scenarios requires, on the one hand, simulating online traffic, either by recording and replaying online requests or by generating traffic from custom rules; on the other hand, it requires simulating various tasks and operations work. In the migration scenario, for example, cluster scale-out and scale-in, cross-region migration, task cancellation, and similar operations all need to be covered and placed into the experimental flow. At this point, the creation of the real "steady-state" user scenario needed for a chaos experiment is complete.

– Encapsulate the background traffic and check the return value of the request

  • Single business request PUT/GET/DEL
  • Batch request PUT/GET/DEL

– Encapsulate data migration scenarios and check task status

  • Scale-out within the region
  • Scale-in within the region
  • Cross-region migration
  • Cancellation of cross-region migration
// Encapsulate business requests PUT/GET/DEL and continuously check the data status
go func() {
    common.PutGetDelLoop(t, true, b.Client, 1000000, 300)
    close(done1)
}()

// Encapsulate bulk requests PUT/GET/DEL and continuously check the data status
go func() {
    common.PutGetDelBatch(t, true, b.Client)
    close(done2)
}()

// Encapsulate the user scenario: data migration, and check the task status
resp := common.RebalanceTable(base.RemoteServer, common.REBALANCE_TABLE, Table, "0:50%,9:50")
assert.Contains(t, resp, "OK")
log.Info("Rebalance plan: %s", "0:50%,9:50")
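common.PutGetDelLoop and common.PutGetDelBatch above are internal helpers of the team's test library, and their real signatures are not shown here. Purely as an illustration of what such a background-traffic loop with data-status checking can look like, here is a self-contained Go sketch built around a hypothetical KV client interface.

package chaos

import (
    "bytes"
    "context"
    "fmt"
    "time"
)

// KV is a stand-in for the storage client used by the background traffic.
type KV interface {
    Put(ctx context.Context, key, value []byte) error
    Get(ctx context.Context, key []byte) ([]byte, error)
    Del(ctx context.Context, key []byte) error
}

// PutGetDelLoop writes, reads back, verifies, and deletes count keys,
// failing fast on any error or inconsistency, within the given time budget.
func PutGetDelLoop(ctx context.Context, kv KV, count int, budget time.Duration) error {
    ctx, cancel := context.WithTimeout(ctx, budget)
    defer cancel()
    for i := 0; i < count; i++ {
        key := []byte(fmt.Sprintf("chaos-key-%d", i))
        val := []byte(fmt.Sprintf("chaos-val-%d", i))
        if err := kv.Put(ctx, key, val); err != nil {
            return fmt.Errorf("put %s: %w", key, err)
        }
        got, err := kv.Get(ctx, key)
        if err != nil {
            return fmt.Errorf("get %s: %w", key, err)
        }
        if !bytes.Equal(got, val) { // consistency check: a read must see the last write
            return fmt.Errorf("data mismatch on %s: got %q want %q", key, got, val)
        }
        if err := kv.Del(ctx, key); err != nil {
            return fmt.Errorf("del %s: %w", key, err)
        }
    }
    return nil
}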

4.3.3 Fault injection encapsulation

For system fault injection, on the one hand an agent needs to be installed on the target nodes to operate on each storage node and on the data-table objects of the various storage engines, such as machine nodes and data table information; on the other hand, the various fault types need to be encapsulated as Monkeys, for example CPUMonkey, MemMonkey, replica Monkey, and so on. In the experiment, a ChaosTest combines the experimental subjects with the various Monkeys; experiments can run continuously in a fixed sequence or in random combinations, and can be replayed and reused.

  • Encapsulate the various fault types and bind them to experimental objects
class Monkeys {
public:
    Monkeys() {
        // Define the target nodes
        m_hosts.push_back("172.22.12.25:8000");
        m_hosts.push_back("172.22.12.31:8000");
        m_hosts.push_back("172.22.12.37:8000");
        srand(time(0));

        // Define the various target subjects
        m_tables.push_back("test_granite");
        m_tables.push_back("test_quartz");
        m_tables.push_back("test_pebble");
        m_tables.push_back("test_marble_k16");
        // ...
    }

    // Encapsulate the CPU monkey: inject CPU-type anomalies
    void cpu_monkey() {
        std::string host = m_hosts[rand() % m_hosts.size()];
        cpu_load(host);
        LOG_INFO "CPU MONKEY:"
    }

    // Encapsulate the mem monkey: inject memory-type anomalies
    void mem_monkey() {
        std::string host = m_hosts[rand() % m_hosts.size()];
        mem_load(host);
        LOG_INFO "MEM MONKEY:"
    }

    // Encapsulate the replica monkey: inject replica-loss anomalies
    void replica_monkey() {
        std::string table = m_tables[rand() % m_tables.size()];
        drop_replica(table);
        LOG_INFO "Replica MONKEY:"
    }

    // Encapsulate the other types of monkeys ...
}
  • Define the chaos experiment, bind each monkey, and keep it running continuously
class ChaosTest : public ::testing::Test {
protected:
    ChaosTest() {
        m_monkeyVec.push_back(std::bind(&Monkeys::cpu_monkey, &m_monkeys));
        m_monkeyVec.push_back(std::bind(&Monkeys::mem_monkey, &m_monkeys));
        m_monkeyVec.push_back(std::bind(&Monkeys::disk_monkey, &m_monkeys));
        m_monkeyVec.push_back(std::bind(&Monkeys::network_monkey, &m_monkeys));
        m_monkeyVec.push_back(std::bind(&Monkeys::kill_node_monkey, &m_monkeys));
        m_monkeyVec.push_back(std::bind(&Monkeys::stop_node_monkey, &m_monkeys));
        m_monkeyVec.push_back(std::bind(&Monkeys::stop_meta_monkey, &m_monkeys));
        m_monkeyVec.push_back(std::bind(&Monkeys::replica_monkey, &m_monkeys));
        ...
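The class above is truncated; conceptually, the experiment body keeps drawing monkeys from m_monkeyVec, either in a fixed order or at random, while the steady-state user scenario runs. A rough sketch of that driving loop, written in Go for brevity (all names are illustrative, not the actual test code):

package chaos

import (
    "context"
    "math/rand"
    "time"
)

// Monkey is one fault injector (CPU load, memory load, replica loss, ...).
type Monkey func() error

// RunForever keeps injecting randomly chosen monkeys at a fixed interval
// until the context is cancelled, mirroring the idea of the C++ ChaosTest above.
func RunForever(ctx context.Context, monkeys []Monkey, interval time.Duration, onErr func(error)) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            m := monkeys[rand.Intn(len(monkeys))] // random combination of fault types
            if err := m(); err != nil {
                onErr(err)
            }
        }
    }
}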

4.4 Recording and analysis of experimental results

Before, during, and after the experiment, all kinds of metrics and data are collected for final data analysis and for landing stability improvements. Through unattended, long-running chaos experiments, we keep finding more probabilistic problems and optimizing system stability.

  • Monitoring of the experiment runner's operation logs
  • Regression verification of service function use cases after the experiment
  • Test-run metric data and monitoring data collected and persisted through Prometheus (a minimal export sketch follows this list)
  • Experiment dashboards in Grafana for visualization and alerting
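For the Prometheus collection mentioned in the list above, the experiment driver can expose its own counters for scraping; a minimal sketch using the standard client_golang library is shown below. The metric names are illustrative, not the team's actual metrics.

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

// Counters exported by the experiment driver so that Grafana dashboards and
// alerts can be built on top of the Prometheus scrapes.
var (
    faultsInjected = prometheus.NewCounterVec(
        prometheus.CounterOpts{Name: "chaos_faults_injected_total", Help: "Faults injected, by monkey type."},
        []string{"monkey"},
    )
    steadyStateViolations = prometheus.NewCounter(
        prometheus.CounterOpts{Name: "chaos_steady_state_violations_total", Help: "Times the steady-state check failed."},
    )
)

func main() {
    prometheus.MustRegister(faultsInjected, steadyStateViolations)

    // Example: record one injected replica-loss fault.
    faultsInjected.WithLabelValues("replica").Inc()

    // Expose /metrics for the Prometheus server to scrape.
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":2112", nil)
}

Grafana dashboards and alert rules can then be built directly on these series.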

05 Results and benefits

During the one and a half years the project has been running continuously, it has intercepted serious system problems involving multiple superimposed scenarios. For example, under heavy data-write pressure with a replica-loss scenario superimposed, contention between internal asynchronous threads could cause a raft node exception; problems like this are hard to find with traditional reliability fault scenarios. At the same time, compared with traditional distributed testing frameworks, its input-output ratio has many advantages:

  • Lower maintenance cost than traditional storage testing frameworks.
  • Covers more real business scenarios and avoids missing complex scenarios.
  • Develops, optimizes, and evolves gradually in step with product maturity and iteration progress.
  • Low incremental development and maintenance cost; it follows the open-closed principle, so new scenarios do not interfere with existing experiments.

06 Standardization and service

6.1 Standardization

In 2021, the China Academy of Information and Communications Technology (CAICT) published a practical guide to chaos engineering [6], which can be used to assess the maturity of an organization's chaos engineering practice and to reflect the feasibility, effectiveness, and safety of that practice. Chaos engineering experiments need standardization as the system architecture's capabilities continue to improve.

6.2 As a service

There are many kinds of chaos engineering tools, and most fault injection tools are open source, such as ChaosBlade and Chaos Mesh. But different companies have different system architectures; in practice we need to further integrate and apply the various fault injection tools to form our own service platform.

References

[1] https://www.microsoft.com/en-us/research/wp-content/uploads/2016/04/paper-1.pdf

[2] https://github.com/p-org/PSharp

[3] https://github.com/jepsen-io/jepsen

[4] https://baike.baidu.com/item/混沌理论

[5] https://principlesofchaos.org/

[6] http://www.caict.ac.cn/kxyj/qwfb/ztbg/202112/P020211223588643401747.pdf

This article was originally written by the blogger "xiaozhch5" of From Big Data to Artificial Intelligence and follows the CC 4.0 BY-SA copyright agreement. For reprints, please attach the original source link and this statement.

Link to the original article: https://lrting.top/backend/5948/
