SRE: How Google Runs Production Systems

2022-07-27 21:13:00 【It's me】

I first came across 《SRE: Google 运维解密》 (Site Reliability Engineering: How Google Runs Production Systems) three years ago. For my own reasons I did not read it carefully at the time, and I only skimmed the surface of what it covers. Perhaps I had not yet climbed out of enough pits or taken enough falls, so I never understood it deeply, and I gradually forgot much of it.

Recently the technical committee revived this book, and I read it carefully again. I found that over the past three years we have put a lot of effort into this area, but without systematic thinking. As a result everyone feels that we have been hammering here and there at random: it feels as if we did something, and also as if we did nothing.

The book is divided into four parts: overview, guiding principles, concrete practices, and management. It defines SRE as follows: "The SRE profession focuses on managing the life cycle of the whole software system, from design through deployment and continuous improvement to a smooth retirement. Such a profession requires a very wide range of skills, but its focus differs from that of other professions. SREs are engineers first; second, SREs focus on reliability." Here is my personal understanding, combining the book's definition with my own experience. The bar for SRE staff is actually very high. An ordinary SaaS-level developer cannot do this job. It requires experience at all three levels (SaaS, PaaS, and IaaS), knowledge of architecture design, software development, and operations, and a personal understanding of stability. If a system goes live and users cannot rely on it, it has no reason to exist. What SREs have to do is apply everything they have learned to make the whole system run more reliably and use resources more efficiently. That does not mean SREs should pursue 100% reliability, because the return on the investment needed to reach 100% is too low.
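
To make the point about 100% reliability concrete, here is a small sketch of how much downtime different availability targets allow. The targets and the 30-day window are illustrative assumptions, not figures from the book.

```python
def allowed_downtime_minutes(availability: float, window_days: int = 30) -> float:
    """Return the minutes of downtime an availability target permits in the window."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - availability)


if __name__ == "__main__":
    # Illustrative targets only; each extra "nine" cuts the allowance tenfold.
    for target in (0.99, 0.999, 0.9999, 1.0):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.2%} -> {minutes:.1f} minutes per 30 days")
```

A 99.9% target leaves roughly 43 minutes per 30 days, while a 100% target leaves nothing at all, so any release or failure would break it. That is one way to see why the last fraction of a percent costs far more than it returns.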

The book puts forward a series of guiding principles on embracing risk, service level objectives, eliminating toil, monitoring, release engineering, automation, and simplicity. Personally, I think the most essential part is the four golden signals of a monitoring system: latency, traffic, errors, and saturation. "Latency is the time a service takes to process a request. Traffic is a measure of the demand placed on the system, expressed as a high-level system-specific metric. Errors are the rate at which requests fail. Saturation is a measure of the resource that is currently most constrained." Here is my personal understanding. Latency is easy to understand, but note that the latency of error responses must be separated from the latency of normal responses. If you do not separate them, a low latency figure means little in practice. The book's description of traffic is rather awkward. In practice it can be read like this: for a web service it is the number of HTTP requests per second, for a file server it is the network I/O rate, and for a database it is the number of read operations per second. The key to the error signal is implicit failure. Everyone watches for HTTP requests that return 500, so that kind of failure is never missed by monitoring. A request that returns HTTP 200 but carries an error inside gets much less attention, yet real business errors often hide in exactly this kind of implicit failure.
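
The points above about latency, traffic, and implicit failures can be sketched as follows. This is a minimal illustration, not the book's implementation: the `Request` record, the `business_code` field, and the convention that `0` means success are all assumptions made for the example, and HTTP 4xx responses are not counted as failures here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    http_status: int    # transport-level status code
    business_code: int  # hypothetical application-level code; 0 means success
    latency_ms: float


def is_failure(req: Request) -> bool:
    """Explicit failure (HTTP 5xx) or implicit failure (e.g. HTTP 200 with a non-zero business code)."""
    return req.http_status >= 500 or req.business_code != 0


def summarize(requests: list[Request], window_seconds: float) -> dict:
    """Compute traffic, error rate, and latency split by outcome for one time window."""
    ok = [r.latency_ms for r in requests if not is_failure(r)]
    failed = [r.latency_ms for r in requests if is_failure(r)]
    return {
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": len(failed) / len(requests) if requests else 0.0,
        "latency_ok_ms": mean(ok) if ok else None,
        "latency_failed_ms": mean(failed) if failed else None,
    }


if __name__ == "__main__":
    sample = [
        Request(200, 0, 20.0),
        Request(200, 0, 40.0),
        Request(200, 1001, 5.0),  # implicit failure: HTTP 200 with a business error
        Request(500, 0, 3.0),     # explicit failure
    ]
    print(summarize(sample, window_seconds=2.0))
```

In this sample the failed requests return quickly, so averaging all four latencies together would make the service look faster than it is for the requests that actually succeed.
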
To monitor implicit failures, we need to analyze the program and develop targeted adapters for its return values, so that monitoring understands the business's own status codes. This leads to a broader point: the importance of unifying status return codes within a company and, one level up, standardizing the format of interface return values. Once these are unified, the adapter work becomes a shared capability inside the company, and teams avoid the embarrassment of reinventing the wheel. The last concept is saturation. The first step is to obtain the peak capacity of the system. Take a web service as an example: first determine, by whatever means are available, the peak number of requests it can process, then compare the traffic described above with this peak to get the saturation of the service. This is not limited to the perspective of a single service. At a higher level, from the perspective of the system, we can pick out the core services and core components, calculate their saturation, and from that derive the saturation of the whole system. This indicator is very valuable for monitoring whether the system keeps running normally over a period of time.
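
The saturation calculation described above can be sketched like this. The service names and numbers are made up, and taking the most saturated core service as the system's saturation is my own assumption about how to aggregate; the text only says that system saturation is derived from the core services.

```python
def saturation(current_rps: float, peak_rps: float) -> float:
    """Ratio of current traffic to the measured peak capacity of a service."""
    if peak_rps <= 0:
        raise ValueError("peak_rps must be positive")
    return current_rps / peak_rps


def system_saturation(services: dict[str, tuple[float, float]]) -> tuple[str, float]:
    """Return the most constrained core service and its saturation.

    `services` maps a service name to (current_rps, peak_rps).
    """
    name = max(services, key=lambda n: saturation(*services[n]))
    return name, saturation(*services[name])


if __name__ == "__main__":
    # Hypothetical core services with (current traffic, measured peak).
    core = {
        "gateway": (800.0, 2000.0),
        "orders": (450.0, 500.0),
        "database": (3000.0, 6000.0),
    }
    bottleneck, value = system_saturation(core)
    print(f"most constrained: {bottleneck} at {value:.0%}")
```

Using the maximum reflects the idea that the system is only as healthy as its most constrained component; a weighted average would hide a single service that is close to its limit.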

The third part of the book introduces concrete practices. Personally, I think the most important part is the service reliability hierarchy. From the bottom up, its layers are: monitoring, incident response, postmortem and root cause analysis, testing and release, capacity planning, software development, and product design. Mastering this hierarchy is like having a checklist for stability. When taking over a system, ask yourself: Is the system covered by monitoring, and does the monitoring follow the four golden signals? Is there an incident handling process and an emergency plan for the system's problems? If it is an old system, are there historical fault records, postmortems, and root cause analyses? If it is a new system, is it connected to the company's internal fault management system (the business continuity management platform)? Does the system have a standardized testing and release process? Have stability-related tests been added so that the software does not ship to production with common, generic problems, such as business logic errors caused by boundary values? Has capacity planning been done for the system, and can the load balancing system use that capacity correctly? Is there a standardized process and framework for software development and product design? When we think systematically, we gain new insights, and our solutions become more complete.
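
The checklist idea above can be written down as a small data structure. This is only a sketch: the layer names follow the order given in the text, while the `first_gap` helper and the way status is recorded are assumptions for illustration.

```python
from typing import Optional

# Layers of the service reliability hierarchy, listed bottom-up as in the text.
RELIABILITY_LAYERS = [
    "monitoring",
    "incident response",
    "postmortem / root cause analysis",
    "testing and release",
    "capacity planning",
    "software development",
    "product design",
]


def first_gap(status: dict[str, bool]) -> Optional[str]:
    """Return the lowest layer that is not yet satisfied, or None if all are."""
    for layer in RELIABILITY_LAYERS:
        if not status.get(layer, False):
            return layer
    return None


if __name__ == "__main__":
    # Hypothetical review of a system that was just taken over.
    review = {"monitoring": True, "incident response": True}
    print("work on next:", first_gap(review))
```

Walking the list from the bottom matches the idea of a hierarchy: a gap in a lower layer, such as monitoring, should be closed before investing in the layers above it.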

The fourth part of the book mainly covers how to quickly train SREs to go on-call, how to handle interrupts, how SRE communicates and collaborates with other teams, and how the SRE engagement model has evolved. This part is mostly about management, but I think one very important point is its description of three engagement models: the simple PRR (Production Readiness Review) model, the early engagement model, and the frameworks and SRE platform model. The frameworks and SRE platform model brings many benefits. It significantly reduces operations cost, because it supports strong conformance checks on code structure, dependencies, tests, coding style guides, and so on. Service deployment, monitoring, and automation are built in, and universal support is part of the design. In these frameworks, code patterns based on production best practices are standardized and encapsulated, so SREs carry less cognitive load in management while service quality is still maintained. From the moment it is created, each standard framework provides a complete solution for a problem domain or for the infrastructure related to that problem.
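
As a rough illustration of what "built-in monitoring" in a framework could look like, here is a minimal sketch in which a decorator records request count, error count, and total time for every handler. The `instrumented` decorator, the `METRICS` store, and the `get_user` handler are hypothetical names invented for this example; they are not taken from the book or from any real framework.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory metrics store; a real framework would export these to a monitoring system.
METRICS = defaultdict(lambda: {"requests": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(name):
    """Wrap a handler so request count, errors, and latency are recorded automatically."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            stats = METRICS[name]
            stats["requests"] += 1
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            except Exception:
                stats["errors"] += 1
                raise
            finally:
                stats["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@instrumented("get_user")
def get_user(user_id: int) -> dict:
    """Hypothetical business handler; it contains no monitoring code of its own."""
    if user_id < 0:
        raise ValueError("invalid user id")
    return {"id": user_id}
```

The design point is that the handler author writes only business logic, and every service built on the same framework reports the same signals in the same way, which is what reduces the cognitive load on the SRE side.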

This book is not empty theory, and its practices can easily be reused by other SRE teams. It was published in 2016, which is already 6 years ago. Looking back at it now, I was surprised to find that its theories and solutions still apply. As for the statement that the fundamental responsibilities and main concerns of SRE have remained basically unchanged over the past decade, I personally think it is too conservative.

Copyright notice
This article was written by [It's me]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/208/202207271818466672.html
