SRE: How Google Runs Production Systems
2022-07-27 21:13:00 【It's me】
I first came across the book "Site Reliability Engineering: How Google Runs Production Systems" (Chinese edition: 《SRE:Google运维解密》) three years ago. For reasons of my own I did not read it carefully at the time, and I only skimmed the surface of much of what it covers. Perhaps because I had fallen into too few pits and taken too few tumbles by then, I did not understand it deeply, and I gradually forgot a lot of it.
Recently the technical committee brought this book up again, so I read it carefully a second time. I found that over the past three years we have put a great deal of effort into reliability, but without systematic thinking. As a result everyone has the feeling of hammering a bit here and a bit there: it feels as if we did something, and also as if we did nothing.
The book is divided into four parts: introduction, principles, practices, and management. It defines SRE as follows: “The SRE profession focuses on managing the life cycle of the whole software system, from design to deployment, through continuous improvement, and finally to smooth retirement. Such a profession requires a very wide range of skills, but its focus differs from that of other professions. First, SREs are engineers; second, the focus of SRE is reliability.” Here is my personal understanding. Judging from the book's definition and my own experience, the bar for SRE staff is actually very high. This is not work that an ordinary SaaS-level developer can take on. It calls for experience at all three levels (SaaS, PaaS, and IaaS), knowledge of architecture design, software development, and operations, and a personal understanding of stability. If a system goes live and users cannot use it reliably, it has no reason to exist. What SREs must do is apply everything they have learned to make the whole system run more reliably and use resources more efficiently. That does not mean SREs should pursue 100% reliability, because the return on the effort required to reach 100% is too low.
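To make the point about 100% reliability concrete, here is a minimal arithmetic sketch. The availability targets and the function name `allowed_downtime_minutes` are illustrative assumptions, not figures from the book. It only shows how quickly the permitted downtime shrinks with each extra "nine"; the cost of achieving each level is not modelled.

```python
# Minutes in a non-leap year; an approximation for illustration only.
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a service may be unavailable and still meet the target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Hypothetical targets: each extra nine divides the allowed downtime by ten.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows about {minutes:.1f} minutes of downtime per year")
```

At a 100% target the allowed downtime is zero, which leaves no room for releases, experiments, or hardware failure. That is one way to read the remark that the return on chasing 100% is too low.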
The second part of the book sets out a series of guiding principles on embracing risk, service level objectives, reducing toil, monitoring, release, automation, and simplicity. Personally, I think the most valuable part is the four golden signals of a monitoring system: latency, traffic, errors, and saturation. “Latency is the time the service needs to process a request; traffic is a measure of the load demand on the system, expressed in a high-level system metric; errors are the rate at which requests fail; saturation is a measure of the resource that is currently most constrained.” Here is my personal understanding. Latency is easy to grasp, but note that the latency of error responses must be separated from the latency of normal responses. If the two are mixed, a low latency figure means nothing. The book's wording on traffic is rather awkward. In practice it can be read as follows: for a web service it is the number of HTTP requests per second, for a file server it is the network I/O rate, and for a database it is the number of read operations per second. The key to the error signal is implicit failure. Everyone watches for HTTP requests that return 500, so that kind of failure is never missed in monitoring. An HTTP request that returns 200 but carries an error inside the response is an implicit failure that gets far less attention, yet the real business errors often hide there.
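The latency, traffic, and error points above can be sketched in a few lines of Python. Everything named here is an assumption made for illustration: the `Response` record, the convention that the response body carries a business `code` field where 0 means success, and the choice to treat 4xx responses as non-failures. None of it comes from the book.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Response:
    http_status: int   # HTTP status code of the reply
    body: dict         # parsed response body
    latency_ms: float  # time taken to serve the request


def is_failure(resp: Response) -> bool:
    """Explicit failure: HTTP 5xx. Implicit failure: non-zero business code in the body."""
    if resp.http_status >= 500:
        return True
    # Assumed convention: a missing "code" or code 0 means business success.
    return resp.body.get("code", 0) != 0


def summarize(responses: list, window_seconds: float) -> dict:
    """Compute traffic, error rate, and latency split by outcome for one time window."""
    ok = [r.latency_ms for r in responses if not is_failure(r)]
    failed = [r.latency_ms for r in responses if is_failure(r)]
    return {
        "traffic_rps": len(responses) / window_seconds,
        "error_rate": len(failed) / len(responses) if responses else 0.0,
        "ok_latency_ms": mean(ok) if ok else None,
        "failed_latency_ms": mean(failed) if failed else None,
    }


# Hypothetical sample: two healthy replies, one implicit failure, one HTTP 500.
sample = [
    Response(200, {"code": 0}, 120.0),
    Response(200, {"code": 0}, 80.0),
    Response(200, {"code": 1001}, 5.0),  # HTTP 200 but a business error inside
    Response(500, {}, 3.0),
]
summary = summarize(sample, window_seconds=2)
print(summary)
```

In this sample the failed requests return very quickly, so a single blended average (52 ms) would look better than the 100 ms that successful requests actually take. Counting only HTTP 500 would also report an error rate of 25% instead of 50%.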
Monitoring implicit failures means analyzing the program and writing adapters for its return values so that monitoring understands the business's own status codes. This leads to a broader point: the importance of unifying status return codes within a company, and beyond that, standardizing the format of interface return values. Once these are unified, the adapter work becomes a shared capability inside the company, which avoids the embarrassment of reinventing the wheel again and again. As for the last concept, saturation, the first step is to find the peak traffic the system can handle. Taking a web service as an example, we first obtain the peak number of requests it can process, by whatever means are available, and then compare the traffic described above with that peak to get the saturation of the service. This is not limited to the view of a single service. At a higher level there is the view of the whole system: pick out its core services and core components and calculate their saturation, and the saturation of the current system as a whole can be derived. This signal is very useful for monitoring whether the system runs normally over a period of time.
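The saturation calculation described above can be sketched as follows. The service names and numbers are made up, and the rule for rolling per-service values up into one system value (taking the most saturated core service) is my own assumption; the text does not say how to aggregate.

```python
def saturation(current_rps: float, peak_rps: float) -> float:
    """Ratio of current traffic to the measured peak the service can handle."""
    if peak_rps <= 0:
        raise ValueError("peak_rps must be positive")
    return current_rps / peak_rps


def system_saturation(core_services: dict) -> tuple:
    """Return (service name, saturation) for the most saturated core service.

    Assumption: the system is as saturated as its most constrained core
    service. Other aggregation rules are possible.
    """
    per_service = {
        name: saturation(current, peak)
        for name, (current, peak) in core_services.items()
    }
    bottleneck = max(per_service, key=per_service.get)
    return bottleneck, per_service[bottleneck]


# Hypothetical core services: name -> (current requests/s, measured peak requests/s).
core = {
    "web": (450.0, 600.0),
    "db": (900.0, 1000.0),
    "cache": (2000.0, 8000.0),
}
name, value = system_saturation(core)
print(f"most saturated service: {name} at {value:.0%}")
```

Tracking this value over time, rather than only at a single moment, is what makes it useful as a signal of whether the system is running normally.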
The third part of the book covers concrete practices. Personally, I think the most important piece is the service reliability hierarchy. From the bottom up it consists of: monitoring, incident response, postmortem and root cause analysis, testing and release, capacity planning, software development, and product design. Mastering this hierarchy is like having a checklist for stability assurance. When taking over a system, ask yourself: Does monitoring cover the system, and does it follow the four golden signals? Is there an incident handling process and an emergency plan for the system's problems? If it is an old system, are there historical fault records, postmortems, and root cause analyses? If it is a new system, is it connected to the company's internal fault management system (the business continuity management platform)? Does the system have a standardized testing and release process? Have stability tests been added specifically, so that the software does not hit common, general problems when released to production, such as business logic errors caused by boundary values? Has capacity planning been done for the system, and can the load balancing system use that capacity correctly? Is there a standardized process and framework for software development and product design? When we think systematically, we gain new insights and arrive at more complete solutions.
The fourth part of the book mainly covers how to train SREs quickly so they can join on-call, how to handle interrupts, how SRE communicates and collaborates with other teams, and how the SRE engagement model has evolved. This part is mostly about management, but I think one very important point is its description of the three engagement models: the simple PRR model, the early engagement model, and frameworks and the SRE platform. The frameworks and SRE platform model brings many benefits. For example, it significantly reduces operations cost, because it supports strong conformance checks on code structure, dependencies, tests, coding style guidelines, and so on. It has service deployment, monitoring, and automation built in, and generality is supported by design. Code patterns based on production best practices are standardized and encapsulated in the framework, which lowers the cognitive burden on SREs while service quality is maintained. Each standard framework provides a complete solution, from the outset, for its problem domain or the related infrastructure.
This book is not empty theory; its content can easily be reused by other SRE teams. It was published in 2016, six years ago. Looking back at it now, I was surprised to find that its theories and solutions still apply. The book says that the fundamental responsibilities and main focus of SRE have remained basically unchanged over the past decade. Personally, I think that statement is too conservative.