SRE: How Google Runs Production Systems
2022-07-27 21:13:00 【It's me】
I first came across Site Reliability Engineering: How Google Runs Production Systems (《SRE:Google运维解密》) three years ago. For my own reasons I did not read it carefully then, and I only skimmed much of what it covers. Perhaps I had not yet climbed out of enough pits or taken enough falls to understand it deeply, and over time I forgot much of it.
Recently the technical committee brought this book back, so I read it again carefully. I found that over the past three years we have put a lot of effort into this area, but without systematic thinking. The work was scattershot, a hammer blow here and another there, which left everyone feeling that we had done something and yet had done nothing at all.
The book is divided into four parts: overview, guiding principles, concrete practices, and management. It defines SRE as follows: "The SRE profession focuses on life-cycle management of the entire software system: from design to deployment, through continuous improvement, to a smooth retirement. Such a profession requires a very broad set of skills, but its focus differs from that of other professions. First, SREs are engineers; second, SRE's focus is reliability."
Here is my personal understanding. Judging from the book's definition and my own experience, the bar for SRE personnel is very high. This is not work that an ordinary SaaS-level developer can take on. It requires experience at all three levels (SaaS, PaaS, and IaaS), knowledge of architecture design, software development, and operations, and one's own understanding of stability. If a system goes live but users cannot use it reliably, it has no reason to exist. What SREs must do is apply everything they have learned to make the whole system run more reliably and use resources more efficiently. That does not mean SREs should pursue 100% reliability, because the return on the effort needed to reach 100% is too low.
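To make the cost of chasing 100% concrete, here is a minimal sketch that converts an availability target into the downtime it permits. The function name and the 30-day window are assumptions chosen for illustration, not something the book prescribes.

```python
def allowed_downtime_minutes(availability: float, days: int = 30) -> float:
    """Minutes of downtime permitted in a window at the given availability target.

    Hypothetical helper for illustration; the 30-day default is an assumption.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - availability)


if __name__ == "__main__":
    # Each extra "nine" cuts the permitted downtime by a factor of ten.
    for target in (0.99, 0.999, 0.9999, 0.99999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.3%} allows {minutes:.2f} minutes per 30 days")
```

A 99.9% target leaves roughly 43 minutes of downtime per 30 days, and every additional nine divides that by ten, while the engineering effort needed to reach it usually grows much faster.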
The book sets out a series of guiding principles on embracing risk, service level objectives, eliminating toil, monitoring, release engineering, automation, and simplicity. Personally, I think the most valuable part is the four golden signals of monitoring: latency, traffic, errors, and saturation. "Latency is the time a service takes to process a request; traffic is a measure of the demand placed on the system, expressed in a high-level system-specific metric; errors are the rate of requests that fail; saturation is a measure of whichever resource is currently most constrained." Here is my personal understanding. Latency is easy to grasp, but note that the latency of failed responses must be separated from that of successful ones. Failures often return quickly, so without this separation a low latency figure has no practical meaning. The book's wording on traffic is rather awkward. In practice it can be read as follows: for a web service it is the number of HTTP requests per second, for a file server it is the network I/O rate, and for a database it is the number of read operations per second. The key to the errors signal is implicit failure. Everyone watches for explicit failures such as HTTP requests that return 500, and monitoring never misses these. An HTTP request that returns 200 but carries an error inside gets far less attention, yet real business errors often hide in exactly this kind of implicit failure.
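The points about latency and implicit failure can be sketched as follows. This is a minimal illustration, assuming each response carries an application-level `business_code` in its body where 0 means success; that field, the `RequestRecord` type, and the `summarize` function are hypothetical names, not part of any real framework.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class RequestRecord:
    http_status: int     # transport-level status, e.g. 200 or 500
    business_code: int   # hypothetical app-level code in the body; 0 means success
    latency_ms: float


def is_failure(r: RequestRecord) -> bool:
    # Explicit failure: HTTP 5xx. Implicit failure: any non-zero business code,
    # even when the HTTP status is 200.
    return r.http_status >= 500 or r.business_code != 0


def summarize(records: list[RequestRecord], window_s: float) -> dict[str, Optional[float]]:
    """Compute traffic, error rate, and latency split by outcome for one window."""
    ok = [r.latency_ms for r in records if not is_failure(r)]
    failed = [r.latency_ms for r in records if is_failure(r)]
    implicit = sum(1 for r in records if r.http_status == 200 and r.business_code != 0)
    return {
        "traffic_rps": len(records) / window_s,
        "error_rate": len(failed) / len(records) if records else 0.0,
        "latency_ok_ms": mean(ok) if ok else None,
        "latency_failed_ms": mean(failed) if failed else None,
        "implicit_failures": float(implicit),
    }


if __name__ == "__main__":
    sample = [
        RequestRecord(200, 0, 120.0),
        RequestRecord(200, 0, 80.0),
        RequestRecord(500, 0, 5.0),      # explicit failure, returns fast
        RequestRecord(200, 1001, 15.0),  # implicit failure hidden behind HTTP 200
    ]
    print(summarize(sample, window_s=2.0))
```

In this sample the successful requests average 100 ms, but mixing in the two fast failures would pull the overall mean down to 55 ms, which is why the two latencies are reported separately. Counting only HTTP 5xx would also report half of the real failures.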
To monitor this kind of implicit failure, the monitoring and analysis program needs targeted adaptation to the program's return values, that is, to the business's own status codes. This leads to a broader point: the importance of unifying status return codes within a company and, one level further, standardizing the format of interface return values. Once these are unified, the targeted adaptation becomes a shared capability within the company, and nobody has to reinvent the wheel. As for the last concept, saturation, the first step is to obtain the system's peak capacity. Taking a web service as an example, first determine by various means the peak number of requests it can process, then compare the traffic described above with this peak to obtain the saturation of the service. This need not stop at a single service. A higher level is the system view: pick out the system's core services and core components, compute their saturation, and you obtain the current saturation of the whole system. This indicator is very valuable for monitoring whether the system runs normally over a period of time.
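The saturation calculation described above can be sketched like this. The text does not say how per-service values should be combined into a system-wide figure, so this sketch assumes the system is as saturated as its most saturated core service; the function names and the sample numbers are hypothetical.

```python
def saturation(current_rps: float, peak_rps: float) -> float:
    """Ratio of current traffic to the measured peak capacity of one service."""
    if peak_rps <= 0:
        raise ValueError("peak_rps must be positive")
    return current_rps / peak_rps


def system_saturation(core_services: dict[str, tuple[float, float]]) -> tuple[str, float]:
    """Return (service name, saturation) for the most saturated core service.

    core_services maps a name to (current_rps, peak_rps). Using the maximum
    as the system-wide value is an assumption made for this sketch.
    """
    if not core_services:
        raise ValueError("core_services must not be empty")
    per_service = {
        name: saturation(current, peak)
        for name, (current, peak) in core_services.items()
    }
    bottleneck = max(per_service, key=lambda name: per_service[name])
    return bottleneck, per_service[bottleneck]


if __name__ == "__main__":
    # Hypothetical measurements: (current requests/s, peak requests/s)
    core = {
        "web": (450.0, 1000.0),
        "db": (1800.0, 2000.0),
        "cache": (3000.0, 10000.0),
    }
    name, value = system_saturation(core)
    print(f"bottleneck={name} saturation={value:.0%}")
```

Taking the maximum matches the idea that saturation tracks the most constrained resource: the web tier here has plenty of headroom, but the database is the component that will fail first as traffic grows.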
The third part of the book covers concrete practices. Personally, I think the most critical piece is the service reliability hierarchy. From the bottom up it consists of monitoring, incident response, postmortem and root cause analysis, testing and release, capacity planning, software development, and product design. Mastering this hierarchy is like having a checklist for stability assurance. When taking over a system, ask yourself: Does monitoring cover the system, and does it follow the four golden signals? Is there an emergency handling process and an emergency plan for the system's problems? If it is an old system, are there historical incident records, postmortems, and root cause analyses? If it is a new system, is it connected to the company's internal fault management system (the business continuity management platform)? Does the system have a standardized testing and release process? Have stability-related tests been added in a targeted way, so that the software does not reach production with common, generic problems such as business logic errors caused by boundary values? Has capacity planning been done for the system, and can the load balancing system use that capacity correctly? Are there standardized processes and frameworks for software development and product design? When we think systematically, we gain new insights and arrive at more complete solutions.
The fourth part of the book mainly covers how to bring new SREs up to speed and onto on-call quickly, how to handle interrupts, how SRE communicates and collaborates with other teams, and how the SRE engagement model has evolved. This part is mostly about management, but I think one very important point is its description of three engagement models: the simple PRR (Production Readiness Review) model, the early engagement model, and frameworks and the SRE platform. The frameworks and SRE platform model brings many benefits. For example, it significantly lowers operations cost, because it supports strong conformance checks on code structure, dependencies, tests, coding style guides, and so on; it has service deployment, monitoring, and automation built in; and it is designed to be general-purpose. The frameworks standardize and encapsulate code patterns based on production best practices, which reduces the cognitive burden on SREs while still maintaining service quality. From its inception, each standard framework provides a complete solution for a problem area or for the infrastructure related to that problem.
This book is not empty theory; its content can easily be reused by other SRE teams. It was published in 2016, six years ago. Looking back at it now, I was surprised to find that its theories and solutions still apply. The claim that SRE's fundamental responsibilities and main focus have remained basically unchanged over the past decade is, in my view, if anything too conservative.