Rethinking SRE: What Does an SRE Do?

2022-07-05 03:27:00 【Zhu Xiaosi】

Many people have asked me what the SRE job is like. It is a big topic, so in this article I will share some of my thoughts.

What is SRE? The concept was first proposed by Google. My understanding is: solve operations problems with software. The main work is standardization, automation, scalability, and high availability. When the position was proposed, the problem it aimed to solve was the conflict between developers, who want to iterate quickly, and operations staff, who want stability and resist frequent updates.

SRE is still a hard position to recruit for. On one hand, the role requires experience, and fresh graduates generally have not operated and maintained complex software. On the other hand, many people still think of it as an "operations" engineer who does low-level repetitive work, so they reject the job. Most fundamentally, the role calls for a developer with operations experience, or an operations engineer with software development skills, and such people are hard to find.

In practice, SRE positions differ greatly between companies. Some may simply have renamed a traditional operations post.

For example, Ant Financial has two kinds of SRE. One is responsible for stability, which is the SRE everyone knows. The other is called fund safety SRE: it is not responsible for keeping services running, but for making sure the money is right and reconciliation has no errors. The work is mainly development, chiefly of the fund verification platform and verification rules (I have never done this work; this is only my personal understanding). In a sense, this is not SRE but development in a specialized domain.

Netflix (as of 2016) follows the model that whoever develops a service also maintains it. SREs are responsible for providing technical support and consulting. Netflix serves 170 countries around the world, yet has only 5 Core SREs.

Microsoft has a dedicated Game Streaming SRE team, responsible for the stability of Xbox online games.

So each company's SRE has its own emphasis, depending on what kind of service the company provides.

Borrowing the idea of network layering, we can divide the general content of SRE work into 3 categories, from bottom to top:

Infrastructure: mainly responsible for the most basic hardware and network, similar to IaaS. DigitalOcean is a reference point.

Platform: provides middleware and out-of-the-box services, similar to PaaS. Heroku, GCP, and AWS are reference points.

Business SRE: maintains services and applications, and keeps the business running normally.

One: Infrastructure

Infrastructure SRE and Platform SRE are actually optional, because more and more of this has been commercialized in recent years. For example, if a company goes all in on AWS to deploy its services, it does not need to build its own datacenter or maintain its own network; a few AWS experts are enough.

If these teams do exist, the scope of work can be large or small. It can start from managing purchased VPS instances, or from purchasing hardware servers.

I think the work of Infrastructure SRE can be defined as:

Server procurement, budgeting, and CMDB management. You need to know (or be able to find out) who is responsible for each machine and what it is doing. This is very important; done badly, it causes a great waste of resources.

Providing a reliable software deployment environment, usually virtual machines or bare metal.

Maintaining operating system versions uniformly: the Linux distribution, the kernel version, and so on.

Maintaining the basic software on the machines, such as NTP, the monitoring agent, and other agents.

Providing machine login, permission management, and command auditing.

Maintaining the observability infrastructure, such as the monitoring system, log system, and trace system.

Maintaining the network. Large companies may design the network in their own datacenters. This includes:

Network connectivity, which is mandatory. For the users one layer up (Platform SRE), the service delivered should be that any two IPs can ping each other, that is, managing the network at layer 3 and below.
NAT service
DNS service
Firewall
Layer 4 load balancing and layer 7 load balancing
CDN
Certificate management

Each item can be a large team, or a single person who integrates a commercial infrastructure service. You can use open source products, or develop your own.

Two: Platform SRE

Infrastructure SRE maintains the infrastructure. Platform SRE uses that infrastructure to build software services, so that developers inside the company get out-of-the-box services such as queues, caches, scheduled tasks, and RPC.

The main work includes:

RPC service: lets different services discover and call each other

Private cloud services

Queue services, such as Kafka or RabbitMQ

Distributed cronjob service

Cache

Gateway service: reverse proxy configuration

Object storage: S3

Other databases: ES, MongoDB, and so on. Generally, relational databases are operated by DBAs, while NoSQL or graph databases are usually maintained by SREs.

Internal development environment:

SCM system, such as a self-hosted GitLab
CI/CD system
Image registry, such as Harbor
Other development tools, such as distributed compilation and Sentry for error management

Some offline computing environments and big data services

Three: Business SRE

With the support of Platform SRE, developers basically do not need to care about deployment when writing code. They can focus on development and use the company's out-of-the-box services. SREs at this layer are closer to the business: they know how the business runs, how a request is handled, and which components it depends on. If X has a problem, they know which degradation strategies are available. They take part in application architecture design and provide technical support.

The main work includes:

Taking part in system design, such as circuit breaking, degradation, and scaling strategies.

Running load tests to understand the capacity of the system.

Doing capacity planning.

Business-side oncall.
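To make the circuit breaking and degradation item above more concrete, here is a minimal sketch. It is only an illustration under my own assumptions, not how any particular company implements it: the `CircuitBreaker` class, its thresholds, and the `fallback` callable are all hypothetical names.

```python
import time


class CircuitBreaker:
    """Minimal breaker: open after N consecutive failures, retry after a cooldown."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # hypothetical threshold
        self.reset_after = reset_after    # hypothetical cooldown in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        # While open and still cooling down, skip the dependency entirely
        # and serve the degraded result.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None  # cooldown over: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

A caller would wrap a dependency call, for example `breaker.call(fetch_recommendations, lambda: [])`, where both function names are made up. The point is that the degraded answer is decided at design time, which is why Business SRE needs to be in the design discussion.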

For a professional SRE, the skills above should not have hard boundaries. For example, a Business SRE also needs some networking skills, and an Infra SRE also writes some code. Many tools are used by people in every position, such as IT automation tools like Ansible/Puppet/SaltStack, or monitoring tools like Grafana/Prometheus; you can only use them correctly if you understand them. Put another way, although a Business SRE basically does not manage the network below layer 4, when a network problem comes up they can use the existing tools and permissions to narrow it down before asking Infra SRE for help. Isn't "Please check whether the switch for xx IP is abnormal, because xxx returned xx" better than "I suspect xx has a network problem, please check"?

That is the rough division of job responsibilities. The layering itself does not matter much; it is only meant to show readers which kinds of work SRE involves.

Below are some of the day-to-day tasks.

Four: Deploying Services

There are two types of deployment:

Day 1: the day the service is first deployed online

Day 2+: after the service is deployed, there will be many updates, upgrades, configuration changes, service migrations, and so on

Day 2+ work is done many times and Day 1 work only rarely, so after constant iteration and upgrades it is hard to keep Day 1 operations reliable. In other words, we keep changing the service after it is deployed, yet we must also make sure it can still be deployed reliably in a new environment. Hard-coding the deployment environment and bizarre workarounds destroy Day 1 reliability. At a previous company of mine, bringing up a new datacenter was a nightmare: too many strange configurations and hardcoded values meant stepping into countless pitfalls before all services were deployed there.

Day 2+ operations are not simple either; the main concern is stability. Important changes need a change plan: how to do a gradual (grayscale) rollout, how to roll back if something goes wrong, how to make sure the rollback will succeed (how to test it), and so on.

Deployment operations should ideally be traceable, because not every operation that causes a problem does so immediately. For example, an operation may finish without any issue, but a month later an accidental restart, or memory reaching a certain level, triggers the problem. If operations are recorded, we can go back through previous changes and locate the problem more easily. Nowadays git is commonly used to track changes in the deployment process (GitOps).
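As a rough illustration of why recorded changes help, here is a small sketch that looks back through a change log for everything that touched a service before an incident. In real life git history plays this role; the `Change` record and the `changes_before` helper are hypothetical names I made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    service: str
    when: datetime
    summary: str


def changes_before(log, service, incident_time, lookback_days=30):
    """Return changes to `service` within `lookback_days` before the incident, newest first."""
    start = incident_time - timedelta(days=lookback_days)
    hits = [c for c in log if c.service == service and start <= c.when <= incident_time]
    return sorted(hits, key=lambda c: c.when, reverse=True)
```

The lookback window matters: a problem triggered a month after the change will not show up if you only look at yesterday's deployments.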

Five: Oncall

In short, oncall means keeping online services running normally. A typical workflow is: receive an alert, check its cause, confirm whether the online service has a problem, locate the problem, and fix it.

Receiving an alert does not always mean there is a real problem; the alert may simply be configured unreasonably. Alerts and monitoring dashboards are not static configuration. They should change day to day and always be under adjustment. If an alert fires when there is no real online problem, change the alert rule. If the current monitoring cannot locate a problem quickly, adjust the dashboard and add or remove metrics. The business grows and the request pattern changes, so thresholds also need constant adjustment.

There is no general method for locating a problem. You go by what you see in real time, combine it with your own experience, form a hypothesis, use tools to verify the hypothesis, and then determine the root cause.

Resolving a problem, however, can follow a methodology called the SOP, or standard operating procedure: if this happens, perform that operation, and the business recovers. SOP documents should be prepared in advance and verified to be effective.

Note that locating a problem and resolving it do not have to happen in that order. A common mistake during an incident is to spend a long time finding the root cause and only then fix it, which usually takes too long. The right approach is to check, based on the symptoms, whether an existing SOP can restore the business. For example, if the errors occur only on one node, take that node offline right away and investigate the cause later. Recovering from the current failure is always the first priority. Recovery operations should still be tested, though: if I guess that a restart will fix the problem, I can restart one instance to test the guess instead of restarting all services at once. Most cases require on-the-spot analysis, which is a stressful and exciting process.
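The "restart one first" idea can be sketched in a few lines. This is only a sketch under my own assumptions: `restart` and `is_healthy` are hypothetical callables standing in for whatever tooling you actually have.

```python
def canary_restart(nodes, restart, is_healthy):
    """Restart one node first; continue with the rest only if it recovers.

    Returns the list of nodes that were actually restarted.
    """
    if not nodes:
        return []
    canary, rest = nodes[0], nodes[1:]
    restart(canary)
    if not is_healthy(canary):
        # The guess was wrong: stop here instead of restarting everything.
        return [canary]
    restarted = [canary]
    for node in rest:
        restart(node)
        restarted.append(node)
    return restarted
```

If the canary does not recover, only one node has been disturbed and the guess has been ruled out cheaply.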

How long should recovery take? How many failures are tolerable? How do we quantify the stability of a service? We use SLI/SLO to answer these questions.

Six: Define and Deliver SLI/SLO

Maintaining a service level agreement sounds simple: just "set an availability target" and go meet it. Reality is different.

For example, when setting an availability target, saying we want to "achieve four nines" (available 99.99% of the time) is not enough. We have to consider the following questions:

How is availability defined? Suppose the target is availability > 99.9% and the service is deployed in 5 zones. If one zone goes down and the rest remain available, is the availability target broken? Is availability calculated per zone, or across all zones together?

What is the smallest unit of the calculation? If the service misses the target for 50s out of 1min, does that minute count as down or up?

Over what period is availability calculated? A month or a week? Is a week the most recent 7 days, or a calendar week?

How are the SLI and SLO monitored?

If the error budget runs out, what measures are taken? Reducing releases, for example? What happens if the SLI and SLO are not met?

And so on. If these questions are not thought through, the SLI and SLO are probably meaningless. SLI/SLO also applies to commitments made to users inside the company: it gives users an expectation of our service rather than blind trust. For example, when Google has error budget left over, it will deliberately cause some disruption to its own service while still meeting the SLI/SLO, so that users do not form a false expectation of 100% availability. SLI/SLO also gives SREs a better understanding of the current stability of the service, so that operations, change, and release plans can be adjusted accordingly.
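To show how much the answers to these questions change the number, here is a minimal sketch. It assumes one particular set of answers (minute granularity, a minute counts as "up" if at least half of its seconds were good); the function names and the 0.5 threshold are my own assumptions, not a standard.

```python
def availability(minute_good_seconds, up_threshold=0.5, seconds_per_minute=60):
    """Fraction of minutes counted as 'up'.

    A minute is 'up' when at least `up_threshold` of its seconds were good.
    """
    if not minute_good_seconds:
        raise ValueError("need at least one minute of data")
    up = sum(
        1 for good in minute_good_seconds
        if good / seconds_per_minute >= up_threshold
    )
    return up / len(minute_good_seconds)


def error_budget_minutes(slo, window_minutes):
    """Minutes of allowed downtime in the window for a given SLO (e.g. 0.999)."""
    return (1 - slo) * window_minutes
```

With these definitions, a minute that was bad for 50s (10 good seconds) counts as down; lower the threshold enough and the same minute counts as up. For scale, a 99.9% SLO over a rolling 7 days (10080 minutes) leaves about 10 minutes of error budget.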

Seven: Postmortems

The only purpose of a postmortem is to reduce the occurrence of failures. Here are a few practices I think work well.

A postmortem needs a document, covering the course of the failure, a timeline, a record of the operations performed, how the failure was recovered, a root cause analysis, and an analysis of why the failure could happen. The document should hide the names of everyone involved and be open to everyone in the company. Many companies restrict who can view failure documents, which I think makes no sense. Some companies even publish their postmortems.

In a postmortem, replacing the names of those involved with code names creates a better atmosphere for discussion.

A postmortem should not be required to always produce an action item. At a previous company of mine, every postmortem had to give leadership an "answer", so each time some measure was introduced to prevent the same failure from recurring, such as adding an approval step. That is nonsense. Asking senior leaders to approve operations they do not understand only makes the leaders suffer more and makes the process long and tedious. Eventually everyone forgets why the approval exists, but no one dares to delete it: if you delete it and something goes wrong, you are responsible.

Blame-free culture? I used to think it was good. But it turns out that problems caused by not following the process do deserve blame: for example, taking a service offline without checking whether it still had TCP connections, or skipping the canary and rolling out everything at once. These failures come from careless behavior. Still, there should not be too many rules, or no work can get done.

Eight: Capacity Planning

Capacity planning is a very complex problem, and even somewhat paradoxical. Capacity should be planned in advance, but planning requires knowing how fast the business will grow, and growth is not something you can plan in advance. So I have always found it hard to do, and I have never seen it done really well.

At the very least, though, you can build a model of the system you maintain: how many machines and how many resources can carry how much load. Then, when an event such as a big promotion comes, you can quickly estimate the resources needed.
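The simplest possible version of such a model is a back-of-the-envelope calculation. The sketch below assumes load scales linearly with machine count, which real systems often do not; `machines_needed`, the 70% target utilization, and the minimum of 2 machines are all my own illustrative assumptions. The per-machine figure would come from load tests.

```python
import math


def machines_needed(peak_qps, qps_per_machine, target_utilization=0.7, min_machines=2):
    """Estimate machine count for a peak load while keeping spare headroom.

    Assumes capacity scales linearly with machine count.
    """
    if qps_per_machine <= 0:
        raise ValueError("qps_per_machine must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable = qps_per_machine * target_utilization
    return max(min_machines, math.ceil(peak_qps / usable))
```

Even a crude model like this lets you answer "the promotion will double our peak, how many more machines do we need?" with a number rather than a guess.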

Nine: User Support

User support is also part of the daily work. It includes technical consultation and troubleshooting online problems at users' request.

The importance of documentation has to be mentioned here. If the documentation is poorly maintained, users will ask the same questions again and again. Writing documentation is also a technical activity, and getting good at it takes a long time. Documentation also needs frequent updates. I usually try to maintain this state: users can find every answer they need in the documentation without asking anyone. If I find that a user's question is not answered in the documentation, or the answer is hard to find, I update or reorganize the documentation. If the answer is already in the documentation, I just send the user the link. If the user obviously has not read the documentation at all (many people never read it; they just look up who wrote it and go straight to that person), I simply ignore the question.

Good documentation introduces as few special terms as possible, avoids useless jargon, describes only instructive facts, assumes the user has no background knowledge, gives examples that would actually be used in practice rather than contrived ones, makes bad cases explicit, and so on. This is a big topic in itself, so I will not go into it here.

That is all I can think of for now. Below are some misunderstandings I often see, and questions I am often asked.

At the end

1. No professional team, and the projects do not build skills

This is the most common complaint. Although it is said that an SRE's time should be split 50% development and 50% operations, the reality is that even when SREs have development work, most of it is for internal users, that is, the company's own developers. Most projects start as an idea that needs to be tried out, with basically no professional design or PM resources. Such projects demand many skills from the SRE: understanding the product, knowing clearly which pain points it addresses (ideally ones you have experienced yourself), and then understanding design and managing the development schedule. Such people are very rare. In fact, SREs who can write code for a medium-sized project are already rare. As a result, most internal projects are hard to use and complex.

Even with professional PM, design, and even front-end resources, the result is basically a disaster. I have been on such a team. These internal projects are not like consumer Internet projects; they are more like toB projects. UI design, interaction logic, operation flow, delivery cycles, and so on require knowledge from another domain. Without it, more people only means higher communication cost and a slower project.

Back to the complaint I often hear: that an SRE team is not a "regular army" like a development team, where there are designers and PMs, everyone does their own job, and back-end developers just align on the API and implement it. Most fresh graduates have this fantasy, but it is not so. The biggest misconception is about learning, which depends mainly on yourself and has little to do with others. I suspect that in a large team, with many people working on one thing, there is less doubt and anxiety; people feel secure in that state and mistake it for "growth", whereas doing all the work yourself brings more worry.

The fact is that working in a large team may teach you more communication skills, such as aligning goals with different people at different stages, but for anything else you still have to rely on yourself. Take being handed a design: if you just implement it as given, you learn nothing. You need to understand why it was designed this way and not another way. If you design it yourself, the thinking process is basically the same: how should I design this, and which option should I choose? It all comes down to: think, choose, try, experience, think again...

Another misunderstanding to clear up: imitation is not learning. If you saw a design in one team, memorized it, and reuse it for a similar problem next time, that cannot be called learning. I have seen code written by an SRE from a payments business department that implemented an operations workflow in an internal system using order-business concepts such as orders and transactions, without even changing the model names. Picking up a hammer and looking for nails only makes systems worse and more complex.

All in all, division of labor does not make the work more professional. One person covering several roles can still be professional in each. What matters is to keep learning, work the right way, and learn from excellent projects and excellent developers.

2. About dirty work

Every job has dirty, tiring work: you learn nothing and it is boring. It may be tidying up the system's monitoring, organizing existing documentation, cleaning up old operations scripts, or handling communication with different teams.

This is unavoidable. Where you can, find a lazier way to do each task, such as handling it with scripts, and work smarter.

But if the proportion of such work is too high, it is time to rethink how you work. If you are stuck in a vicious circle, see whether you can change the tools or the workflow. If not, consider changing jobs.

3. About taking the blame

An environment where people shift blame onto each other is undoubtedly a very bad one. If people within a team, or across teams, have to scheme against each other, and if the environment does not allow you to openly admit your own mistakes (an SRE will inevitably make some), then something is wrong with the atmosphere the company has created.

For example, some companies have a rule that a P1 incident means someone at level Px must be fired, and a P0 incident means someone at level Py must be fired. In that case, the company is using a lazy method: trying to improve system stability by increasing the pressure on people. I do not know whether it works, but it is certain that no one is happy working like that. I suggest changing jobs.

4. How do I switch into SRE?

It is actually not as hard as you might expect; after all, no university offers a major called SRE. The knowledge an SRE needs is still writing code, designing systems, understanding operating systems and networks, and so on. So learn your undergraduate courses well, try building (and maintaining) some projects of your own, and by graduation you basically meet the requirements. People from other backgrounds who want to switch can also follow a university curriculum to fill in this knowledge.

Note that a training course may be enough to do development and deliver business features, but it is not enough for SRE. An SRE not only needs to make things work, but also needs to know the principles behind them.

5. What will the interview ask?

I think the interview content is basically the same as for back-end development.

If the position you apply for requires specific skills, such as K8S or monitoring systems, you may also be asked about those areas. These tool-specific skills can be learned on the job, but if the employer wants someone experienced who can start working right away, the chance of passing the interview is much lower. Of course, there is no need to be discouraged. This is determined by supply and demand in the market: if the other side insists on a candidate who meets specific requirements, their own pool of choices shrinks too. Do not regret not having learned some tool just because you missed one opportunity. Then again, the more skills you have, the more choices you have.

Troubleshooting may be the biggest barrier to switching into SRE, because it takes experience. If you have none, strengthen your operating system knowledge, so that unknown problems can be investigated with known knowledge and tools.

This repository is a good collection of interview questions: https://github.com/bregman-arie/devops-exercises

6. Does an SRE need to be able to write code?

Yes, and the bar for writing code is no lower than for a professional back-end developer.

7. Big company or small company?

These are two very different working environments. A small company usually has a firefighting hero who has been there a long time, knows the deployment structure of every component, and knows everything. Learning alongside such a person makes you grow quickly.

Large companies are divided into many specialties. Each item listed earlier in this article may be a whole team in a large company, studying one field in depth.

So it depends on what you want to do. Personally, I prefer a reliable small company, or a small, reliable team in a large company.

8. How to judge whether a company is reliable?

For the SRE position, I have collected a few rules of thumb. For example, you can judge whether the company's business and its SRE headcount are in a "normal" proportion. Is the headcount growing in step with the business (the number of machines)? That is a bad sign. Are there too many SREs? If so, there are two likely reasons:

A leader is hiring for "unnecessary" positions to expand their influence. This leads to more people than work, and everyone starts doing strange things, inventing strange requirements, and wasting time in various ways while collecting a salary.

The company's technical foundation is too poor, and most work requires manual operations, so it basically needs as many people as it has machines. Either way, it is not a good thing.

Some companies with strong technology do not have a huge SRE team, such as Instagram and Netflix (they may have more people now). Some startups have no dedicated SRE at all. A first-class SRE should first be a developer, and a good developer is not far from being an SRE. Some well-known services with a data volume like that of webarchive actually have only a few people behind them. A few years ago I interviewed at a domestic company that had datacenters all over the world and whose business was already well developed (it was listed), yet its SRE team had only 10 people.

Another question I like to ask is what they think of AIOps. I worked on it for two years, and my final conclusion is that it is basically a waste of time, something for fooling upper management. The unexplainability of AI is fundamentally at odds with the cause-and-effect reasoning that operations requires. So I like to ask interviewers what they think of this technology, and from the answer I can basically tell whether the company is reliable. Of course, this is an aftereffect of my own career trauma and represents only my personal opinion.

That is all. Everything here is my personal understanding and not necessarily correct. Writing this article feels like lecturing others, when in fact I have only worked for a few years, so treat the content as reference only.

Feel free to discuss in the comments.

The author 丨 laixintao

Source: Website ：https://www.kawabangga.com/posts/4481