Challenging assumptions: what does an SRE do?
2022-07-05 03:27:00 【Zhu Xiaosi】
A lot of people have asked me what the SRE job is about. It's a big topic; in this article I'll lay out some of my thoughts.
What is SRE? It is a concept first proposed by Google. My understanding is that it means solving operations problems with software. Standardization, automation, scalability, and high availability are the main work. When the role was proposed, the problem it set out to solve was the conflict between developers, who want to iterate quickly, and operations staff, who want to maintain stability and therefore resist frequent updates.
SRE is still hard to recruit for. On one hand, the position requires some experience, and fresh graduates generally haven't operated complex software. On the other hand, many people still see it as an "operations" engineer role, assume it is low-level repetitive work, and are put off by it. Most fundamentally, the role is looking for developers with operations experience, or operations engineers with software development skills, so the right people are hard to find.
In practice, SRE positions differ greatly between companies; some have simply renamed their traditional operations posts.
For example, Ant Financial has two kinds of SRE. One is responsible for stability, which is the SRE everyone has in mind. The other is called funds-security SRE: it is not responsible for keeping services running, but for making sure the money is right and reconciliation has no errors. The work is mainly development, chiefly of the funds verification platform and verification rules (I've never done it; this is just my understanding). In a sense it is not SRE but development in a specialized domain.
Netflix (as of 2016) follows the model of "whoever develops it maintains it". SREs provide technical support and consulting. Netflix serves 170 countries, yet has only 5 Core SREs.
Microsoft has dedicated Game Streaming SREs responsible for the stability of Xbox online gaming.
So SRE at each company has its own emphasis, depending on what kind of service the company provides.
Borrowing the idea of network layering, we can divide the general work of SRE, from bottom to top, into 3 categories:
Infrastructure: mainly responsible for the most basic hardware facilities and the network, similar to IaaS; DigitalOcean is a reference point
Platform: provides middleware and out-of-the-box services, similar to PaaS; Heroku, GCP, AWS and so on are reference points
Business SRE: maintains services and applications, and keeps the business running normally
I. Infrastructure
Infrastructure and Platform SRE are actually optional, since more and more of this has become available as commercial services in recent years. For example, if the company goes all in on AWS for deploying its services, it doesn't need to build its own datacenter or maintain a network; a few AWS experts are enough.
If the role does exist, the scope can be large or small. It can start from managing purchased VPSes, or from purchasing hardware servers.
I think the work of Infrastructure SRE can be defined as:
Server procurement, budgeting, and CMDB management. You need to know (or be able to find out) who owns each machine and what it is doing. This matters a lot; done badly, it causes a great waste of resources.
Providing a reliable software deployment environment, usually virtual machines or bare metal.
Maintaining operating system versions uniformly: the Linux distribution, the kernel version, and so on.
Maintaining the basic software on the machines, such as NTP, monitoring agents, and other agents.
Providing machine login methods, permission management, and command auditing.
Maintaining the observability infrastructure, such as the monitoring, log, and trace systems.
Maintaining the network; large companies may design their own datacenter networks. This includes:
Network connectivity, which is mandatory. For upper-level users (Platform SRE), the service delivered should be that any two IPs can ping each other, i.e. managing the network at layer 3 and below.
NAT service
DNS service
Firewalls
Layer 4 and layer 7 load balancing
CDN
Certificate management
Each item can be a large team, or a single person managing commercial infra services. You can use open-source products or develop your own.
II. Platform SRE
Infrastructure SRE maintains the infrastructure. Platform SRE uses that infrastructure to build software services, so that developers in the company can use them out of the box: queues, caches, scheduled tasks, RPC services, and so on.
The main work includes:
RPC service: lets different services discover and call each other
Private cloud services
Queue service, such as Kafka or RabbitMQ
A distributed cronjob service
Cache
Gateway service: reverse proxy configuration
Object storage: S3
Other databases: ES, MongoDB, and so on. Generally speaking, relational databases are operated by DBAs, while NoSQL or graph databases are usually maintained by SREs.
Internal development environment:
SCM system, for example a self-hosted GitLab
CI/CD system
Image registry, such as Harbor
Other development tools, for example distributed compilation, Sentry for error management, and so on
Offline computing environments and big data services
III. Business SRE
With Platform SRE's support, developers basically don't need to care about deployment when writing code. They can focus on development and use the company's out-of-the-box services. SREs at this layer are closer to the business: they know how the business works, how a request is handled, and which components it depends on, and they know what degradation strategies are available if X has a problem. They take part in application architecture design and provide technical support.
The main work includes:
Taking part in system design, such as circuit breaking, degradation, and scaling strategies.
Running load tests to understand the capacity of the system.
Doing capacity planning.
Business-side on-call.
For a professional SRE, the skills above should not have hard boundaries. A business SRE also needs some networking skills, and an Infra SRE also writes some code. Many tools are used by people in every position, such as IT automation tools like Ansible/Puppet/SaltStack or monitoring tools like Grafana/Prometheus, and they can only be used correctly if you understand them. To put it another way: a business SRE basically doesn't manage the network below layer 4, but when you hit a network problem and can investigate the switch with the tools and permissions you already have, then asking Infra SRE "please help me check whether the switch for xx IP is abnormal, because xxx shows xx" is surely better than "I suspect xx has network problems, please help check", isn't it?
That is the general division of responsibilities. The layering itself doesn't matter much; it is only meant to show readers what kinds of work SRE involves.
Here is some of the day-to-day work.
IV. Deploying services
There are two types of deployment:
Day 1: the day the service is first deployed online
Day 2+: after the service is deployed there will be many updates, upgrades, configuration changes, service migrations, and so on
Day 2+ work has to be done many times and Day 1 work only rarely, so after constant iteration and upgrades it becomes hard to keep Day 1 operations reliable. In other words, while we keep changing the service after it is deployed, we must also make sure it can still be deployed reliably into a new environment. Hard-coded deployment environments and bizarre workarounds destroy Day 1 reliability. At a previous company, bringing up a new datacenter was a nightmare: too many strange configurations and hardcoded values meant stepping into countless pitfalls before all services were deployed there.
Day 2+ operations are not simple either; the main focus is stability. Important changes need a change plan: how to do a canary (gray) release, how to roll back if something goes wrong, how to make sure the rollback will succeed (how to test the rollback), and so on.
Ideally deployment operations are traceable, because not every operation that causes a problem causes it immediately. An operation may look fine when it completes, but a month later an accidental restart, or memory reaching a certain level, triggers the problem. If operations are recorded, we can go back through earlier changes and locate problems more easily. Nowadays git is generally used to track changes in the deployment process (GitOps).
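To make the value of a recorded change history concrete, here is a minimal sketch of looking up earlier changes once an incident occurs. The `Change` record, its fields, and the sample log entries are all hypothetical; in a GitOps setup the same information would normally come from the commit history of the deployment repository.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """One recorded deployment operation (hypothetical schema)."""
    when: datetime
    service: str
    author: str
    summary: str


def changes_before(log, service, incident_time, lookback):
    """Return changes to `service` inside the lookback window, newest first."""
    start = incident_time - lookback
    hits = [
        c for c in log
        if c.service == service and start <= c.when <= incident_time
    ]
    return sorted(hits, key=lambda c: c.when, reverse=True)


# Made-up sample data: the memory change is a month old when the incident hits.
log = [
    Change(datetime(2022, 5, 2, 10, 0), "gateway", "alice", "raise worker memory limit"),
    Change(datetime(2022, 5, 20, 15, 30), "queue", "bob", "upgrade broker"),
    Change(datetime(2022, 6, 1, 9, 0), "gateway", "carol", "rotate TLS certificate"),
]
incident = datetime(2022, 6, 3, 2, 0)
suspects = changes_before(log, "gateway", incident, timedelta(days=45))
for change in suspects:
    print(change.when.date(), change.author, change.summary)
```

A long lookback window is the point here: a change that only bites after a restart a month later would be missed by a query limited to the last few days.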
V. On-call
On-call, in short, means keeping online services running normally. A typical workflow: receive an alert, check what caused it, confirm whether the online service actually has a problem, locate the problem, fix the problem.
Receiving an alert does not always mean there is a real problem; the alert itself may be badly configured. Alerts and monitoring dashboards are not static configuration. They should change constantly and always be under adjustment. If an alert fires when there is no real online problem, the alert rule should be modified. If the current monitoring doesn't let you locate a problem quickly, the dashboard should be adjusted by adding or removing metrics. The business grows and request patterns change, so thresholds also need continual adjustment.
There is no general method for locating problems. You have to work from what you see at the time, combine it with your own experience, form a hypothesis, verify the hypothesis with tools, and then determine the root cause.
But there can be a methodology for resolving problems, called an SOP (standard operating procedure): if this happens, perform that operation and the business recovers. SOP documents should be prepared in advance and their effectiveness verified.
Note that locating the problem and resolving it need not happen in that order. A common mistake during an incident is to spend a long time locating the root cause and only then fix it, which usually takes too long. The right approach is to look at the symptoms and check whether an existing SOP can restore the business. For example, if the errors occur only on one node, take that node offline right away and investigate the cause later. Recovering from the current incident is always the first priority. But recovery operations should be tested too: if I guess that a restart will fix the problem, I can restart one instance as a test instead of restarting all services at once. Most cases require on-the-spot analysis, which is a stressful and exciting process.
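The "restart one instance first" idea can be sketched as follows. This is only an illustration under stated assumptions: `restart` and `is_healthy` are hypothetical hooks into whatever deployment tooling is in use, and a real procedure would also wait between steps and watch the metrics.

```python
def restart_with_canary(instances, restart, is_healthy):
    """Restart one instance first; continue with the rest only if it recovers.

    `restart` and `is_healthy` are caller-supplied callables (hypothetical
    hooks, not a real library API). Returns the instances that were restarted.
    """
    if not instances:
        return []
    canary, rest = instances[0], instances[1:]
    restart(canary)
    if not is_healthy(canary):
        # The guess was wrong: stop here and leave the other instances alone.
        return [canary]
    restarted = [canary]
    for instance in rest:
        restart(instance)
        restarted.append(instance)
    return restarted


# Dry run with fake hooks: record the restarts instead of performing them.
performed = []
done = restart_with_canary(["node-a", "node-b", "node-c"], performed.append, lambda i: True)
print(done)
```

If the canary does not recover, only one instance has been touched, and the remaining instances keep serving traffic while the investigation continues.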
How long may recovery take? How many incidents are tolerable? How do we express the stability of a service? We use SLIs/SLOs to measure these things.
VI. Defining and delivering SLIs/SLOs
Maintaining service level agreements sounds simple: just "set an availability target" and go achieve it. In reality it is not.
For example, when setting availability, it is not enough to say we will "achieve four nines" (available 99.99% of the time). We have the following questions to consider:
How is availability defined? Say our target is availability > 99.9% and a service is deployed in 5 zones. If one zone goes down while the rest remain available, has availability been broken? Is availability computed per zone, or across all zones together?
What is the smallest unit of the availability calculation? If 50s of a given minute fail to meet the availability criterion, is that minute down or up?
Over what period is availability calculated? A month or a week? Is a week the most recent 7 days, or a calendar week?
How are SLIs and SLOs monitored?
If the error budget runs out, what measures are taken? Fewer releases, for example? What happens if the SLI and SLO are not met?
And so on. If these questions are not thought through, the SLIs and SLOs are probably meaningless. SLIs/SLOs also apply to commitments made to users inside the company, so that users have realistic expectations of our services rather than blind trust. For example, when there is still error budget left, Google will deliberately disrupt a service while staying within its SLI/SLO, so that users don't form the mistaken expectation of 100% availability. SLIs/SLOs also give SREs a better picture of the current stability of the service, so they can adjust operations, change, and release plans accordingly.
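A small sketch shows how much the answers to those questions change the number. Everything here is an assumption made for illustration: the 50% per-minute threshold, the function names, and the sample data are not taken from any particular monitoring system.

```python
def minute_is_up(good_seconds, threshold=0.5):
    """Classify one minute as up if at least `threshold` of its 60s were good."""
    return good_seconds / 60 >= threshold


def availability(good_seconds_per_minute, threshold=0.5):
    """Fraction of minutes classified as up."""
    minutes = list(good_seconds_per_minute)
    if not minutes:
        raise ValueError("no data in the window")
    up = sum(1 for seconds in minutes if minute_is_up(seconds, threshold))
    return up / len(minutes)


def error_budget_minutes(slo, window_days):
    """Minutes of downtime an SLO (e.g. 0.9999) allows over the window."""
    return (1 - slo) * window_days * 24 * 60


# A minute with 50 bad seconds has 10 good ones; the threshold decides its fate.
print(minute_is_up(10), minute_is_up(10, threshold=0.1))

# Per zone versus pooled: one dead zone out of three.
zones = {"z1": [60, 60], "z2": [60, 60], "z3": [0, 0]}
per_zone = {zone: availability(data) for zone, data in zones.items()}
pooled = availability([s for data in zones.values() for s in data])
print(per_zone, pooled)

# Four nines over a 30-day window leaves roughly 4.32 minutes of budget.
print(round(error_budget_minutes(0.9999, 30), 2))
```

The same raw data yields "one zone at 0%" or "about 67% overall" depending on how zones are aggregated, which is why the definition has to be agreed on before the target is.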
VII. Postmortems
The only purpose of a postmortem is to reduce future incidents. Here are a few practices I currently think are good.
A postmortem needs a document covering the course of the incident, the timeline, the operations performed, how service was restored, the root cause analysis, and an analysis of why the incident happened. The document should hide the names of those involved and be open to everyone in the company. Many companies restrict who can view incident documents, which I think makes no sense. Some companies even publish their postmortems.
Replacing the names of those involved with code names in a postmortem creates a better atmosphere for discussion.
Not every postmortem should be required to produce an action item. At a previous company, the leaders had to be given an "explanation" after every postmortem, so each time some measure was added to prevent the same incident from happening again, such as an extra approval step. That's bullshit. Asking senior leaders to approve operations they don't understand only makes the leaders miserable and makes the process long and tedious. In the end everyone forgets why the approval is there, but nobody dares remove it: if you remove it and something goes wrong, you are responsible.
Blame-free culture? I used to think it was good. But it turns out that some problems caused by not following the process deserve blame, for example taking a node offline without checking whether it still had TCP connections, or skipping the canary and rolling a change out everywhere at once; incidents caused by that kind of careless behavior. But there shouldn't be too many rules either, or no work gets done.
VIII. Capacity planning
Capacity planning is a very complex problem, and even somewhat paradoxical. Capacity should be planned in advance, but planning requires knowing how fast the business will grow, and growth is not something that can be planned in advance. So I have always found it hard to do, and I've never seen a good example.
But at least you can build a model of the systems you maintain: how many machines and how many resources can carry how much load. Then, for events such as a big promotion, the resources needed can be estimated in time.
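The simplest form of such a model is a single formula. The sketch below assumes that per-machine throughput has been measured by load testing and that capacity scales linearly with machine count; both are simplifications, and the parameter names and numbers are made up for illustration.

```python
import math


def machines_needed(peak_qps, qps_per_machine, target_utilization=0.5, spares=1):
    """Estimate the machine count for an expected peak load.

    qps_per_machine: measured by load testing (assumed to scale linearly).
    target_utilization: fraction of measured capacity we are willing to use.
    spares: extra machines kept so that losing one does not cause overload.
    """
    if qps_per_machine <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("invalid capacity parameters")
    usable_qps = qps_per_machine * target_utilization
    return math.ceil(peak_qps / usable_qps) + spares


# Hypothetical promotion: 12,000 QPS expected, 500 QPS per machine measured.
print(machines_needed(12000, 500))
```

With these inputs each machine is planned for 250 QPS, so 48 machines carry the peak and one spare brings the total to 49. Real systems rarely scale this cleanly, so the estimate is a starting point to check against load tests.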
IX. User support
User support is also part of the daily work. It includes technical consultation and online troubleshooting requested by users.
Here the importance of documentation has to be mentioned. If the documentation is poorly maintained, users will ask the same questions again and again. Writing documentation is a technical skill too, and doing it well takes long practice. Documents also need frequent updates. What I usually do is maintain this state: users can find every answer they need in the documentation without asking anyone. If I find that a user's question can't be answered from the documentation, or the answer is hard to find, I update or reorganize the documentation. If the answer is already in the documentation, I just send them the link. If the user obviously hasn't read the documentation (plenty of people don't read documentation at all; they just look up who wrote it and go straight to that person), I simply ignore it.
Good documentation introduces as few special terms as possible, avoids useless jargon, describes only instructive facts, assumes the user has no background knowledge, gives examples (ones that would come up in practice rather than contrived ones), spells out the bad cases, and so on. This is a big topic in itself and I won't expand on it here.
That's all I can think of for now. Below are some misunderstandings I often see, and questions I'm often asked.
At the end
1. No professional team, and projects offer no real training
This is the most common complaint. Although I say an SRE's time should be split 50/50 between development and operations, the truth is that even when SREs do development, most of it is for internal users, the developers inside the company. Most projects start as an idea that has to be tried out to see whether it works, so there are basically no professional design or PM resources. Such projects demand many skills from the SRE: understanding the product, knowing clearly which pain points it addresses (ideally ones you have experienced yourself), understanding design, and managing the development schedule. Such people are rare. In fact, SREs who can write the code for a medium-sized project are already rare. As a result, most internal projects are hard to use and overly complex.
Even with professional PM and design support, or even front-end resources, it is basically a disaster. I have been on such a team. Internal projects are not like consumer Internet products; they are more like toB products. UI design, interaction logic, operating procedures, delivery cycles and so on require knowledge of a different field. Otherwise, more people only means higher communication cost and a slower project.
Back to the complaint I often hear: that an SRE team isn't a "regular army" like a development team, where there are designers and PMs, everyone does their own job, and back-end developers just align on the API and implement it. Most fresh graduates have this fantasy, but it isn't so. The biggest misconception is about learning: learning depends mainly on yourself and has little to do with others. I suspect that in a large team, with many people working on one thing, there is less doubt and anxiety; people feel secure working that way and mistake it for "growth", whereas doing everything yourself brings more worry.
The fact is that in a large team you may learn more communication skills, such as aligning goals at different stages with different people, but to learn anything else you still have to rely on yourself. If you receive a design and simply implement it as given, you learn nothing. You need to understand why it was designed this way and not another way. If you do it yourself, the thinking goes much the same: how should I design this, which option should I choose. It is always: think, choose, try, experience, reflect...
Another misunderstanding worth clearing up: imitation is not learning. If you encounter a design on your team, memorize it, and reuse it for similar problems next time, that cannot be called learning. I've seen code written by an SRE from the payments business department that implemented an operations workflow in an internal system using order-business concepts such as orders and transactions, without even changing the model names. Picking up a hammer and looking for nails makes systems worse and more complex.
In short, division of labor does not automatically make the work more professional. One person wearing several hats can be very professional in each. What matters is to keep learning, work in the right way, and learn from excellent projects and excellent developers.
2. About dirty work
Every job has dirty, tiring work: you learn nothing from it and it is boring to do. It may be tidying up the system's monitoring, sorting out existing documents, cleaning up old operations scripts, or doing communication work with different teams.
It's unavoidable. If you can, learn to find a lazy way through each task, for example handling some of the work with scripts and working smarter.
But if the proportion of such work is too high, it's time to rethink the way you work. If you are stuck in a vicious circle, see whether you can change the tools and workflow. If not, consider changing jobs.
3. About taking the blame
A workplace where people shift blame onto each other is undoubtedly a very bad one. If people on the same team, or on different teams, have to scheme against each other, and the environment doesn't allow you to openly admit your own mistakes (SREs inevitably make some), then there is something wrong with the atmosphere the company has created.
For example, some companies have rules that a P1 incident means firing one Px-level employee, and a P0 incident means firing one Py-level employee. In that case the company is using a lazy method: trying to improve system stability by putting more pressure on people. I don't know whether it works, but certainly nobody works happily under such conditions. I'd suggest changing jobs.
4. How do I switch careers into SRE?
It's actually not as hard as people expect; after all, no university has a major called SRE. The knowledge SRE requires is still writing code, designing systems, understanding operating systems and networks, and so on. So study the undergraduate courses well, try building (and maintaining) some projects of your own, and by graduation you basically meet the requirements. People from other fields who want to switch can use university course content as a guide to fill in this knowledge.
Note that after a training course you may be able to do development and deliver business features, but that is not enough for SRE. An SRE not only needs to make things work, but also to know the principles behind them.
5. What will the interview ask?
I think the interview content is basically the same as for back-end development.
If the position you are applying for requires certain skills, such as K8s or monitoring systems, you may also be asked about those areas. These tool-type things can be learned, but if the employer wants someone experienced, or someone who can be productive from day one, your chances in the interview will be much smaller. Of course, there is no need to be discouraged: this is determined by supply and demand. If the other side insists on a candidate who meets specific requirements, their pool of choices is much smaller too. Don't regret not having learned some tool just because you missed this opportunity. That said, the more skills you have, the more choices you'll have.
Troubleshooting may be the biggest barrier to switching into SRE, since it takes some experience. If you have none, strengthen your operating system knowledge, so that unknown problems can still be investigated with known knowledge and tools.
This repository is a good collection of interview questions: https://github.com/bregman-arie/devops-exercises
6. Does an SRE need to be able to write code?
Yes, and the bar for writing code is no lower than for a professional back-end developer.
7. Big company or small company?
These are two very different working environments. A small company usually has a firefighting hero who has been there a long time, knows the deployment structure of every component, and knows everything. You grow quickly by learning from such people.
Large companies are divided into many specialties. Each item listed earlier in this article may be its own team in a large company, studying one field in depth.
So it depends on what you want to do. Personally I prefer reliable small companies, or a small reliable team in a large company.
8. How to judge whether a company is reliable?
For SRE positions, I've collected a few ways to judge. For example, check whether the ratio between the company's business and its SRE headcount is in a "normal" state. Is headcount growing linearly with the business (number of machines)? That is a bad sign. Are there too many SREs? If so, there are two likely reasons:
A leader is hiring for "unnecessary" posts in order to expand his influence. This leads to more people than work, so everyone starts doing strange things, inventing strange requirements, and finding ways to waste time while collecting a salary;
The company's foundations are too poor and most of the work requires manual operations, so there are basically as many people as machines. Either way, it's not a good thing.
Some companies with strong engineering don't have a huge SRE team, such as Instagram and Netflix (which may have many people by now), and some startups have no dedicated SRE at all. A first-class SRE must first be a developer, and a good developer is not far from being an SRE. Some familiar services with data at the scale of webarchive actually have only a few people behind them. A few years ago I interviewed at a domestic company that had datacenters all over the world and a well-developed business (it was already listed), yet its SRE team had only 10 people.
Another question I like to ask is what they think of AIOps. I worked on it for two years, and my conclusion is that it is basically a waste of time, something for fooling upper management. The unexplainability of AI fundamentally conflicts with operations work, which depends on cause and effect. So I like to ask interviewers what they think of this technology; the answer pretty much tells me whether the place is reliable. Of course, this is the aftereffect of my own career trauma and represents only my personal opinion.
That's all. It's all personal understanding and not necessarily correct. Writing this article feels a bit like lecturing, when in fact I have only worked for a few years, so treat the content as reference only.
Welcome to the comment area to discuss ~
Author: laixintao
Source: https://www.kawabangga.com/posts/4481