
Understanding the SRE Core System

2022-07-05 06:45:00 【Dream of finding flowers~~】

Reference: Bilibili video material: https://www.bilibili.com/video/BV1ak4y1975Z?spm_id_from=333.1007.top_right_bar_window_custom_collection.content.click

The origin of SRE

https://sre.google/books/

What is SRE?

SRE stands for "Site Reliability Engineering"; it originated in 2003
A framework for reliably operating large-scale systems
It has software engineers design the operations function
Software engineers take responsibility for running the production system at the operations level
A set of broadly applicable best practices for building and operating highly reliable systems

What is SRE not?

SRE principles sound good, but they do not travel well: like an orange tree that changes with its soil, SRE can only grow in a specific culture and only makes sense at very large scale
SRE vs DevOps: the two are in conflict. Which is better, and which direction should I choose?
Traditional operations engineers and teams can simply be renamed SRE engineers / teams / departments

The five foundations of the SRE system

What is an SLO?

The service level objective (SLO) of a system defines what normal performance looks like for that system
It focuses on tracking the experience of the customers (people or machines) who use the system
If the customers are satisfied, the SLO is being met

How many 9s of uptime does the system need?

2 nines: 99%
3 nines: 99.9%
4 nines: 99.99%
5 nines: 99.999%
6 nines: 99.9999%
7 nines: 99.99999%
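
To make these targets concrete, the following is a minimal sketch that converts a number of nines into the downtime it permits per year. It assumes a 365-day year and that availability is measured purely as uptime over wall-clock time; the function names are illustrative and do not come from the original article.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def availability_for_nines(nines: int) -> float:
    """Return the availability fraction for a number of nines, e.g. 3 -> 0.999."""
    return 1 - 10 ** (-nines)


def allowed_downtime_minutes(nines: int, period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime (in minutes) the target tolerates over the period."""
    return period_minutes * 10 ** (-nines)


for n in range(2, 8):
    percent = availability_for_nines(n) * 100
    downtime = allowed_downtime_minutes(n)
    print(f"{n} nines ({percent:.{n - 2}f}%): about {downtime:.2f} minutes of downtime per year")
```

Each additional nine cuts the permitted downtime by a factor of ten, which is why the choice of target has such a large effect on cost and engineering effort.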

Online SLA uptime calculator:
https://www.xarg.org/tools/sla-uptime-calculator/

SLO level distribution

Converting SLI measurements into SLO percentiles

SLI measurements come from monitoring metrics whose units are inconsistent:

Network traffic in MB/s, disk writes in write/s, HTTP response time in ms, homepage load time in s, and so on

Measure the SLI values continuously and convert the collected values into values at different percentiles:

Over the last 10 minutes, SLI homepage load time: the P90 (90%) value is 259 ms
Over the last 10 minutes, SLI homepage load time: the P99 (99%) value is 589 ms
Over the last 10 minutes, SLI disk writes: the P90 (90%) value is 45 write/s
Over the last 10 minutes, SLI disk writes: the P99 (99%) value is 12 write/s
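
The conversion from raw samples to percentiles can be sketched as follows. This is a minimal illustration using the nearest-rank definition of a percentile, which is only one of several common definitions; the sample values are hypothetical and were chosen so that the results match the homepage figures quoted above.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Hypothetical homepage load times (ms) collected over a 10-minute window.
load_times_ms = [120, 135, 150, 160, 180, 200, 210, 240, 259, 589]

p90 = percentile(load_times_ms, 90)
p99 = percentile(load_times_ms, 99)
print(f"P90 = {p90} ms, P99 = {p99} ms")
```

Monitoring systems usually estimate percentiles from histograms rather than from sorted raw samples, so the values they report may differ slightly from this exact calculation.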

Reflection
With the SLI measurements distributed this way at P90 and P99, is the customer satisfied?

Error budget logic

Case study: a battle royale ("chicken dinner") game

Implementing SLO-oriented system monitoring

Collecting load balancer metrics

CloudWatch can provide the data collection
GitHub address: https://github.com/prometheus/cloudwatch_exporter
The SLIs are expressed in Prometheus notation; part of the sample code is as follows:
[Figure: Prometheus sample code]
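
As an illustration of what such SLIs compute, here is a minimal Python sketch that derives an availability SLI and a latency SLI from load balancer request counters. It is not the Prometheus code from the article; the `LoadBalancerWindow` type, its field names, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LoadBalancerWindow:
    """Hypothetical request counters for one measurement window."""
    total_requests: int
    server_errors: int   # responses with a 5xx status code
    fast_requests: int   # responses served within the latency threshold


def availability_sli(window: LoadBalancerWindow) -> float:
    """Fraction of requests that did not end in a server error."""
    if window.total_requests == 0:
        return 1.0  # assumption: a window with no traffic has no failures
    return (window.total_requests - window.server_errors) / window.total_requests


def latency_sli(window: LoadBalancerWindow) -> float:
    """Fraction of requests served within the latency threshold."""
    if window.total_requests == 0:
        return 1.0  # same assumption as above
    return window.fast_requests / window.total_requests


window = LoadBalancerWindow(total_requests=10_000, server_errors=25, fast_requests=9_700)
print(f"availability SLI: {availability_sli(window):.4f}")
print(f"latency SLI: {latency_sli(window):.4f}")
```

Both SLIs are ratios of good events to total events, which makes them directly comparable to an SLO target expressed as a percentage.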

Metric calculation

Use 4 weeks of data to calculate the initial SLO

Establish SLO documentation and a communication process

Establish a formal "SLO document" for the mobile game application system
– Gain sign-off from all stakeholders: product managers, developers, and operations staff
Establish an "error budget policy" document
– Consequence-oriented and authorized by management: SRE has the right to halt feature delivery and the right to hand operation of the system back to the development team
Build SLO monitoring dashboards, reports, and error budget burn-down charts
Keep refining the SLO targets and the monitoring approach
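
The arithmetic behind an error budget and its burn-down can be sketched as below. This is a minimal example under stated assumptions: the budget is counted in failed requests over a fixed window, and the rule that feature delivery halts once the budget is exhausted is a hypothetical policy. A real policy is whatever the stakeholders agree on in the document described above.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of failed requests the SLO tolerates in the window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    return (1 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent; negative when overspent."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        raise ValueError("no requests in the window, so there is no budget to spend")
    return 1 - failed_requests / budget


def should_halt_feature_delivery(remaining: float) -> bool:
    """Hypothetical policy rule: stop shipping features once the budget is used up."""
    return remaining <= 0


# Hypothetical window: 99.9% SLO, one million requests, 400 of them failed.
remaining = budget_remaining(0.999, 1_000_000, 400)
print(f"error budget remaining: {remaining:.0%}")
print(f"halt feature delivery: {should_halt_feature_delivery(remaining)}")
```

Plotting `remaining` over time gives the burn-down chart: a line that falls faster than the window elapses signals that the budget will run out before the period ends.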

Decisions based on SLOs and error budgets

SRE working principles

SRE requires designing and implementing consequence-oriented SLOs.
Any organization, even one that has not hired a single SRE, can design an error budget policy.
This means identifying and using any means that can keep customers from experiencing pain points.
You can start implementing right away: measure, take responsibility, act

SREs need time to optimize and improve.
Once SRE staff are in place, make sure they know that their job is not to keep enduring the pain of operations work, but to improve that work every day.
"Working smarter" may mean doing different things: it depends on which work items the SREs find most useful and valuable.

SREs need to be able to regulate their own workload.
The SRE team needs to be able to prioritize its work.
Maintaining each new system carries a labor cost.
The team must be able to curb unreliable work practices and push back on unreliable systems.
