Understanding the core SRE system
2022-07-05 06:45:00 【Dream of finding flowers~~】
Table of contents
- The origin of SRE
- What is SRE?
- What is not SRE?
- The five foundations of the SRE system
- What is an SLO?
- How many nines of uptime does the system need?
- SLO level distribution
- Converting SLI measurements to SLO percentiles
- Error budget logic
- Case study: a battle royale game
- Implementing SLO-oriented system monitoring
- Collecting load balancer metrics
- Metric calculation
- Computing the initial SLO from 4 weeks of data
- Establishing SLO documents and communication processes
- Making decisions based on SLOs and error budgets
- SRE working principles
Reference: Bilibili video material
https://www.bilibili.com/video/BV1ak4y1975Z?spm_id_from=333.1007.top_right_bar_window_custom_collection.content.click
The origin of SRE
What is SRE?
- SRE stands for "Site Reliability Engineering"; it originated in 2003
- A reliability framework for operating large-scale systems
- It has software engineers design the operations functions
- It is responsible for running production systems at the operations level
- A broadly applicable best practice for building and operating highly reliable systems
What is not SRE?
- SRE principles sound good, but they do not take root here: like an orange tree that bears different fruit in different soil, they can only grow in a specific culture and only make sense at very large scale
- SRE vs DevOps: the two are in conflict. Which one is better? Which direction should I choose?
- Traditional engineers and teams can simply be renamed SRE engineers / teams / departments
The five foundations of the SRE system
What is an SLO?
- The system's service level objective (SLO) defines what normal performance of the system looks like
- It focuses on tracking the experience of customers (people or machines)
- If customers are satisfied, the SLO is being met
How many nines of uptime does the system need?
- 2 nines: 99%
- 3 nines: 99.9%
- 4 nines: 99.99%
- 5 nines: 99.999%
- 6 nines: 99.9999%
- 7 nines: 99.99999%
Online SLA uptime calculator:
https://www.xarg.org/tools/sla-uptime-calculator/
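To make the list of nines concrete, the following minimal sketch converts a number of nines into the downtime it allows per year. The function names and the 365-day year are assumptions chosen for this illustration; they are not taken from the calculator linked above.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def availability(nines: int) -> float:
    """Availability fraction for a number of nines, e.g. 3 -> 0.999."""
    return 1 - 10 ** (-nines)


def allowed_downtime_seconds(nines: int, period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Maximum downtime within the period that still meets the target."""
    return period_seconds * 10 ** (-nines)


if __name__ == "__main__":
    for n in range(2, 8):
        print(f"{n} nines ({availability(n):.5%}): "
              f"{allowed_downtime_seconds(n):,.2f} seconds of downtime per year")
```

Each additional nine divides the allowed downtime by ten, which is why the cost of reaching the next level tends to rise quickly.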
SLO level distribution
Converting SLI measurements to SLO percentiles
SLI measurements come from monitoring metrics whose units are inconsistent:
- Network traffic in MB/s, disk writes in write/s, HTTP response time in ms, homepage load time in s, and so on
Measure the SLI values continuously and convert the collected values into values at different percentiles:
- Over the last 10 minutes, SLI homepage load time: the P90 (90%) mean is 259 ms
- Over the last 10 minutes, SLI homepage load time: the P99 (99%) mean is 589 ms
- Over the last 10 minutes, SLI disk writes: the P90 (90%) mean is 45 write/s
- Over the last 10 minutes, SLI disk writes: the P99 (99%) mean is 12 write/s
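The conversion from raw SLI samples to percentile values can be sketched as follows. This is a minimal illustration using the nearest-rank method, which is only one of several percentile definitions; the sample values are hypothetical and are not the measurements quoted above.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("at least one sample is required")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical homepage load times (ms) collected over a 10-minute window.
load_times_ms = [120, 140, 160, 180, 200, 220, 240, 260, 400, 600]

if __name__ == "__main__":
    print("P90:", percentile(load_times_ms, 90), "ms")
    print("P99:", percentile(load_times_ms, 99), "ms")
```

Monitoring systems often interpolate or estimate percentiles from histogram buckets instead, so values computed this way may differ slightly from what a dashboard reports.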
Reflection
With the SLI measurements distributed this way at P90 and P99, is the customer satisfied?
Error budget logic
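As a minimal sketch of the arithmetic behind an error budget: the budget is the share of events that may fail while the SLO is still met, that is, 1 minus the SLO target. The function names and the example numbers below are hypothetical and are used only for illustration.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Share of events that may fail while still meeting the SLO, e.g. 0.999 -> about 0.001."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def remaining_budget(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent; negative once the budget is overspent."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = error_budget_fraction(slo_target) * total_events
    return 1.0 - bad_events / allowed_bad


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests, 250 of them failed, SLO target 99.9%.
    print(f"{remaining_budget(0.999, 1_000_000, 250):.1%} of the error budget remains")
```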
Case study: a battle royale ("chicken dinner") game
Implementing SLO-oriented system monitoring
Collecting load balancer metrics
CloudWatch can provide the data collection
GitHub address: https://github.com/prometheus/cloudwatch_exporter
Express the SLI in the notation of the Prometheus monitoring tool; part of the sample code is as follows:
Metric calculation
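One way to turn load balancer data into SLI values is the ratio of good events to all events. The sketch below assumes two hypothetical inputs, a count of total requests with a count of failed requests, and a list of per-request latencies; it does not reproduce the author's metric definitions.

```python
def availability_sli(total_requests: int, error_requests: int) -> float:
    """Fraction of requests that did not fail (good events / all events)."""
    if total_requests < 0 or error_requests < 0 or error_requests > total_requests:
        raise ValueError("counts must satisfy 0 <= error_requests <= total_requests")
    if total_requests == 0:
        return 1.0  # assumption: a window with no traffic counts as meeting the SLI
    return (total_requests - error_requests) / total_requests


def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests answered within the latency threshold."""
    if not latencies_ms:
        return 1.0  # assumption: same treatment for an empty window
    fast = sum(1 for value in latencies_ms if value <= threshold_ms)
    return fast / len(latencies_ms)


if __name__ == "__main__":
    # Hypothetical ten-minute window of load balancer data.
    print(f"availability SLI: {availability_sli(20_000, 14):.4%}")
    print(f"latency SLI: {latency_sli([120, 180, 240, 900], threshold_ms=300):.2%}")
```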
Computing the initial SLO from 4 weeks of data
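A possible way to derive a starting target is to aggregate the SLI over the four weeks and round it down, so that the initial SLO is something the system has already achieved. This is a sketch of one common approach, not necessarily the method used in the original material; the weekly counts and the rounding precision are hypothetical.

```python
def initial_slo_percent(weekly_counts, decimals: int = 1) -> float:
    """Aggregate (good, total) pairs and round the achieved percentage down.

    Rounding down keeps the proposed target at or below what was measured.
    """
    good = sum(g for g, _ in weekly_counts)
    total = sum(t for _, t in weekly_counts)
    if total <= 0:
        raise ValueError("no events recorded in the window")
    scale = 10 ** decimals
    # Integer floor division avoids floating-point surprises when rounding down.
    return (good * 100 * scale // total) / scale


# Hypothetical (good_requests, total_requests) for each of the last 4 weeks.
four_weeks = [
    (9_990_500, 10_000_000),
    (9_987_000, 10_000_000),
    (9_992_000, 10_000_000),
    (9_989_500, 10_000_000),
]

if __name__ == "__main__":
    print(f"proposed initial SLO: {initial_slo_percent(four_weeks)}%")
```

The proposed value is only a starting point; the target would still need to be reviewed against what customers actually experience.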
Establishing SLO documents and communication processes
- Establish a formal "SLO document" for the mobile game application system
– Get sign-off from all stakeholders: product managers, developers, operations staff
- Establish an "Error budget policy" document
– Consequence-oriented: with authorization from management, SRE has the right to halt feature delivery and the right to hand operation of the system back to the development team
- Build SLO monitoring dashboards, reports, and error budget burn-down charts
- Keep refining the SLO targets and keep improving the monitoring approach
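The consequence-oriented policy in the list above could be encoded as a simple decision rule. The sketch below is illustrative only: the action names and the 25% warning threshold are hypothetical, and a real policy would be whatever the stakeholders have agreed on in the error budget policy document.

```python
from enum import Enum


class Action(Enum):
    SHIP_FEATURES = "continue feature delivery"
    SLOW_DOWN = "reduce release pace and prioritize reliability work"
    FREEZE_FEATURES = "halt feature delivery until the budget recovers"


def policy_action(budget_remaining: float, warn_threshold: float = 0.25) -> Action:
    """Map the unspent share of the error budget (1.0 = untouched) to an action."""
    if budget_remaining <= 0:
        return Action.FREEZE_FEATURES
    if budget_remaining < warn_threshold:
        return Action.SLOW_DOWN
    return Action.SHIP_FEATURES


if __name__ == "__main__":
    for remaining in (0.80, 0.10, -0.05):
        print(f"{remaining:+.0%} budget remaining: {policy_action(remaining).value}")
```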
Making decisions based on SLOs and error budgets
SRE working principles
SRE needs to design and implement consequence-oriented SLOs.
Any organization, even one that has not hired a single SRE, can design an error budget policy.
This means identifying and using any means that can keep customers from experiencing pain points.
You can start implementing right away: measure, take responsibility, act.
SRE needs time to optimize and improve.
Once SRE staff are in place, make sure they know that their job is not to keep enduring operational toil, but to improve the operations work every day.
"Working smarter" may mean doing different things: it depends on which work items the SREs find most useful and valuable.
SRE needs to be able to regulate its own workload.
The SRE team needs to be able to prioritize its work.
Maintaining each new system carries a labor cost.
The team must be able to curb unreliable work practices and push back on unreliable systems.