O&M, let go of monitoring and let go of yourself
2022-07-06 09:26:00 【Java domain】
After years of working with operations and maintenance (O&M) teams, I have found that O&M often ends up making monitoring ineffective ...
1. My monitoring story
I worked in O&M for a little over two years and then moved to developing the O&M platform. Along the way I watched the monitoring system become less and less useful, step by step.
1.1 Useful monitoring
When I was doing O&M and carrying on-call duty, I always felt the monitoring system worked well. That was not because I had put a lot of work into it, but because the business we operated was still a single application, so there was not much monitoring to add.
I remember the company was still using Nagios (I suspect not many newcomers have heard of it), and maintaining its monitoring was really laborious. Later I started studying Zabbix, whose biggest advantage is that it can discover targets and add monitoring automatically. After that I built an ELK stack to collect the business logs, and with that the monitoring came alive. Because not many alarms had been added, basically every alarm in those days had to be handled. The most common problem was Baidu crawling our data, and I had a tried-and-true process for it:
1. Look at the indicators: if the load of business xx is high, there is a 90% chance it is caused by crawlers.
2. Check the logs: look at the access records in Kibana and find the top-N IP ranges.
3. Block access: block those ranges with iptables.
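Steps 2 and 3 of that process can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling used at the time: the function names `top_networks` and `block_commands` are hypothetical, the log lines are assumed to start with the client IP (as in the common access-log layout), the sample addresses are made up, and the iptables rules are printed for review rather than executed.

```python
import ipaddress
from collections import Counter


def top_networks(log_lines, n=3, prefix=24):
    """Count requests per client network and return the n busiest ones."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue
        try:
            # Assumption: the client IP is the first field of each log line.
            net = ipaddress.ip_network(f"{fields[0]}/{prefix}", strict=False)
        except ValueError:
            continue  # skip lines that do not start with an IP address
        counts[str(net)] += 1
    return counts.most_common(n)


def block_commands(networks):
    """Render (but do not run) one iptables DROP rule per network."""
    return [f"iptables -I INPUT -s {net} -j DROP" for net, _ in networks]


# Hypothetical sample data using documentation-only address ranges.
sample = [
    '203.0.113.7 - - [06/Jul/2022:09:00:01 +0800] "GET /a HTTP/1.1" 200 512',
    '203.0.113.9 - - [06/Jul/2022:09:00:02 +0800] "GET /b HTTP/1.1" 200 512',
    '198.51.100.4 - - [06/Jul/2022:09:00:03 +0800] "GET /c HTTP/1.1" 200 512',
]
top = top_networks(sample, n=1)
print(top)
print(block_commands(top))
```

Printing the rules instead of applying them keeps a human in the loop, which matters because a busy IP range is not always a crawler.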
That is my only hands-on O&M monitoring experience. Because the business was simple and the monitoring was primitive, it felt to me that the alarms were useful.
1.2 Useless dashboards
1.2.1 Crazy automation
When I moved to O&M platform development, I found that O&M's demands on monitoring had changed too. As automation improved and the various open-source monitoring systems matured, O&M started desperately piling automation requirements onto the platform. For the monitoring system, that meant automatically binding all kinds of monitoring templates, alarm templates, and Grafana dashboards to each business.
The result is easy to imagine: because there were too many alarms, O&M simply muted the company's alarm messages. In most cases the business side found the problem first, and only then did O&M get involved in troubleshooting.
1.2.2 A nice but useless dashboard
Because so many metrics were being collected, O&M started building Grafana dashboards to expose them to the business side. But with too many metrics on one dashboard, the pages often froze. Business developers would look at dozens of metrics on a page without knowing which ones mattered, and in the end they still had to ask O&M.
To make it easier for developers to view logs, O&M also set up ELK to collect all kinds of logs and then handed Kibana over to the business developers. The result is again easy to imagine: apart from a few people who love to tinker, hardly anyone looked at the dashboards in Kibana.
I have always believed that O&M's intentions were good. Judging by the results, though, the only people excited about all this were O&M themselves, and even they seldom looked at their own dashboards.
1.3 No qualitative change
With the rise of the Google SRE concepts, O&M seemed to have found a last straw to clutch at; it is, after all, Google's O&M methodology. So O&M began working with R&D to define various SLOs and SLIs, kept enriching the alarm library based on the four golden signals (latency, traffic, errors, and saturation), and defined alarm levels such as P0, P1, and P2, trying to escape the dilemma.
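For readers less familiar with these terms, here is a minimal sketch of how an availability SLI and the error budget implied by an SLO relate to each other. The function names, the request counts, and the 99.9% target are assumptions chosen for illustration; they do not describe the indicators that were actually defined.

```python
def availability_sli(good_events, total_events):
    """SLI: the fraction of events that were good (1.0 if there was no traffic)."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def error_budget_remaining(sli, slo_target):
    """Fraction of the error budget still unspent; negative means the SLO is missed."""
    budget = 1.0 - slo_target  # the share of events allowed to fail
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    spent = 1.0 - sli  # the share of events that actually failed
    return 1.0 - spent / budget


# Hypothetical numbers: 50 failed requests out of 100,000 against a 99.9% SLO.
sli = availability_sli(good_events=99_950, total_events=100_000)
remaining = error_budget_remaining(sli, slo_target=0.999)
print(f"SLI={sli:.4f}, budget remaining={remaining:.0%}")
```

The arithmetic is the easy part. The numbers only mean something while the definition of a "good event" keeps up with the business.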
However, due to the microservicing of business architecture , And adopt the mode of agile development , In fact, the iteration speed of the business is very fast . Most of the sre I am not a developer , At the same time, the ratio is seriously insufficient ( R & D and operation and maintenance ratio ), It leads to the rapid failure of various indicators over time . The result is that the alarm is still useless , Every time you redo, you add another alarm , Of course, this alarm will hardly be triggered .
This is the monitoring story I experienced , What stories do you have ?
2. My prejudices about monitoring
While summarizing these failed monitoring experiences, I found two essential problems:
1. I kept trying to generalize from individual problems that had occurred in the past in order to predict common problems that might occur in the future, ignoring how complex the future changes in space and time would be.
2. We kept focusing on optimizing the traditional probe model (test with a script, check the result, and raise an alarm), graphical trend displays, and the alarm model, while continuously increasing the automation of the related processes.
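The traditional probe model mentioned above can be sketched as a small state machine: run a check, count consecutive failures, raise an alarm once a threshold is crossed, and report recovery when the check passes again. This is a hedged sketch only. The `Probe` class, its `max_failures` threshold, and the event strings are hypothetical, and the check results are simulated rather than coming from a real script.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Probe:
    name: str
    check: Callable[[], bool]  # returns True when the target is healthy
    max_failures: int = 3      # consecutive failures before alarming
    failures: int = 0
    alerting: bool = False

    def run_once(self) -> Optional[str]:
        """Run the check once and return "ALERT", "RECOVERED", or None."""
        try:
            ok = bool(self.check())
        except Exception:
            ok = False  # a crashing check counts as a failed check
        if ok:
            self.failures = 0
            if self.alerting:
                self.alerting = False
                return "RECOVERED"
            return None
        self.failures += 1
        if self.failures >= self.max_failures and not self.alerting:
            self.alerting = True
            return "ALERT"
        return None


# Simulated check results standing in for a real probe script.
results = iter([True, False, False, False, False, True])
probe = Probe("demo", check=lambda: next(results))
events = [probe.run_once() for _ in range(6)]
print(events)
```

Everything such a probe can say is decided in advance by whoever wrote the check, which is why the model struggles with problems nobody predicted.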
These two points only reflect my current understanding of monitoring. I do not know whether they are right or wrong, and there is no definitive answer. Below are some of my prejudices about how monitoring systems are currently built.
2.1 Artificial intelligence or human-computer interaction
Do O&M while sipping coffee
That line is what I remember best about a former colleague. Half a year after saying it, he started studying AIOps. Another half a year later he left, and nobody in the group mentioned AIOps again. What most O&M teams want most from AIOps is probably root cause analysis, but it stands like a mountain outside the gate of AIOps, and most O&M teams do not even have the courage to climb it.
There is one question I have never figured out:
If O&M people themselves may not be able to find the cause of a problem, why expect a machine to do it?
Compared with people, machines are better at analyzing massive amounts of data, while people are better at making decisions. So rather than AIOps, I think human-computer interaction may be more reliable:
Machines comprehensively analyze the massive data, and O&M people apply human judgment to the analysis results.
But this is not easy either, because today's SREs are so absorbed in development work that they are no longer in a position to do it. Making these decisions also requires a feel for the data.
2.2 Monitoring should focus on capability building
In the past, monitoring systems were usually split into vertical layers according to the architecture. It might look like this:
I think the main reason for this layering is the organizational structure (Conway's law) and its separation of duties. Under this layering, O&M is usually responsible only for the lower two layers. When handling an upper-layer problem, O&M may narrow it down to a specific URL and stop there; the rest is up to R&D.
If we want to get out of the current dilemma, I think we should abandon the old way of building systems along lines of responsibility, for example a basic monitoring system, a network monitoring system, and a business monitoring system. Instead we should build capabilities in stages around business value, for example basic data collection, transmission, analysis, storage, and display, and move toward a modern monitoring system that provides massive data collection, centralized rule computation, and unified analysis and alarm capabilities 【Google SRE】.
In the process of capability building, the platform team should target real needs, build the Thinnest Viable Platform (TVP), share best practices, and actively enable the users in its teams so that they gradually become excellent users. At the same time, avoid sharing methodologies that have not been put into practice. After all, everyone is busy.
2.3 Try to be effective
When handling a problem, you will find that the company has more monitoring systems than you knew about. O&M, R&D, DBA, Redis: each team has its own monitoring system and dashboards, and when something goes wrong, everyone looks at the monitoring built by their own team. To establish a unified view, companies that can afford it will free up people for something like unified monitoring: pull all kinds of data from the different systems, then aggregate, analyze, and store it in one place. In the end, unified monitoring brings new problems of its own, such as data freshness, accuracy, storage cost, and massive data processing, and the work will not be finished any time soon.
But does it really make sense? For basic data collection, analysis, and storage there are in fact many commercial solutions. Why would a small team of a few people with a pile of open-source software believe it can do better than a professional team of dozens? And the work is so far from the business that, apart from making your own KPIs look better, it may not change anything.
As I built more and more wheels, I slowly found myself becoming less and less effective, forever circling around basic problems. Usually, the more basic the problem, the more general the solution, and the lower the ROI of solving it yourself, so the less effective the work. Do not overemphasize how special your scenario is, unless you only want to produce some vanity metrics without solving the essential problem. So what is effective? I think the core is:
Focus on users and focus on the business. Abandon the old approach of solving common problems by generalizing from experience. Try to use data analysis with human-computer interaction to focus on the core business, and handle supporting and general business through AI and automation.
But that is hard. Fortunately, I do not work on monitoring ...
3. Outlook
Last year a hot topic related to monitoring emerged: observability. I do not have much hands-on experience with observability, but I ran into some questions when discussing it with friends. Here I mostly want to write down my own confusion:
1. What problem does observability solve?
Whenever observability comes up, I find that everyone agrees it can solve every problem, like a dragon-slaying sword that leaves nothing standing in its path. But if you ask in detail what they have actually done with observability, it feels like going back in time, back to the era of assorted dashboards and screens full of metrics. Do you have an observability story?
2. Data collection is in full bloom
Observability technology is developing very quickly, and there are more and more related open-source projects, but one thing about data collection surprised me. One day someone told me that you can collect profiling data in the production environment and use it for observability, to locate problems in business code. What surprised me was not the technical implementation but the questions behind it: what kind of business needs this level of observability, who are its users, and what problem is it meant to solve? Do you have the answers?
3. Old wine in new bottles
If you introduce observability to colleagues as being made up of three parts, metrics, logs, and tracing, it is easy to get dissed by veteran O&M engineers. They will tell you that we already have all of these, that they are just not easy to use and only need to be enriched, and that there is no new technology here, just old wine in new bottles. At that point I usually bring up the question raised in Google's earlier paper << Meaningful availability >>: how do you measure meaningful availability at the user level? I have no answer either; I just want to prompt some thinking about it. How do you understand this question?