The Era of Intelligent Monitoring: The Way of Building a Monitoring System
2022-06-24 05:42:00 【Tencent blue whale assistant】
The Way of Building a Monitoring System
Note: this is a long article and will take about 10 minutes to read.
This article covers the following:
1. What is monitoring - Preparation brings success, lack of it brings failure
2. The current state of monitoring construction - A journey of a thousand miles begins with a single step
3. The challenges of monitoring construction - I will search high and low
4. The practice of monitoring construction - What is learned on paper is never enough
5. Summary of monitoring construction - Lessons remembered from many battles
Before reading on, consider a question. Almost every IT company has its own operations monitoring system, and every operations team works on one, yet everyone seems to face the same problem: the monitoring system is hard to use and does not solve the real monitoring problems. Is there a better monitoring system? The answer is yes, and this article explains why.
1. What is monitoring - Preparation brings success, lack of it brings failure
As operations would put it: monitoring is a matter of vital importance to the business, the ground of life and death, the road to survival or ruin; it cannot be neglected [1].
Building monitoring is no less than preparing for a war. Both the users of monitoring and the people who build it face the same real challenge: whether monitoring gets used, and whether it is done well. We therefore need to understand monitoring thoroughly.
The Chinese word for monitoring (监控) explains itself: it combines "watch" (监) and "control" (控), with the emphasis on the first character, which carries the sense of observation and prevention. In computer operations, monitoring means sampling data about the state of a target in order to judge how it is running. We usually focus on the following kinds of monitoring data [2].
This article focuses on the last four kinds of monitoring. APM belongs to a specialized field and is not discussed in detail here. Given that users care about so many kinds of monitoring data, how do we build a monitoring system that meets business needs? Is there a monitoring system that can fully meet users' monitoring needs?
2. The current state of monitoring construction - A journey of a thousand miles begins with a single step
Building a monitoring platform is a long process and cannot be done overnight. Broadly, there are four approaches: first, build on an open source monitoring platform; second, build on a commercial platform; third, do secondary development on open source products; fourth, develop everything in-house. Each approach has its own characteristics and limitations.
1. Build the platform on open source monitoring software. Options include Nagios, Cacti, Ganglia, Zabbix, Graphite, Prometheus, and TIGK (Telegraf, InfluxDB, Grafana, Kapacitor). Usually you only need to deploy the software and then enrich the data it collects. The advantages are that it is open source, the community offers many solutions, and you can customize it.
2. Build the platform on commercial software. Commercial monitoring used to be dominated by foreign vendors such as IBM, HP, and Zhuohao. With the rise of domestic infrastructure vendors, some excellent domestic commercial monitoring products have appeared, such as Such wisdom, Easy to monitor, and OneAPM, along with software for specific monitoring scenarios. The advantage is that paying for the product gets you the corresponding monitoring service and spares you the trial and error of building monitoring yourself. This suits projects with complex monitoring scenarios, limited staff, and an urgent need for a monitoring solution.
3. Build the platform through secondary development of open source software. Open source software is functionally complete. If it provides APIs for querying data and for control, secondary development on top of Zabbix, Prometheus, Open-Falcon, and similar projects can deliver complete monitoring functions together with friendly management functions. The advantages are that collection sources can be extended on demand, integration is done on demand, and customization is free: where the existing software falls short, it can be adapted flexibly to the scenario. This approach is recommended when "paying for a service won't solve the problem".
4. Build the platform from scratch through in-house development. This is a huge undertaking and a serious challenge for both technology and project management. Why develop a monitoring system from scratch? There may be several reasons: (1) the software on the market cannot meet the business needs, because its functions are insufficient, its performance falls short, or its management capabilities cannot support the growth of the business and the organization; (2) its ecosystem cannot meet the needs of business development; (3) secondary development on open source software carries licensing risk and leaves you dependent on others; (4) the business needs it, management supports it, there are enough engineers, and the timing, circumstances, and people are all in place. The characteristics of this approach are a long development cycle, a possible gap between expectations and reality, and the open question of whether development can keep pace with the business. A moment of carelessness can let the project slip out of control, so what is tested here is project management and the ability to deliver.
Across these four approaches, the implementation cost runs from low to high and the difficulty from easy to hard. Which one to adopt depends on the actual situation. What does it mainly depend on? People, material resources, and money, and it is closely tied to the stage the company is in. A company that has just started pursues speed and low cost, so spending half a day to set up an open source monitoring system is a wise choice. As for which open source software to pick, choose the one you know best and the one with the most users. Once the company has taken shape and has many business demands, commercial monitoring software and secondary development on open source can be evaluated in detail against specific business needs. The rule is to weigh how much it costs against how much benefit it brings, and to consider whether the return is long-term or short-term.
When the company grows to a certain scale, its organizational structure and business requirements determine its software architecture requirements, and the monitoring system must be able to match them. At this point, building infrastructure is no longer only about business needs and product capabilities; it is closely tied to strategic planning. Developing everything in-house, buying a commercial product, or doing secondary development on a strong open source product are all viable directions. The choice depends on the organization's technical reserves and its ability to execute, and the right time, the right circumstances, and the right people are all indispensable.
As Sun Tzu said: in war, a thousand fast chariots, a thousand heavy chariots, and a hundred thousand armored soldiers, with provisions carried a thousand li, together with the expenses at home and at the front, the entertainment of guests, materials such as glue and lacquer, and the upkeep of chariots and armor, cost a thousand pieces of gold a day; only then can an army of a hundred thousand be raised [3]. Developing a monitoring system in-house is similar. It requires project, product, design, development, testing, operations, and business operations staff to take part together. After months of work a demo comes out, it is tested and verified, and only after iteration upon iteration can 100,000 servers be monitored. In-house development consumes people, material resources, and money. It is like preparing for a war: do not act rashly, and plan carefully before you move.
3. The challenges of monitoring construction - I will search high and low
In Section 2 we discussed how to choose a construction approach. Beyond the issues mentioned there, what problems come up during construction itself?
1. Key system metrics are missing
Building monitoring is a continuous process, and there is no one-size-fits-all solution. In the course of ongoing operations we keep enriching and improving what is monitored. Common system-level and application-level metrics are as follows.
As the figure above shows, the range of metrics that monitoring users need to collect is very broad. Whatever monitoring system is on the market, the default metrics it provides may not match the actual scenario. As the monitoring system keeps running and the business grows, the need to collect more metrics becomes more and more pressing, so we want the monitoring system to be freely extensible and flexibly customizable.
2. Extending functionality is difficult
While building out the monitoring system, we keep adding collected metrics according to actual needs. Whether the platform natively supports multiple collection methods can therefore limit what we are able to do. For example, to monitor network, storage, and other hardware devices we have to obtain metric data over SNMP, which should be a basic capability of the platform. If we had to implement it from scratch, it would amount to writing a small monitoring tool. The monitoring system should therefore be extensible, either by being open source or by exposing open interfaces, so that we can add modules and components as requirements dictate and keep up with the growth of the business.
As the set of metrics keeps expanding, the volume of metric data to be stored also keeps growing, which places very high demands on the QPS the monitoring system can sustain. Whether the system can handle highly concurrent requests directly determines whether it is usable. If a monitoring system breaks down every few days, its users will drift away, abandon it, and turn to a better solution.
Users may expect an SLA of 100% from the monitoring platform, while what can actually be achieved may be 99.9% (about 9 hours of downtime per year). As the business and the technology keep developing, users demand more and more of the monitoring system. Can the SLA keep improving?
| Period | 99% | 99.9% | 99.99% | 99.999% |
|---|---|---|---|---|
| Per day | 14 min 24 s | 1 min 26 s | 9 s | 1 s |
| Per week | 1 h 40 min 48 s | 10 min 5 s | 1 min | 6 s |
| Per month | 7 h 12 min | 43 min 12 s | 4 min 19 s | 26 s |
| Per year | 3 d 15 h 36 min | 8 h 45 min 36 s | 52 min 34 s | 5 min 15 s |
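The downtime figures in the table follow directly from the availability percentage. The sketch below shows the arithmetic; it assumes a 30-day month and a 365-day year, which is what the table's values correspond to, and the function and dictionary names are illustrative only.

```python
from datetime import timedelta

# Period lengths assumed here: 30-day month, 365-day year.
PERIODS = {
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "month": timedelta(days=30),
    "year": timedelta(days=365),
}


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Return the downtime budget for a period, rounded to whole seconds."""
    unavailable_fraction = 1 - availability_percent / 100
    seconds = round(period.total_seconds() * unavailable_fraction)
    return timedelta(seconds=seconds)


if __name__ == "__main__":
    for sla in (99, 99.9, 99.99, 99.999):
        budget = allowed_downtime(sla, PERIODS["year"])
        print(f"{sla}% per year -> {budget}")
```

Each additional nine divides the budget by ten, which is why moving from 99.9% to 99.99% shrinks the yearly allowance from hours to minutes.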
In practice, adding one more nine to the SLA is a major challenge for the system. Does the architecture support it? Is the design sound? Is there redundancy? Can the system scale both horizontally and vertically? Are there single points of failure? Are server resources sufficient? Does the system's concurrency scale linearly? All of these directly determine whether the SLA of the monitoring system can keep improving. Ideally the architecture is redundant: when a link fails it switches over automatically and a standby takes its place, and when capacity runs short it can be expanded simply by adding servers, with load balanced automatically.
When designing the monitoring system, therefore, be sure to draw on the high-concurrency, high-availability, and distributed design practices of Internet architecture. After that, whether you are adding a functional module or upgrading the whole system, the architecture guarantees that you can upgrade and expand on demand without worrying about availability, and upgrades, changes, and expansion go unnoticed by users.
3. System reliability is not guaranteed
When the number of monitored hosts reaches 5000 devices, or 1 million devices, an ordinary monitoring system runs into bottlenecks. As the system's QPS keeps growing, whether it can run stably 24 hours a day, 7 days a week, 365 days a year is a very big challenge. A monitoring system is always a highly concurrent system, and it is also a large database system. For example, data may grow by 5 TB, 10 TB, or 50 TB per day, with detailed historical data kept for 7 days, 30 days, or even 1 year, and trend data (archived history, for example hourly max, min, and avg values) kept for 1 year, 2 years, or longer. The monitoring system's data can then reach the petabyte level, and it is processed in the same way as in any massive big data system: collect -> clean -> analyze -> store -> use.
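The idea of keeping trend data as hourly max, min, and avg values can be illustrated with a small downsampling routine. This is only a sketch under simple assumptions (samples are `(unix_timestamp, value)` pairs held in memory; `rollup_hourly` is a hypothetical name), not how any particular monitoring product stores its data.

```python
from collections import defaultdict


def rollup_hourly(samples):
    """Aggregate (unix_timestamp_seconds, value) samples into hourly max/min/avg.

    Returns a dict keyed by the start of each hour (as a unix timestamp).
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        hour_start = timestamp - timestamp % 3600  # truncate to the hour
        buckets[hour_start].append(value)
    return {
        hour: {
            "max": max(values),
            "min": min(values),
            "avg": sum(values) / len(values),
        }
        for hour, values in sorted(buckets.items())
    }


if __name__ == "__main__":
    raw = [(0, 10.0), (60, 30.0), (3600, 5.0)]
    for hour, stats in rollup_hourly(raw).items():
        print(hour, stats)
```

Storing three numbers per hour instead of every raw sample is what makes multi-year retention of trend data affordable compared with detailed history.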
Delays in data reporting generally have three causes. The first is the collector: it fails to collect data on schedule, either because the raw data does not exist or because the collector has hit its performance ceiling. The second is the cleaning, processing, and analysis stage of the monitoring system being blocked, so that collection and reporting look normal but the data never reaches storage. The third is that processed data cannot be written to storage because the monitoring database itself has a problem: writes are slow, queries are slow, or the database has exceeded its limits. In all three cases the result for the user is the same: the system is unavailable.
No false alarms, no missed alarms, no delays: these are the basic requirements for a monitoring system. False alarms come from problems in data processing. They reduce users' trust in monitoring, and if they persist, users lose that trust entirely and gradually abandon the system. A missed alarm is an alarm that should have been sent but was not. This is even more serious: it directly affects normal use and sharply lowers users' expectations, like a delayed flight that cannot reach its destination on time. Delay means that an alarm generated now is not delivered until tomorrow, which shows that the monitoring system is effectively unavailable. If no alarm arrives while the fault is happening and the alarm only arrives after the fault is over, users will not trust the system at all. A monitoring system that cannot get even the basics of alerting right is not a qualified monitoring system, and users will treat it as noise.
Once data reporting and alerting problems are all solved and the system works properly, new problems appear. Users ask: can alerting be smarter? If the alarm module works normally but a user receives 1000 or even 10000 alarms a day, the user will go crazy. This is alarm bombardment: too many alarms turn into noise and interfere with normal judgment. Whether alarms can be converged therefore becomes a top priority.
Alarm convergence means merging multiple alarms that come from the same policy but concern different targets, and sending them together according to certain rules. Suppose a cloud region suffers a network failure that makes every device in the region unreachable. The ping-unreachable alarms would be sent one by one. If the region contains 1000 machines, is it better to send 1 alarm or 1000? Most people would want just 1 important alarm. Convergence greatly improves the precision of alerting and lets us plan our response calmly instead of being on edge every day. A daily cry of "wolf" has the effect of slowly boiling a frog: too many alarms make us lose sensitivity to them, and eventually we stop paying attention.
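The convergence rule described above (same policy, different targets, one notification) can be sketched as a grouping step. The `Alarm` fields, the fixed time window, and the `converge` function are assumptions made for illustration; real systems use richer rules.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    policy: str      # name of the alarm policy that fired (hypothetical field)
    target: str      # host or device the alarm concerns
    timestamp: int   # unix time in seconds


def converge(alarms, window_seconds=60):
    """Merge alarms of the same policy that fall into the same time window."""
    groups = defaultdict(list)
    for alarm in alarms:
        window_start = alarm.timestamp - alarm.timestamp % window_seconds
        groups[(alarm.policy, window_start)].append(alarm.target)
    return [
        {"policy": policy, "window_start": start, "targets": sorted(set(targets))}
        for (policy, start), targets in sorted(groups.items())
    ]


if __name__ == "__main__":
    # 1000 hosts in one region become unreachable within the same minute.
    storm = [Alarm("ping_unreachable", f"host-{i}", 120 + i % 60) for i in range(1000)]
    notifications = converge(storm)
    print(len(storm), "events ->", len(notifications), "notification(s)")
```

With this grouping the 1000 ping events from the example collapse into a single notification that lists the affected targets.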
Is alarm convergence enough to let us rest easy? No. We also need fault correlation and automatic fault analysis. Why? Imagine a rack loses power and all 15 devices in it go down, triggering a chain of failures such as API timeouts, failed HTTP dial tests, and a surge in DB connections. Can you find the root cause? Can the system give us one key alarm and help analyze the root cause automatically? This is where fault correlation and automatic fault analysis matter. The monitoring system must be able to correlate faults so that it gives us more accurate information for operations decisions.
In addition, the ability to analyze the performance and capacity of the current environment is another important capability, for example trend prediction: when should servers be added for the business, when should they be removed, is current performance sufficient, and is there room for optimization? Because the monitoring system holds the data, it can supply all of this as an important basis for decisions.
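As one very simple form of trend prediction, a straight line can be fitted to recent usage samples and extrapolated to the point where capacity runs out. The sketch below uses ordinary least squares on daily samples; the function name and the linear-growth assumption are illustrative, and real capacity planning usually needs seasonality and more robust models.

```python
def days_until_full(daily_usage_percent, capacity_percent=100.0):
    """Estimate days from the last sample until usage reaches capacity.

    Fits a least-squares line through one sample per day. Returns None when
    usage is flat or shrinking, because no exhaustion date can be projected.
    """
    n = len(daily_usage_percent)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage_percent) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage_percent))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_full = (capacity_percent - intercept) / slope
    return max(0.0, day_full - (n - 1))


if __name__ == "__main__":
    disk_usage = [50.0, 52.0, 54.0, 56.0, 58.0]  # percent, one sample per day
    print("days until full:", days_until_full(disk_usage))
```

The same projection run in reverse (usage well below capacity and not growing) is a hint that servers could be scaled in.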
4. Technology lags behind business development
As the business grows, the organizational structure is adjusted to fit it. Different people need different permissions, and more roles are required, such as super administrator, tiered administrators, general administrator, and ordinary user, down to fine-grained permission control at the level of menus and buttons. If the monitoring system did not take these requirements into account from the start, it will struggle to keep up with business growth. As the business grows, some routine work may be outsourced to reduce costs, and tiered permission control then becomes especially important. At this stage the monitoring system is no longer an isolated system. It must integrate with the unified login and authentication system and separate configuration from query in order to support the organization's development.
As the business grows, it also asks more of the monitoring system, for example "microsecond-level sampling of monitoring data" and "second-level alerting". The monitoring system is expected to provide 100% reliable information, so that the servers and containers of business applications can be scaled out or in according to the capacity metrics it provides. As an underlying system that others depend on, the monitoring system can then deliver more value.
4. The practice of monitoring construction - What is learned on paper is never enough
We now understand the problems that arise when building a monitoring system. How do we solve them? Let's look at the practice in detail [4].
1. Lower the barrier to use - Works out of the box
For servers, the Agent is installed by default once the system is initialized, and host monitoring data is collected with no configuration required. If process information is configured in the CMDB, process status data is collected as well.
The basic process metrics are as follows.
2. Say goodbye to alarm storms
When a host is onboarded, default alarm policies are added automatically, and an alarm is generated when a threshold is reached. Some of the default alarm policies are shown below.
To keep excessive alarms from disturbing users, we use four stratagems to make sure the alarms users receive are meaningful.
Stratagem 1 - Alarm convergence. In the events shown below, the convergence rules effectively reduce the number of alarm notifications: the ratio of notifications sent to events generated reaches 1:100 or better. Alarm convergence is supported natively by design and effectively prevents alarm storms.
Stratagem 2 - Alarm suppression. Because users' needs differ, several thresholds are often configured for the same alarm content. For disk space utilization, for example, 80% may be an early warning, 90% a warning, and 95% serious. How do these 3 policies work together? If disk space utilization has reached 96%, should 3 alarm notifications be produced? If the monitoring system sent 3 alarms, users would consider it dumb. For policies on the same dimension, therefore, we send only the alarm of the highest level triggered, which here is the notification at the serious level for 95%.
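The suppression rule in the disk space example can be sketched as "pick only the highest level whose threshold is met". The thresholds and level names come from the example above; the function name and data layout are assumptions for illustration.

```python
# Thresholds from the disk space example, ordered from most to least severe.
DISK_USAGE_LEVELS = [
    (95.0, "serious"),
    (90.0, "warning"),
    (80.0, "early warning"),
]


def highest_triggered_level(usage_percent, levels=DISK_USAGE_LEVELS):
    """Return only the most severe level whose threshold is reached, or None."""
    for threshold, level in levels:
        if usage_percent >= threshold:
            return level  # lower levels are suppressed
    return None


if __name__ == "__main__":
    for usage in (50.0, 85.0, 92.0, 96.0):
        print(usage, "->", highest_triggered_level(usage))
```

At 96% utilization all three thresholds are exceeded, yet only one notification, at the serious level, is produced.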
Stratagem 3 - Alarm summary. Even with alarm convergence and alarm suppression, a monitoring system still cannot stop a large number of alarms from being sent at the same moment. If several alarm rules are satisfied at once, an alarm flood can still occur, so an alarm summary function is genuinely necessary. When many alarms pour in at the same time, alarms on the same dimension but from different policies are summarized and sent as one combined alarm. With alarm summary in place, we can receive alarms with peace of mind.
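A minimal sketch of alarm summary, assuming the "dimension" is the target host and that alarms arriving within the same time window are combined. The tuple layout, the 5-minute window, and the message format are all hypothetical.

```python
from collections import defaultdict


def summarize(alarms, window_seconds=300):
    """Combine alarms for the same target from different policies.

    alarms: iterable of (timestamp, target, policy) tuples.
    Returns one summary line per target and time window.
    """
    grouped = defaultdict(set)
    for timestamp, target, policy in alarms:
        window_start = timestamp - timestamp % window_seconds
        grouped[(window_start, target)].add(policy)
    return [
        f"{target}: {len(policies)} alarm(s) ({', '.join(sorted(policies))})"
        for (_, target), policies in sorted(grouped.items())
    ]


if __name__ == "__main__":
    burst = [(10, "db-1", "cpu"), (20, "db-1", "disk"), (30, "web-1", "cpu")]
    for line in summarize(burst):
        print(line)
```

Where convergence merges many targets under one policy, summary merges many policies under one target, so the two steps complement each other.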
Stratagem 4 - Alarm analysis. By mining and analyzing historical alarm data we can also find abnormal alarms, which helps us examine the raw data and the alarm thresholds and provides better data support for alarm configuration and threshold-free alerting.
3. Easy function extension - Plug and play
4. User permission control - Authorization on demand
Monitoring functions are divided into viewing and management. In general there are the following user scenarios:
- Recipients of alarm notifications: operations, development, testing, product staff, and so on. They apply for view and shield (mute) operations.
- Owners of monitoring: for example, operations staff. They apply for view and management operations.
- Managers of the monitoring platform: global functions.
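The three user scenarios above map naturally onto a small role-to-permission table. The sketch below uses Python's `enum.Flag`; the role names and permission names are illustrative labels chosen here, not the platform's actual permission model.

```python
from enum import Flag, auto


class Permission(Flag):
    VIEW = auto()     # look at dashboards and alarms
    SHIELD = auto()   # mute (shield) alarms
    MANAGE = auto()   # create and edit collection and alarm policies
    GLOBAL = auto()   # platform-wide settings


# Hypothetical mapping of the three scenarios to permissions.
ROLE_PERMISSIONS = {
    "alarm_recipient": Permission.VIEW | Permission.SHIELD,
    "monitoring_owner": Permission.VIEW | Permission.MANAGE,
    "platform_manager": (
        Permission.VIEW | Permission.SHIELD | Permission.MANAGE | Permission.GLOBAL
    ),
}


def is_allowed(role: str, permission: Permission) -> bool:
    """Check whether a role holds a permission; unknown roles get nothing."""
    return permission in ROLE_PERMISSIONS.get(role, Permission(0))


if __name__ == "__main__":
    print(is_allowed("alarm_recipient", Permission.SHIELD))   # True
    print(is_allowed("alarm_recipient", Permission.MANAGE))   # False
```

Denying by default for unknown roles keeps outsourced or newly added accounts from gaining access before they are explicitly authorized.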
5. Automation as the foundation - Efficiency first
Unlike other monitoring systems, collection can be defined directly in the system, with no need to use a third-party control system or log in to servers to deploy anything. All configuration can be completed in the interface, including plug-in authoring: opening the page is all it takes to create a plug-in.
Collection targets are dynamic. When hosts are added to or removed from a module, the plug-in is distributed to the target machines automatically, and there is no need to deploy the collection plug-in manually.
Similarly, the target scope of an alarm policy is matched automatically. Newly added hosts or modules do not require the policy to be edited; the scope takes effect on its own.
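Dynamic target matching can be pictured as resolving a policy's scope against the CMDB each time it is evaluated, instead of storing a fixed host list. The data layout (module name mapped to a set of host addresses) and the function name are assumptions for illustration.

```python
def resolve_targets(policy_modules, cmdb_modules):
    """Return the hosts currently covered by a policy scoped to CMDB modules.

    policy_modules: module names the policy applies to.
    cmdb_modules: mapping of module name -> set of host addresses.
    """
    hosts = set()
    for module in policy_modules:
        hosts |= cmdb_modules.get(module, set())
    return hosts


if __name__ == "__main__":
    cmdb = {"web": {"10.0.0.1", "10.0.0.2"}, "db": {"10.0.1.1"}}
    policy_scope = ["web"]

    print(sorted(resolve_targets(policy_scope, cmdb)))

    # A new host joins the "web" module; the policy itself is not edited.
    cmdb["web"].add("10.0.0.3")
    print(sorted(resolve_targets(policy_scope, cmdb)))
```

Because the scope names modules and not hosts, adding or removing a host in the CMDB changes the policy's coverage without anyone touching the policy.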
5. Summary of monitoring construction - Lessons remembered from many battles
Through continuous practice and continuous review, the functions of a monitoring platform gradually mature. However, if building the monitoring platform is the only core goal, you will run into problems that are not about monitoring itself. How do you combine releases with monitoring, and how do you shield alarms during a release? How do you link monitoring with the CMDB? How do you connect the monitoring system with the operations automation system, and with the release pipeline? How do you connect monitoring with every part of operations? An independent monitoring system cannot solve these problems. They call for an extensible and customizable operations platform.
Looking back on building the monitoring system, we summarize the following lessons:
1. Set goals and work out what you need. Do not act blindly. Analyze business requirements carefully, then choose the matching monitoring approach, draw up clear business requirement plans and project plans, and decide whether you need a siloed monitoring system or an operations platform.
2. Prefer a mature and stable open source platform, for example Zabbix or Prometheus. If the monitoring scale is very large, performance and other usability problems may appear. In that case the author recommends a mature and stable platform such as Blue Whale (BlueKing) [5], which can address capacity and performance problems because it natively supports massive concurrency and scales horizontally: when capacity runs short, you only need to expand the corresponding service module.
3. As the business keeps developing and the technology keeps iterating, the monitoring system will also be optimized, reviewed, and updated iteratively. There is no one-time solution, only a balance that keeps shifting as things develop.
4. Approach the monitoring system from the perspective of a software designer. Facing monitoring requirements, look through the phenomenon to the essence in order to serve the usage scenarios better, and do not treat monitoring merely as a tool. Monitoring and operations are closely related, from producing data to consuming it and then to correlation analysis, all in service of the goals of operations: improving efficiency, quality, and capability.
Therefore, as operations would put it: monitoring is a matter of vital importance to the business, the ground of life and death, the road to survival or ruin; it cannot be neglected [1].
Download and try it
Visit the official Blue Whale (BlueKing) website (https://bk.tencent.com/download/) to download Community Edition 6.0 and try it out.
References
[1] Sun Tzu, The Art of War, "Laying Plans", opening sentence.
[2] https://en.wikipedia.org/wiki/Monitoring
[3] Sun Tzu, The Art of War, "Waging War".