
Intelligent monitoring era - the way of monitoring construction

2022-06-24 05:42:00 Tencent blue whale assistant

The way of monitoring construction

Note: this article is long and will take about 10 minutes to read.

This article covers the following:

1. What is monitoring - Preparation brings success, neglect brings failure

2. The current state of monitoring construction - A journey of a thousand miles begins with a single step

3. The challenges of monitoring construction - Searching high and low

4. The practice of monitoring construction - What is learned on paper is always shallow

5. Summary of monitoring construction - Battle-tested, looking back on the past

Before reading on, consider one question. Almost every IT company has its own operations monitoring system, and every operations team is busy building one. Yet every company seems to face the same problem: the monitoring system is hard to use and does not solve real monitoring problems. Is there a better way to build one? The answer is yes, and this article explains how.

1. What is monitoring - Preparation brings success, neglect brings failure

As operations engineers say: monitoring is a vital matter for the business, a matter of life and death, the road to survival or ruin. It cannot be neglected [1].

Building a monitoring system is no less demanding than preparing for a war. Both the people who use monitoring and the people who build it face the same practical challenge: is the system usable, and is it easy to use? We therefore need to understand monitoring thoroughly.

Monitoring, as the name suggests, combines two ideas: watching and controlling. The emphasis falls on the first, which means to observe and to prevent. In computer operations, it refers specifically to sampling data about the state of a target in order to judge how that target is running. We usually focus on the following kinds of monitoring data [2].

In this article we focus on the last four kinds of monitoring. Because APM belongs to a specialized domain, we will not discuss it in detail here. Since users care about so much monitoring data, how do we build a monitoring system that meets business needs? Is there a monitoring system that can fully meet users' monitoring needs?

2. The current state of monitoring construction - A journey of a thousand miles begins with a single step

Building a monitoring platform is a long process that cannot be finished overnight. Broadly speaking, there are four approaches: first, build on an open source monitoring platform; second, build on commercial software; third, do secondary development on top of open source products; fourth, develop everything in-house. Each approach has its own strengths and limitations.

1. Build the monitoring platform on open source monitoring software. Options include Nagios, Cacti, Ganglia, Zabbix, Graphite, Prometheus, and TIGK (Telegraf, InfluxDB, Grafana, Kapacitor). Usually you only need to deploy the software and then enrich the data it collects. The advantages are that the software is open source, the community offers many ready-made solutions, and you can customize it.

2. Build the monitoring platform on commercial software. Commercial monitoring used to be dominated by foreign companies such as IBM, HP, and Zhuohao. With the rise of local infrastructure vendors, however, a number of excellent domestic commercial products have appeared, such as Cloudwise (云智慧), Jiankongyi (监控易), and OneAPM, along with software for specialized monitoring scenarios. The defining feature of this approach is that spending money is enough to obtain the monitoring service, which avoids repeated trial and error. It suits projects with complex monitoring scenarios, limited manpower, and an urgent need for a monitoring solution.

3. Build the monitoring platform through secondary development of open source software. Mature open source software is feature-complete. If it also exposes an API for querying data and controlling the system, then secondary development on top of Zabbix, Prometheus, Open-falcon, and similar projects can deliver complete monitoring functions together with friendly management functions. The advantage is that you can add collection sources, integrate with other systems, and customize freely on demand. When the functions of existing software fall short, you can tailor it to the scenario. This approach is recommended when "buying a service will not solve the problem".

4. Build the monitoring platform from scratch through independent research and development. This is a huge undertaking and a serious challenge for both technology and project management. There are several possible reasons to develop a monitoring system from scratch: (1) the monitoring software on the market cannot meet business needs, because its functions are insufficient, its performance falls short, or its management model cannot support the growth of the business and the organizational structure; (2) its ecosystem cannot keep up with business development; (3) secondary development on open source software carries licensing risks and leaves you dependent on others; (4) the business needs it, management supports it, and enough technical staff are available, so the timing, conditions, and people are all in place. This approach has a long development cycle. Expectations may diverge from reality, and development may fail to keep pace with the business. A moment of carelessness can send the software project out of control, so this approach tests both project management and delivery capability.

The four approaches above run from low to high in implementation cost and from easy to hard. Which one to adopt depends on the actual situation, mainly on manpower, material resources, and budget, and it is closely tied to the stage the company is in. A company that has just started pursues speed and low cost, so spending half a day to stand up an open source monitoring system is a wise choice. As for which open source software to pick, choose the one you know best and the one with the most users. Once the company has begun to take shape and business demands multiply, commercial monitoring software and secondary development on open source can both be evaluated against specific business needs. The rule is to weigh how much it costs against how much you gain, and to consider whether the return is long-term or short-term.

When the company grows to a certain scale, its organizational structure and business requirements determine the software architecture it needs, and the monitoring system must be able to support that architecture. At this point, building infrastructure is no longer just a question of business needs and product capabilities; it is closely tied to strategic planning. Developing everything in-house, buying a commercial product, and doing secondary development on a strong open source product are all viable directions. The choice depends on the organization's technical reserves and its ability to execute. Timing, conditions, and people are all indispensable.

As Sun Tzu said: "In the operations of war, where there are a thousand swift chariots, a thousand heavy chariots, and a hundred thousand armored soldiers, with provisions carried a thousand li, the expenditure at home and at the front, including the entertainment of guests, materials such as glue and lacquer, and sums spent on chariots and armor, will reach a thousand pieces of gold a day. Only then can an army of a hundred thousand be raised." [3] Developing a monitoring system independently is similar. It requires project management, product, design, development, testing, operations, and business staff to work together. After several months a demo appears and is tested and verified, and only after iteration upon iteration can 100,000 servers be monitored. Independent development consumes manpower, materials, and money, just like preparing for a war. Do not act rashly; plan carefully before you act.

3. The challenges of monitoring construction - Searching high and low

In Section 2 we discussed how to choose an approach to monitoring construction. Apart from the problems mentioned there, what problems arise during construction itself?

1. Key system metrics are missing

Building monitoring is always a continuous process; there is no one-size-fits-all solution. Through ongoing operations we keep enriching and improving what is monitored. Common system-level and application-level metrics are as follows.

As the figure above shows, the range of metrics that monitoring users want to collect is very broad. Whichever monitoring system you pick from the market, the default metrics it provides may not meet the needs of real scenarios. As the monitoring system keeps running and the business grows, the need to collect more metrics becomes increasingly urgent. We therefore want the monitoring system to support free extension and flexible customization.

2. There are many difficulties in function expansion

As we keep building the monitoring system, we keep refining the collected metrics according to actual needs. Whether the monitoring platform natively supports multiple collection methods can therefore limit what we are able to do. For example, to monitor network devices, storage, and other hardware, we have to use the SNMP protocol to obtain metric data, so SNMP support should be a basic capability of the platform. If we had to implement it from scratch, it would amount to writing a small monitoring program. The monitoring system should therefore be extensible, either through open source code or through open interfaces, so that we can extend modules and components as requirements dictate and keep up with the growth of the business.

As the number of metrics keeps expanding, the volume of metric data to be stored grows as well, which places very high demands on the QPS of the monitoring system. Whether the monitoring system can handle highly concurrent requests directly determines whether it is usable. If a monitoring system fails every few days, users will drift away, abandon it, and turn to better solutions.

Users may expect an SLA of 100% from the monitoring platform, while the SLA actually achievable may be 99.9% (about 9 hours of downtime per year). As business and technology develop, users demand more and more of the monitoring system. Can the SLA keep improving? The table below lists the maximum downtime that each SLA level allows per period.

Period    99%              99.9%            99.99%       99.999%
Daily     14 min 24 s      1 min 26 s       9 s          1 s
Weekly    1 h 40 min 48 s  10 min 5 s       1 min        6 s
Monthly   7 h 12 min       43 min 12 s      4 min 19 s   26 s
Yearly    3 d 15 h 36 min  8 h 45 min 36 s  52 min 34 s  5 min 15 s
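The downtime figures above follow from simple arithmetic: allowed downtime equals the period length multiplied by (1 - SLA). The sketch below reproduces them in Python. It assumes a 30-day month and a 365-day year, which is what the listed values imply; the names `PERIODS` and `allowed_downtime` are illustrative and not part of any monitoring product.

```python
from datetime import timedelta

# Period lengths. A month is taken as 30 days and a year as 365 days
# (an assumption inferred from the downtime figures listed above).
PERIODS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
    "yearly": timedelta(days=365),
}


def allowed_downtime(sla_percent: float, period: str) -> timedelta:
    """Return the maximum downtime an SLA permits within one period."""
    unavailable_fraction = 1 - sla_percent / 100
    return PERIODS[period] * unavailable_fraction


if __name__ == "__main__":
    for sla in (99, 99.9, 99.99, 99.999):
        row = ", ".join(f"{name}: {allowed_downtime(sla, name)}" for name in PERIODS)
        print(f"{sla}% -> {row}")
```

Each additional 9 divides the downtime budget by ten, which is why the step from 99.9% to 99.99% leaves less than an hour per year for every failure, upgrade, and failover combined.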

In practice, adding one more 9 to the SLA is a major challenge for the system. Does the architecture support it? Is the architecture design sound and redundant? Does it support horizontal and vertical scaling? Is it free of single points of failure? Are server resources sufficient? Does the system's concurrency scale linearly? These factors directly determine whether the SLA we offer can keep improving. Ideally, the architecture is redundant: when a link fails, traffic switches automatically and a standby takes over; when capacity runs short, adding servers expands capacity, and load is balanced automatically.

Therefore, when designing a monitoring system, be sure to study how Internet architectures achieve high concurrency, high availability, and distributed design. With a sound architecture in place, adding functional modules or upgrading the whole system can be done on demand without worrying about availability, and upgrades, changes, and expansions go unnoticed by users.

3. System reliability is not guaranteed

When the number of monitored hosts reaches 5,000 devices or 1 million devices, an ordinary monitoring system hits bottlenecks. As the system's QPS keeps growing, staying stable 24 hours a day, 7 days a week, 365 days a year is a very big challenge. A monitoring system is inherently a highly concurrent system, and it is also a large database system. For example, data may grow by 5 TB, 10 TB, or 50 TB per day. Detailed historical data may need to be kept for 7 days, 30 days, or even 1 year, while trend data (archived history, such as hourly max, min, and avg values) may need to be kept for 1 year, 2 years, or longer. The monitoring system's data can then reach petabyte scale, and it is processed the same way as in any massive big data system: collect -> clean -> analyze -> store -> use.
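To make the idea of trend data concrete, here is a minimal sketch that rolls raw samples up into hourly max, min, and avg values. It is an in-memory illustration only: the function name `hourly_rollup` and the `(timestamp, value)` sample format are assumptions, and a real system would perform this step inside its storage or stream-processing layer.

```python
from collections import defaultdict
from datetime import datetime


def hourly_rollup(samples):
    """Aggregate (timestamp, value) samples into per-hour max/min/avg.

    `samples` is an iterable of (datetime, float) pairs (a hypothetical format).
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Truncate the timestamp to the start of its hour.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {
        hour: {
            "max": max(values),
            "min": min(values),
            "avg": sum(values) / len(values),
        }
        for hour, values in sorted(buckets.items())
    }


if __name__ == "__main__":
    raw = [
        (datetime(2022, 6, 24, 10, 5), 10.0),
        (datetime(2022, 6, 24, 10, 35), 30.0),
        (datetime(2022, 6, 24, 11, 0), 50.0),
    ]
    for hour, stats in hourly_rollup(raw).items():
        print(hour, stats)
```

Keeping only three numbers per hour instead of every raw sample is what makes retention of one or two years affordable, at the price of losing detail inside each hour.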

Delayed data reporting generally has three causes. The first is a collector problem: data cannot be collected on the scheduled period, either because the raw data does not exist or because the collector has reached its performance limit. The second is a blockage in the monitoring system's cleaning, processing, and analysis stage: collection and reporting look normal, but the data never reaches storage. The third is that processed data cannot be stored normally because the monitoring database has a problem, such as slow writes, slow queries, or a database pushed past its limits. In all three cases the result for users is the same: the system is unavailable.

No false alarms, no missed alarms, and no delays: these are the basic requirements for a monitoring system. False alarms stem from problems in data processing and reduce users' trust in monitoring. If they persist, users lose that trust altogether and gradually abandon the system. A missed alarm is one that should have been sent but was not. This is even more serious, because it directly disrupts normal use and badly undercuts users' expectations, like a delayed flight that cannot reach its destination on time. A delay means that an alarm generated now is not delivered until tomorrow, which indicates that the monitoring system is unavailable. If no alarm arrives while the fault is happening and the alarm is sent only after the fault is over, users will not trust the monitoring system at all. A monitoring system that cannot get basic alarming right is not a qualified monitoring system, and users will treat it as noise.

Once the data reporting and alarming problems are all solved and the system works properly, new problems appear. Users ask: can the alarms be smarter? Suppose the alarm module works normally and users receive 1,000 or even 10,000 alarms every day. Users will be driven to distraction. This is an "alarm bomber": so many alarms become noise and interfere with normal judgment. Whether alarms can converge therefore becomes a top priority.

Alarm convergence means that alarms from the same policy but with different targets are merged and sent together according to certain rules. Suppose a network failure in one cloud area makes every device in that area unreachable, so that ping-unreachable alarms are sent one after another. If the area contains 1,000 machines, is it better to send 1 alarm or 1,000? Most people would be content to receive a single important alarm. Alarm convergence greatly improves the accuracy of alarms and lets us plan a response calmly. Without it, every day feels like the boy who cried wolf: the flood of alarms works like slowly boiling a frog, and we gradually lose sensitivity to alarms and stop paying attention to them.
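The convergence idea can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm of any particular product: it merges alarms that share a policy and an area within a fixed time window. The `Alarm` fields, the 300-second window, and the function name `converge` are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    policy: str       # e.g. "ping_unreachable"
    area: str         # grouping dimension, e.g. a cloud area (hypothetical field)
    host: str
    timestamp: float  # seconds since some epoch


def converge(alarms, window_seconds=300):
    """Merge alarms with the same policy and area inside one time window.

    Returns one notification per (policy, area, window) group.
    """
    groups = defaultdict(list)
    for alarm in alarms:
        window = int(alarm.timestamp // window_seconds)
        groups[(alarm.policy, alarm.area, window)].append(alarm)

    notifications = []
    for (policy, area, _window), members in groups.items():
        hosts = sorted({a.host for a in members})
        notifications.append(
            {"policy": policy, "area": area, "host_count": len(hosts), "hosts": hosts}
        )
    return notifications


if __name__ == "__main__":
    storm = [
        Alarm("ping_unreachable", "cloud-area-1", f"host-{i}", 100.0 + i * 0.1)
        for i in range(1000)
    ]
    merged = converge(storm)
    print(len(storm), "alarms ->", len(merged), "notification(s)")
```

Fixed windows are the simplest choice but can split one incident across a window boundary; a sliding or session-style window would avoid that at the cost of more state.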

With alarm convergence in place, can we relax? No. We also need fault correlation and automatic fault analysis. Why? Suppose a rack loses power and 15 devices go down, triggering a chain of failures such as API timeouts, failed HTTP dial tests, and a surge in DB connections. Can you find the root cause? Can the system send one important alarm that analyzes the root cause for us automatically? This is where fault correlation and automatic fault analysis matter. A monitoring system must be able to correlate faults so that it provides more accurate information for operations decisions.
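One simple way to approach fault correlation is to use a dependency map: an alarming component is a root-cause candidate only if nothing it depends on is also alarming. The sketch below illustrates that heuristic under stated assumptions; the `depends_on` mapping, the component names, and the function `root_causes` are hypothetical, and production systems typically combine topology with timing and statistical evidence.

```python
def root_causes(alarming, depends_on):
    """Return alarming nodes that have no alarming upstream dependency.

    `alarming` is a set of node names currently in alarm.
    `depends_on` maps a node to the nodes it directly depends on.
    """

    def has_alarming_upstream(node, seen):
        for parent in depends_on.get(node, ()):
            if parent in seen:
                continue  # already visited; also guards against cycles
            seen.add(parent)
            if parent in alarming or has_alarming_upstream(parent, seen):
                return True
        return False

    return sorted(n for n in alarming if not has_alarming_upstream(n, set()))


if __name__ == "__main__":
    # Hypothetical topology: services run on hosts, hosts sit in a rack.
    topology = {
        "http-check": ["api"],
        "api": ["host-1"],
        "db": ["host-2"],
        "host-1": ["rack-7"],
        "host-2": ["rack-7"],
    }
    current = {"rack-7", "host-1", "host-2", "api", "db", "http-check"}
    print(root_causes(current, topology))
```

With the rack power failure from the example, six alarms collapse to a single candidate, the rack itself, which is the one notification an on-call engineer needs first.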

In addition, the ability to analyze the performance and capacity of the current environment is also important, for example through trend prediction. When should servers be added for the business? When should they be removed? Is current performance sufficient? Is there room for optimization? Because the monitoring system holds the data, it can supply this information as an important basis for decisions.
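As a minimal illustration of trend prediction, the sketch below fits a straight line to daily usage samples with ordinary least squares and estimates how many days remain until a threshold is crossed. A linear trend is an assumption chosen for simplicity; real capacity data often has seasonality that this ignores. The function name `days_until_threshold` is illustrative.

```python
def days_until_threshold(daily_usage, threshold):
    """Estimate days from the last sample until usage reaches `threshold`.

    Fits y = slope * day + intercept by least squares over the samples.
    Returns None when usage is flat or falling (no crossing predicted).
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")

    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(daily_usage))
    slope = sxy / sxx
    if slope <= 0:
        return None

    intercept = mean_y - slope * mean_x
    crossing_day = (threshold - intercept) / slope
    # Measure from the most recent sample; never report a negative number.
    return max(0.0, crossing_day - (n - 1))


if __name__ == "__main__":
    disk_percent = [50, 52, 54, 56, 58]  # one sample per day
    print(days_until_threshold(disk_percent, threshold=90))
```

An estimate like this can feed directly into an expansion decision: if the predicted crossing is closer than the time needed to provision servers, it is time to act.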

4. Technology lags behind business development

As the business keeps growing, the organizational structure is adjusted to fit it. Different people need different permissions, and more roles are needed, such as super administrator, tiered administrator, general administrator, and ordinary user, along with finer-grained permission control down to the level of menus and buttons. If the monitoring system does not take these requirements into account from the start, it will struggle to keep up with business growth. As the business grows, some routine work may be outsourced to cut costs, and at that point tiered permission control becomes especially important. The monitoring system is no longer an isolated system; it must integrate with the unified login and authentication system and separate configuration from querying in order to support the organization's development.

As the business keeps growing, it also places higher demands on the monitoring system, such as "microsecond-level data sampling" and "second-level alarms". The monitoring system is expected to provide 100% reliable information, so that the servers and containers behind business applications can be scaled out or in according to the capacity metrics it provides. As an underlying system that others depend on, the monitoring system then delivers more value.

4. The practice of monitoring construction - What is learned on paper is always shallow

Having understood the problems above, how do we solve them when building a monitoring system? Let us look at the specific practice in detail [4].

1. Lower the barrier to entry - Works out of the box

On a server, the Agent is installed by default after system initialization, so host monitoring data is collected with no configuration. If process information is configured in the CMDB, process status data is collected as well.

The basic process metrics are as follows.

2. Say goodbye to alarm storms

When a host is connected, default alarm policies are added automatically, and an alarm is generated when a threshold is reached. Some of the default alarm policies are shown below.

To keep an excess of alarms from disturbing users, we rely on four tricks that ensure the alarms users receive are meaningful.

Trick one - Alarm convergence. The events shown below illustrate that convergence rules effectively reduce the number of alarm notifications: the ratio of notifications sent to events generated can reach 1:100 or better. Because alarm convergence is supported natively in the design, alarm storms are effectively prevented.

Trick two - Alarm suppression. Because users' needs differ, the same alarm content is often configured with several thresholds. For disk space utilization, for example, 80% might be an early warning, 90% a warning, and 95% critical. How do these 3 policies work together? If disk utilization has reached 96%, should 3 alarm notifications be produced? If the monitoring system did generate 3 alarms, users would consider it unintelligent. Therefore, for policies on the same dimension, we send only the alarm of the highest level reached, in this case only the critical notification for the 95% threshold.

Trick three - Alarm summary. Even with alarm convergence and alarm suppression, a large number of alarms can still be sent at the same moment. If many alarm rules are satisfied at once, an alarm flood can still occur. An alarm summary function is therefore necessary: when many alarms pour in at the same time, alarms on the same dimension but from different policies are summarized and sent as one combined alarm. With alarm summary in place, we can receive alarms with peace of mind.

Trick four - Alarm analysis. By mining and analyzing historical alarm data, we can also spot abnormal alarms and better examine the raw data and alarm thresholds, which provides better data support for alarm configuration and for threshold-free alarming.
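The suppression rule from the disk utilization example (thresholds at 80%, 90%, and 95%) can be sketched as follows. The level names and the function `effective_alarm` are illustrative assumptions; only the three thresholds come from the example above.

```python
# Thresholds from the disk utilization example, ordered from most to least severe.
# The level names are illustrative.
LEVELS = [
    ("critical", 95),
    ("warning", 90),
    ("early_warning", 80),
]


def effective_alarm(disk_usage_percent, levels=LEVELS):
    """Return only the most severe level whose threshold is reached, or None."""
    for name, threshold in levels:
        if disk_usage_percent >= threshold:
            return name
    return None


if __name__ == "__main__":
    for usage in (50, 85, 92, 96):
        print(usage, "->", effective_alarm(usage))
```

Because the levels are checked from most to least severe and the function returns at the first match, a value of 96% yields one critical notification rather than three.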

3. Easy function expansion - Plug and play

4. User permission control - Authorization on demand

Based on the functions monitoring provides, permissions are divided into viewing and management. In general there are the following user scenarios:

  • Recipients of alarm notifications: operations, development, testing, product staff, and so on. They apply for permission to view and to mute alarms.
  • Monitoring operators: for example, operations staff. They apply for view plus management permissions.
  • Administrators of the monitoring platform: global functions.
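The three user scenarios above map naturally onto a small role-to-permission table. The sketch below is only an illustration of that mapping: the role names, action names, and the helper `is_allowed` are hypothetical, and a real deployment would delegate this check to the unified authentication and permission system.

```python
# Hypothetical roles and actions derived from the three scenarios above.
ROLE_PERMISSIONS = {
    "alarm_recipient": {"view", "mute"},
    "monitoring_operator": {"view", "manage"},
    "platform_admin": {"view", "mute", "manage", "global_admin"},
}


def is_allowed(role, action):
    """Return True if the role grants the action; unknown roles get nothing."""
    return action in ROLE_PERMISSIONS.get(role, set())


if __name__ == "__main__":
    print(is_allowed("alarm_recipient", "mute"))    # recipient may mute alarms
    print(is_allowed("alarm_recipient", "manage"))  # but may not manage policies
```

Defaulting unknown roles to an empty permission set keeps the check fail-closed, which matters once routine work is outsourced.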

5. Automation builds the foundation - Efficiency first

Unlike other monitoring systems, collection can be defined directly in the system, with no need for a third-party control system or for logging in to servers to deploy. All configuration can be completed in the interface, including plug-in authoring: we only need to open a page to create a plug-in.

Collection targets are dynamic: when hosts are added to or removed from a module, the plug-in is distributed to the target machines automatically, and there is no need to deploy the collection plug-in manually.

Likewise, the target scope of an alarm policy is matched automatically. Newly added hosts or modules require no policy edits; the scope takes effect automatically.
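The dynamic matching described above works because a policy refers to modules rather than to a fixed host list, and the hosts are resolved each time the policy is evaluated. The sketch below illustrates that design choice with a plain dictionary standing in for the CMDB; the data layout and the function `resolve_targets` are assumptions made for the example.

```python
def resolve_targets(policy_modules, cmdb):
    """Resolve a policy's module names to the hosts currently in those modules.

    `cmdb` maps module name -> list of host addresses (a stand-in for a real CMDB).
    """
    hosts = set()
    for module in policy_modules:
        hosts.update(cmdb.get(module, ()))
    return sorted(hosts)


if __name__ == "__main__":
    cmdb = {"web": ["10.0.0.1"], "db": ["10.0.1.1"]}
    policy = ["web"]  # the policy stores module names, not hosts

    print(resolve_targets(policy, cmdb))
    cmdb["web"].append("10.0.0.2")        # a host joins the module
    print(resolve_targets(policy, cmdb))  # picked up without editing the policy
```

Storing the indirection rather than the resolved list is what lets scaling events take effect without anyone touching the policy or the plug-in deployment.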

5. Summary of monitoring construction - Battle-tested, looking back on the past

As the monitoring platform is built, continuous practice and accumulated experience gradually bring its functions to maturity. However, if building the monitoring platform is the only core goal, you will run into problems that lie outside monitoring itself. How do you combine releases with monitoring, and how do you mute alarms during a release? How do you link monitoring with the CMDB? How do you connect the monitoring system with the operations automation system and with the release pipeline? How do you connect monitoring with every stage of operations? These problems cannot be solved by a standalone monitoring system; they require an extensible and customizable operations platform.

Looking back on building the monitoring system, we summarize the following experience:

1. Set goals and work out what you need. Do not act blindly: analyze business requirements carefully, then choose the corresponding monitoring approach, draw up clear requirements and project plans, and decide whether you need a siloed monitoring system or an operations platform.

2. Prefer a mature and stable open source platform, such as Zabbix or Prometheus. If the monitoring scale becomes too large, however, performance and other usability problems may appear. In that case the author recommends a mature and stable platform such as Blue Whale [5], which addresses capacity and performance problems: the Blue Whale platform natively supports massively concurrent scenarios and scales horizontally, so when capacity runs short you only need to scale out the corresponding service module.

3. As the business develops and technology keeps iterating, the monitoring system will also be optimized, reviewed, and updated iteratively. There is no once-and-for-all solution, only a continually shifting balance.

4. Build the monitoring system from the perspective of a software designer. Facing monitoring requirements, look through the phenomenon to the essence in order to fit the use scenarios better, rather than treating monitoring as a mere tool. Monitoring and operations are closely related: from producing data, to consuming it, to analyzing correlations, monitoring contributes to the goals of operations, namely improving efficiency, quality, and capability.

Therefore, as operations engineers say: monitoring is a vital matter for the business, a matter of life and death, the road to survival or ruin. It cannot be neglected [1].

Download and try it

You are welcome to visit the official Blue Whale Zhiyun website (https://bk.tencent.com/download/) and download Community Edition 6.0 to try it out.

References

[1]. Sun Tzu, The Art of War, "Laying Plans", opening sentence

[2]. https://en.wikipedia.org/wiki/Monitoring

[3]. Sun Tzu, The Art of War, "Waging War"

[4]. https://bk.tencent.com/docs/

[5]. https://bk.tencent.com


Copyright notice
This article was written by [Tencent blue whale assistant]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2021/08/20210805173550249t.html
