How large and medium-sized enterprises build their own monitoring system
2022-06-24 10:05:00 【51CTO】
In recent years I have been responsible for building the company's monitoring system, creating a company-wide distributed monitoring system from scratch. Before that, I published many articles about Zabbix distributed monitoring on my 51CTO blog; many of them were well received and widely read, which gave me a great sense of accomplishment. As everyone knows, Internet technology changes very quickly, and in recent years Docker, k8s, microservices, and CI/CD have become mainstream at most companies. As applications, microservices, and Docker containers multiply, business call chains and fault diagnosis become more complicated. How to choose and use the right monitoring software to solve practical problems, so that monitoring stays ahead of the business and serves it, is something every operations engineer and even every team lead should think about. A complete monitoring system lets the team stop losses in time and demonstrates the value of operations staff, which is why system monitoring matters. Over many years of practice I have settled on a complete monitoring stack: Zabbix for host monitoring, Graylog for log monitoring, Prometheus for microservice monitoring, SkyWalking for business monitoring, third-party APM dial testing, and Grafana as the centralized display platform. Together they form a monitoring system of the kind used by large Internet companies.
1) Zabbix distributed monitoring system: The system has been upgraded smoothly and iteratively from the initial 2.2.1 through 3.2.0 and 4.0 to the current 5.0. It has performed well in functionality, performance, and stability. In recent years, whether the problem was in hardware, the operating system, the network, storage, or middleware, Zabbix has raised timely and accurate alarms, which has been of great value to the stable operation of the company's business. All equipment across the site is now monitored. For server hardware, we combine HP management tools, SNMP GET polling, and Zabbix automatic discovery to monitor server hard disks, power supplies, fans, and ambient temperature; we consider this approach one of the best in the industry. For core equipment, storage, and similar devices, we rely on SNMP traps. Compared with SNMP GET, which polls for data at fixed intervals, a trap triggers an alarm as soon as a problem occurs, so it is better suited to core equipment. For middleware, we collect the most representative performance indicators of each service and set customized alarm conditions based on our years of operations experience. For automation, we wrote many Python and shell programs that scan for targets and add them to the monitoring system automatically, for example site-wide URL availability monitoring, HTTPS certificate monitoring, and monitoring of multiple middleware instances, which reduces human error and workload.
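The certificate monitoring mentioned above can be illustrated with a minimal sketch. The source does not show the actual scripts, so the function names and the 30-day warning threshold here are assumptions chosen for illustration. The sketch only classifies a certificate from its `notAfter` timestamp, which keeps it runnable without network access.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days from `now` until the certificate's notAfter time.

    `not_after` uses the format found in certificate dictionaries,
    for example "Jun 24 00:00:00 2022 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def certificate_status(not_after: str, now: datetime, warn_days: int = 30) -> str:
    """Classify a certificate as 'expired', 'warning', or 'ok'.

    warn_days is a hypothetical threshold, not a value from the article.
    """
    remaining = days_until_expiry(not_after, now)
    if remaining < 0:
        return "expired"
    if remaining < warn_days:
        return "warning"
    return "ok"
```

In a real check, the `notAfter` value could be read from the dictionary returned by `ssl.SSLSocket.getpeercert()` after connecting to the site, and the resulting status could be reported to Zabbix as an item value.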
For monitoring data storage and performance optimization, we solved the MySQL performance problems by automatically partitioning tables by day, and we converted the MySQL storage engine from InnoDB to TokuDB to reduce disk usage. The engine conversion saves about 80% of disk space and solved the data storage problem. For alarm storm control, a Python program pushes alarm messages into a Redis cache, then deduplicates and merges them according to configured conditions and combines them with information from the asset management system before sending the alarm. This gave us alarm storm control, one of the hardest problems in a monitoring system. In improving the distributed monitoring system, I have always followed the same loop: find the problem -> locate the problem -> summarize the problem -> decide whether monitoring needs to be optimized. Along the way I have read through most of the official Zabbix website and accumulated rich practical experience.
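The deduplicate-and-merge step for alarm storm control can be sketched as follows. This is a simplified illustration, not the author's program: it keeps state in an in-memory dictionary instead of Redis, and the `Alarm` fields, the 300-second window, and the `host_groups` mapping (standing in for the asset management system) are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    host: str
    trigger: str
    timestamp: int  # seconds since some epoch


def deduplicate(alarms, window=300):
    """Keep an alarm only if the same (host, trigger) pair was not
    already kept within the last `window` seconds."""
    last_kept = {}
    kept = []
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        key = (alarm.host, alarm.trigger)
        previous = last_kept.get(key)
        if previous is None or alarm.timestamp - previous >= window:
            kept.append(alarm)
            last_kept[key] = alarm.timestamp
    return kept


def merge_by_group(alarms, host_groups):
    """Combine alarms into one message per asset group.

    host_groups maps host name to group name; hosts without an entry
    fall into the 'unassigned' group.
    """
    grouped = defaultdict(list)
    for alarm in alarms:
        group = host_groups.get(alarm.host, "unassigned")
        grouped[group].append(f"{alarm.host}: {alarm.trigger}")
    return {group: "; ".join(lines) for group, lines in grouped.items()}
```

With a shared cache such as Redis, the `last_kept` dictionary would typically be replaced by keys with an expiry equal to the window, so that several alarm senders see the same state.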
2) Graylog log collection system: Although Zabbix supports log monitoring, it is relatively weak in data volume, search, and log display, and can only handle simple log alarms. Log monitoring therefore still needs a dedicated tool. Graylog is an open source tool for log aggregation, analysis, auditing, presentation, and alerting. Compared with ELK, Graylog is more lightweight, has a more attractive UI, and offers a rich and complete API. By looking at the Dashboards reports, we can confirm whether any of thousands of online devices has a problem. We currently collect network device logs, MySQL error logs, Linux system logs, and others. The logs are analyzed statistically by error level, and logs at the higher severity levels trigger WeChat and e-mail alarms through Graylog plus a Python program. In several production incidents, including a kernel crash on an MQ system, file system corruption, CPU soft lockups, and power module failures on network equipment, log alarms warned the relevant staff in time.
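Grouping logs by error level and selecting the severe ones for alerting can be sketched like this. The sketch assumes each message is a dictionary with a numeric `level` field using the standard syslog severity scale (0 is most severe); the function names and the default threshold of 3 (error) are illustrative assumptions, not details from the article.

```python
from collections import Counter

# Standard syslog severity levels; lower numbers are more severe.
SEVERITY_NAMES = {
    0: "emergency",
    1: "alert",
    2: "critical",
    3: "error",
    4: "warning",
    5: "notice",
    6: "informational",
    7: "debug",
}


def count_by_severity(messages):
    """Count messages per severity name, as a dashboard widget might."""
    return Counter(SEVERITY_NAMES[m["level"]] for m in messages)


def messages_to_alert(messages, max_level=3):
    """Return messages at or above the given severity (level <= max_level)."""
    return [m for m in messages if m["level"] <= max_level]
```

The selected messages would then be handed to whatever notification channels are in use, such as the WeChat and e-mail senders described above.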
3) SkyWalking full-link service monitoring system: Connecting a service to monitoring requires no development work or source code changes; simply add the SkyWalking jar package. As the company moves further toward microservices, business call chains and fault diagnosis become more complicated. SkyWalking is mainly used to monitor the link and path of each user request (the topology). It can show whether each hop in the call chain is healthy (and the reason for any error) and how long it takes (database queries, cache queries, and so on), so it is mainly used for business-level performance optimization and fault diagnosis. Alarms are customized per project and pushed to the responsible person, and a Python program delivers them over two channels, e-mail and WeChat.
4) Prometheus monitoring system: Compared with Zabbix, Prometheus has more advantages for monitoring microservices. We run it alongside the Zabbix monitoring system so that the two complement each other and each plays to its strengths. We also worked actively with developers to promote microservice monitoring across the company, and Prometheus now obtains the registered microservices automatically through the Nacos registry. It mainly monitors JVM load at the system level, detailed heap memory, connection pool performance indicators, the number of calls to each URL under every microservice and the status codes returned, and the number of error logs per microservice. Combined with Grafana large-screen dashboards, it is mainly used for system-level fault diagnosis and performance tuning.
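The article does not say how the registry integration is implemented. One possible approach, sketched here under that caveat, is to turn the list of registered instances into a target file for Prometheus file-based service discovery. The input shape (service name mapped to dictionaries with `ip`, `port`, and `healthy` keys) and the `service` label are assumptions made for this example.

```python
import json


def build_file_sd(instances_by_service):
    """Build Prometheus file_sd target groups from registry instances.

    Unhealthy instances are skipped, and services with no healthy
    instance produce no target group.
    """
    groups = []
    for service, instances in sorted(instances_by_service.items()):
        targets = [
            f'{inst["ip"]}:{inst["port"]}'
            for inst in instances
            if inst.get("healthy", True)
        ]
        if targets:
            groups.append({"targets": targets, "labels": {"service": service}})
    return groups


def render_file_sd(instances_by_service):
    """Serialize the target groups as JSON text for a file_sd file."""
    return json.dumps(build_file_sd(instances_by_service), indent=2)
```

A scheduled job could write the rendered JSON to a file listed under `file_sd_configs` in the Prometheus configuration, so that newly registered microservices are scraped without editing the configuration by hand.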
5) Grafana unified front-end display platform: Grafana is an open source, professional data display tool with an attractive interface that is easy to use. Grafana is currently connected to the data in Zabbix, Prometheus, and MySQL, and it makes up for the weaknesses of the monitoring systems in data display. Fetching more than one day of monitoring data through Grafana used to be very slow. After searching Google and tuning Grafana's parameters, fetching data is now very fast, which makes viewing data across the whole channel especially convenient.
6) Third-party APM dial test monitoring: APM dial testing mainly uses LM provided by a third party to monitor website performance and availability across the country. It is also used to evaluate and select third-party services, such as IDC data centers and CDN providers. When evaluating a CDN, we focus mainly on overall performance and the availability of image loading. When evaluating an IDC data center, we focus mainly on download speed, availability, network packet loss rate, and latency.
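The evaluation criteria above (availability, packet loss rate, and latency) can be computed from raw probe results along these lines. This is only an illustrative sketch: the probe record fields `sent`, `received`, and `latency_ms` are hypothetical and do not correspond to any particular third-party APM export format.

```python
from statistics import mean


def summarize_probes(probes):
    """Summarize dial-test probes for one candidate provider.

    Each probe is a dict with packets 'sent', packets 'received', and
    the measured 'latency_ms' (ignored when nothing was received).
    """
    if not probes:
        raise ValueError("at least one probe is required")
    sent = sum(p["sent"] for p in probes)
    received = sum(p["received"] for p in probes)
    reachable = [p for p in probes if p["received"] > 0]
    return {
        # Share of probes that got any response at all.
        "availability": len(reachable) / len(probes),
        # Share of packets that were lost across all probes.
        "packet_loss": 1 - received / sent,
        # Average latency over probes that got a response.
        "mean_latency_ms": mean(p["latency_ms"] for p in reachable) if reachable else None,
    }
```

Running the same summary over probes from several regions for each candidate makes the providers directly comparable on the same three numbers.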
Summary: In many years of operations and monitoring work, I have run into all kinds of problems. For each fault, I work through its symptoms, its cause, how it was investigated, how it was resolved, and how monitoring can be improved. I have studied the key performance indicators of both systems and business in depth. For example, when a system performance bottleneck, a Redis cluster bottleneck, or a message queue bottleneck comes up, I immediately think of the key indicators and troubleshoot using system commands and monitoring. Going forward I will keep learning k8s and complete the monitoring of the company's entire business life cycle (underlying hardware, operating system, network, middleware, microservices, business, and so on), and I will keep improving my ability to analyze and solve problems.