How large and medium-sized enterprises build their own monitoring system
2022-06-24 10:05:00 【51CTO】
In recent years I have been responsible for building my company's monitoring system, taking a company-wide distributed monitoring system from zero to one. Before that I published many articles about Zabbix distributed monitoring on my 51CTO blog; many of them were widely read and well received, which gave me a great sense of accomplishment. As everyone knows, Internet technology changes quickly, and in recent years Docker, k8s, microservices, and CI/CD have become mainstream at most companies. As applications, microservices, and containers multiply, business call chains and fault diagnosis become more complicated. How to choose and use the right monitoring software to solve practical problems, so that monitoring stays ahead of the business and serves it, is a question every operations engineer and even every manager should think about. A sound monitoring system lets you stop losses in time and shows the value of the operations team, which is why system monitoring matters. Over many years of practice I have settled on a complete monitoring stack: Zabbix for host monitoring, Graylog for log monitoring, Prometheus for microservice monitoring, SkyWalking for business monitoring, third-party APM dial testing, and Grafana as the centralized display platform. Together they form a monitoring system of the kind used by large Internet companies.
1) Zabbix distributed monitoring system: The system has been upgraded smoothly and iteratively from the initial 2.2.1 through 3.2.0 and 4.0 to the current 5.0. It has performed well in functionality, performance, and stability. In recent years, whether a problem arose in hardware, the operating system, the network, storage, or middleware, Zabbix has raised timely and accurate alarms, which has been of great value to the stable operation of the company's business. Monitoring now covers all equipment across the site. For server hardware, we combine HP management tools, SNMP GET, and Zabbix's automatic discovery function to monitor hard disks, power supplies, fans, and ambient temperature, a scheme we believe is among the best in the industry. For core machines, storage, and other such equipment, we use SNMP traps. Compared with polling data at regular intervals via SNMP GET, SNMP traps have the advantage that an alarm is triggered as soon as a problem occurs, which makes them better suited to core equipment. For middleware, we collect the most representative performance indicators of each service and customize the alarm conditions based on our years of operations experience. For automation, we wrote many Python and shell programs that scan automatically and add what they find to the monitoring system, for example site-wide URL availability monitoring, HTTPS certificate monitoring, and multi-instance middleware monitoring, which reduces human error and workload.
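The automatic scanning described above can be illustrated with a minimal sketch that turns a scanned URL list into the JSON that a Zabbix discovery rule consumes. This is not the author's program: the function name `build_url_discovery` and the macro names `{#URL}`, `{#HOST}`, and `{#SCHEME}` are assumptions made for illustration, and the macros would have to match the item prototypes in your own template.

```python
import json
from urllib.parse import urlparse


def build_url_discovery(urls):
    """Return a Zabbix low-level discovery JSON string for a list of URLs.

    The macro names {#URL}, {#HOST} and {#SCHEME} are hypothetical; they must
    match whatever the item prototypes in your template reference.
    """
    seen = set()
    entries = []
    for raw in urls:
        url = raw.strip()
        if not url or url in seen:
            continue  # skip blanks and duplicates produced by the scanner
        seen.add(url)
        parsed = urlparse(url)
        if not parsed.hostname:
            continue  # ignore entries that are not absolute URLs
        entries.append({
            "{#URL}": url,
            "{#HOST}": parsed.hostname,
            "{#SCHEME}": parsed.scheme,
        })
    # The {"data": [...]} wrapper is the classic discovery format.
    return json.dumps({"data": entries}, indent=2)


if __name__ == "__main__":
    print(build_url_discovery(["https://www.example.com/login"]))
```

A script like this would typically be called by a discovery rule (for example through a user parameter or an external check), so that newly scanned URLs become monitored items without manual work.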
For monitoring data storage and performance optimization, we addressed MySQL performance problems by automating daily table partitioning, and we converted the MySQL storage engine from InnoDB to TokuDB to reduce disk usage. The conversion saves about 80% of disk space and solved the data storage problem. For alarm storm control, a Python program pushes alarm messages into a Redis cache, then deduplicates and merges them according to set conditions and combines them with information from the asset management system before sending the alarm. This gives us alarm storm control, one of the hardest problems in a monitoring system. In improving the distributed monitoring system, we have always followed the cycle of finding the problem -> locating the problem -> summarizing the problem -> deciding whether monitoring needs to be optimized. Along the way I have read through most of the official Zabbix website and accumulated rich practical experience.
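The dedupe-merge-enrich flow for alarm storm control can be sketched as follows. This is an illustration under stated assumptions, not the author's code: a plain Python list stands in for the Redis cache so the example runs on its own, and the alert keys (`host`, `trigger`, `ts`), the `assets` mapping, and the 300-second window are all hypothetical.

```python
from collections import defaultdict


def merge_alerts(alerts, assets, window=300):
    """Deduplicate and merge raw alerts, then attach asset owners.

    alerts: list of dicts with hypothetical keys "host", "trigger", "ts".
    assets: dict mapping host -> {"owner": ...}, standing in for the
            asset management system.
    window: seconds; repeats of the same host/trigger inside one window
            collapse into a single alert.
    """
    # Step 1: dedupe on (host, trigger, time bucket), counting repeats.
    counts = {}
    for a in alerts:
        key = (a["host"], a["trigger"], a["ts"] // window)
        counts[key] = counts.get(key, 0) + 1

    # Step 2: merge alerts that share a trigger and bucket across hosts.
    grouped = defaultdict(dict)
    for (host, trigger, bucket), n in counts.items():
        grouped[(trigger, bucket)][host] = n

    # Step 3: enrich with asset data and emit one message per group.
    merged = []
    for (trigger, bucket), hosts in sorted(grouped.items()):
        owners = sorted(
            {assets.get(h, {}).get("owner", "unknown") for h in hosts}
        )
        merged.append({
            "trigger": trigger,
            "hosts": sorted(hosts),
            "count": sum(hosts.values()),
            "owners": owners,
        })
    return merged
```

With this shape, a burst of identical alarms from many hosts becomes one message that lists the affected hosts and the people responsible, instead of one message per event.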
2) Graylog log collection system: Although Zabbix supports log monitoring, it is relatively weak in data volume, search, and log display, and can only produce simple log alarms. Log monitoring therefore still needs a dedicated tool. Graylog is an open-source tool for log aggregation, analysis, auditing, display, and alerting. Compared with ELK, Graylog is more lightweight, has a more attractive UI, and offers a rich and complete API. By looking at the dashboards we can confirm whether any of thousands of online devices has a problem. We currently collect network device logs, MySQL error logs, Linux system logs, and so on. Logs are statistically analyzed by error level, and high-severity logs trigger WeChat and e-mail alarms through Graylog plus a Python program. On several occasions, when online business systems hit problems such as an MQ system kernel crash, file system corruption, CPU soft lockups, or a failed power module in a network device, the log alarms notified the relevant people in time.
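The per-level statistics and severity-based alerting can be sketched like this. It is a simplified illustration, not the author's Graylog integration: it assumes messages have already been fetched and carry a numeric syslog severity in a `level` key, and the keys `source` and `message` as well as the threshold `alert_at` are hypothetical.

```python
from collections import Counter

# Syslog severity numbers (RFC 5424): lower means more severe.
SEVERITY_NAMES = {
    0: "emergency", 1: "alert", 2: "critical", 3: "error",
    4: "warning", 5: "notice", 6: "info", 7: "debug",
}


def summarize_by_level(messages, alert_at=3):
    """Count messages per severity and pick those severe enough to alert on.

    messages: list of dicts with hypothetical keys "source", "level" and
              "message", where "level" is a syslog severity number.
    alert_at: alert on this level and anything more severe (numerically
              lower).
    """
    counts = Counter(
        SEVERITY_NAMES.get(m["level"], "unknown") for m in messages
    )
    to_alert = [m for m in messages if m["level"] <= alert_at]
    return dict(counts), to_alert
```

The returned counts could feed a dashboard, while the `to_alert` list would be handed to whatever code sends the WeChat and e-mail notifications.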
3) SkyWalking full-link service monitoring system: Onboarding requires no development work or source code changes; you only need to bring in the SkyWalking jar package. As the company rolls out microservices, business call chains and fault diagnosis become more complicated. SkyWalking is mainly used to monitor the links and paths of user requests (the topology). It can track whether each hop in a call chain is normal (and the reason for any error) and how long it takes (DB queries, cache queries, and so on), and it is mainly used for business-level performance optimization and fault diagnosis. Alarms are customized per project and pushed to the relevant owners; a Python program delivers them over two channels, e-mail and WeChat.
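Routing alarms to owners and sending them over two channels might look like the sketch below. It is hedged in several ways: the payload keys (`name`, `ruleName`, `alarmMessage`) follow the SkyWalking webhook format as I understand it and should be verified against your version, the `owners` mapping and the `ops-oncall` fallback are hypothetical, and the channel functions are placeholders for real e-mail and WeChat senders.

```python
import json


def route_alarms(payload, owners, fallback="ops-oncall"):
    """Group alarm texts by responsible person.

    payload: JSON string holding a list of alarm objects. The keys used
             here are assumed from the SkyWalking webhook format.
    owners:  hypothetical mapping of service name -> owner.
    """
    routed = {}
    for alarm in json.loads(payload):
        service = alarm.get("name", "")
        owner = owners.get(service, fallback)
        text = "[{}] {}: {}".format(
            alarm.get("ruleName", "unknown_rule"),
            service,
            alarm.get("alarmMessage", ""),
        )
        routed.setdefault(owner, []).append(text)
    return routed


def deliver(routed, channels):
    """Send each owner's alarms through every channel.

    channels: list of callables taking (owner, text), for example one for
              e-mail and one for WeChat. Returns the number of sends.
    """
    sent = 0
    for owner, texts in routed.items():
        for send in channels:
            send(owner, "\n".join(texts))
            sent += 1
    return sent
```

Keeping the channels as plain callables means a failure or change in one channel does not affect the routing logic, and a third channel can be added without touching `route_alarms`.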
4) Prometheus monitoring system: Compared with Zabbix, Prometheus has more advantages for monitoring microservices. We run it alongside Zabbix so that each covers the other's weaknesses and plays to its strengths. We also work actively with developers to promote microservice monitoring across the company, and we have made Prometheus automatically obtain registered microservices from the Nacos registry. It mainly monitors system-level JVM load, detailed heap memory, connection pool performance indicators, the number of calls and returned status codes for each URL of every microservice, and counts of microservice error logs. Combined with Grafana dashboards on a large screen, it is mainly used for system-level fault diagnosis and performance tuning.
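The article does not say how the Nacos integration was built. One common approach, shown here purely as an assumption, is file-based service discovery: a script reads the instances registered in Nacos and writes them out as Prometheus `file_sd` target groups. The input shape (`ip`, `port`, `healthy`) is assumed from the Nacos instance list response, and the `/actuator/prometheus` metrics path is a hypothetical default for Spring Boot services.

```python
import json


def to_file_sd(services, metrics_path="/actuator/prometheus"):
    """Convert Nacos instance listings into Prometheus file_sd groups.

    services: dict of service name -> list of instance dicts with keys
              "ip", "port" and "healthy" (shape assumed, see lead-in).
    metrics_path: hypothetical default scrape path.
    """
    groups = []
    for name in sorted(services):
        targets = sorted(
            "{}:{}".format(i["ip"], i["port"])
            for i in services[name]
            if i.get("healthy", True)  # skip instances Nacos marks unhealthy
        )
        if not targets:
            continue  # nothing to scrape for this service
        groups.append({
            "targets": targets,
            "labels": {"job": name, "__metrics_path__": metrics_path},
        })
    return json.dumps(groups, indent=2)
```

Writing the result to a file referenced by a `file_sd_configs` entry lets Prometheus pick up new or removed microservice instances on its next file refresh, without a restart.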
5) Grafana unified front-end display platform: Grafana is a professional open-source data visualization tool with an attractive interface that is easy to use. Grafana is currently connected to data from Zabbix, Prometheus, and MySQL, and it nicely makes up for the monitoring systems' weaknesses in data display. Previously, fetching more than one day of monitoring data through Grafana was very slow; after searching Google for guidance and tuning Grafana's parameters, fetching data is now very fast, which makes it especially convenient to view data for a whole channel.
6) Third-party APM dial-test monitoring: APM dial testing mainly uses LM provided by a third party to monitor website performance and availability across the country. It is also used to evaluate and select third-party services such as IDC data centers and CDNs. When evaluating a CDN we mainly look at the overall performance and availability of image loading; when evaluating an IDC data center we mainly assess download speed, availability, network packet loss rate, and latency.
Summary: Over many years of operations monitoring work, I have run into all kinds of problems. For each fault I work through its symptoms, its cause, how to investigate it, how to fix it, and how to improve monitoring. I have studied the key performance indicators of systems and the business in depth. For example, when someone mentions a system performance bottleneck, a Redis cluster performance bottleneck, or a message queue performance bottleneck, I immediately think of the key indicators and troubleshoot with system commands and monitoring. Next I will continue learning k8s, complete monitoring across the company's whole business life cycle (underlying hardware, system, network, middleware, microservices, business, and so on), and keep improving my ability to analyze and solve problems.