[Part 1 of Cloud Native Monitoring Series] A detailed explanation of Prometheus monitoring system
2022-07-31 10:22:00 【Steve Lu】
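Zabbix is one of the traditional monitoring systems. It appeared before cloud native and stores its data in a SQL relational database. Prometheus is modeled on Google's Borgmon, is written in Go, and uses a time-series database (TSDB), which makes it a good fit for cloud native environments. The latest Zabbix release, 6.0, apparently aware that its survival is at stake, now also supports a TSDB like the one Prometheus uses.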
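I. Prometheus Overview
1.1 What is Prometheus
Prometheus is an open-source service monitoring system and time-series database. It provides a general data model together with convenient interfaces for collecting, storing, and querying data. Its core component, Prometheus server, periodically pulls data from monitoring targets that are either defined statically or configured automatically through service discovery. When newly pulled data exceeds the configured in-memory buffer, it is persisted to the storage device.
Every monitored host can expose its monitoring data through a dedicated exporter program. The exporter collects monitoring data on the target and exposes an HTTP endpoint for Prometheus server to query. Prometheus collects the data periodically with an HTTP-based pull.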
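To make the exporter endpoint more concrete, here is a minimal sketch of reading the plain-text metrics an exporter serves. The `parse_exposition` helper and the sample text are illustrative assumptions, not part of Prometheus. The parser is deliberately simplified: it ignores timestamps and would break on label values that contain commas or spaces.

```python
def parse_exposition(text):
    """Parse simple 'name{labels} value' lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        name, labels = series, {}
        if "{" in series:
            name, _, rest = series.partition("{")
            for pair in rest.rstrip("}").split(","):
                if pair:
                    key, _, val = pair.partition("=")
                    labels[key.strip()] = val.strip().strip('"')
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical sample of what a node exporter might return.
SAMPLE = """\
# HELP node_memory_MemTotal_bytes Memory information field MemTotal_bytes.
# TYPE node_memory_MemTotal_bytes gauge
node_memory_MemTotal_bytes 8.0e+09
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""

samples = parse_exposition(SAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```

Each line is one sample of one time series. Prometheus server attaches the scrape timestamp when it pulls the page.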
Any target must first be registered with the monitoring system before its time-series data can be collected, stored, alerted on, and displayed. Targets can be specified statically in the configuration, or Prometheus can manage them dynamically through a service discovery mechanism.
Prometheus can use the Kubernetes API Server directly as a service discovery system, and so dynamically discover and monitor every monitorable object in the cluster.
Prometheus website: https://prometheus.io
Prometheus GitHub: https://github.com/prometheus
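1.2 Features of Prometheus:
- Multi-dimensional data model: time-series data identified by a metric name and key-value pairs
  Time-series data is a set of observations obtained by repeated measurement over a period of time. Plotted on a graph, it has a data axis and a time axis.
- Built-in time series database: Prometheus. Common external remote storage options: InfluxDB, OpenTSDB, and others
- PromQL, a flexible query language that uses the multi-dimensional data to run complex queries
- Collects time-series data with an HTTP-based pull
- Also supports data collection through the PushGateway component
- Discovers targets through static configuration or service discovery
- Can be used as a data source for Grafana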
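The multi-dimensional data model in the list above can be sketched in a few lines. This is an illustration only: the `select` helper and the `http_requests_total` series are hypothetical, and real PromQL matchers also support regular expressions and negation.

```python
def select(series, name, **matchers):
    """Return the series with the given metric name whose labels equal every matcher."""
    return [
        (metric, labels)
        for metric, labels in series
        if metric == name
        and all(labels.get(key) == value for key, value in matchers.items())
    ]


# Three time series that share one metric name and differ only in labels.
SERIES = [
    ("http_requests_total", {"method": "GET", "code": "200"}),
    ("http_requests_total", {"method": "POST", "code": "200"}),
    ("http_requests_total", {"method": "GET", "code": "500"}),
]

# Roughly what the PromQL selector http_requests_total{method="GET"} would pick.
matched = select(SERIES, "http_requests_total", method="GET")
print(matched)
```

The point is that a metric name alone does not identify a series; the name plus the full label set does.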
1.3 Prometheus ecosystem components
Prometheus is responsible for collecting and storing time-series metric data. Data analysis, aggregation, visualization, and alert delivery are not handled by Prometheus Server.
(1) Prometheus server: the core component. It collects monitoring data by pull over HTTP and stores the time-series data. Prometheus server consists of three parts: Retrieval, Storage, and PromQL
- Retrieval: scrapes metric data from the active target hosts.
- Storage: persists the collected data to disk. Data is kept for 15 days by default (configurable).
- PromQL: the query language module provided by Prometheus.
(2) Client library: gives applications that want to provide native instrumentation a convenient way to build metrics into the application itself.
(3) Exporters: metric exposers. An exporter collects performance data from applications or services that have no built-in instrumentation and serves it over an HTTP endpoint for Prometheus Server to fetch. In other words, an exporter collects and aggregates data from the target application in its original format, converts it to the Prometheus format, and exposes it.
- Node-Exporter: collects hardware and OS level metrics from server nodes (for example k8s nodes), such as load average, CPU, memory, disk, and network. It must be deployed on every compute node. Metric details: https://github.com/prometheus/node_exporter
- mysqld-exporter/nginx-exporter
- Kube-state-Metrics: an exporter that collects k8s resource data for Prometheus. It watches the API Server to gather state data on resource objects in the Kubernetes cluster, such as pods, deployments, and services. It also exposes metrics about itself, mainly the number of resources collected and the number of collection errors.
  Note that kube-state-metrics only exposes metrics and does not store them, so Prometheus is used to scrape and store the data. Its focus is business-related metadata such as Deployment, Pod, and replica status: how many replicas were scheduled? How many are available now? How many Pods are running/stopped/terminated? How many times has a Pod restarted? How many jobs are running?
- cAdvisor: monitors resource usage inside containers, such as CPU, memory, network I/O, and disk I/O.
- blackbox-exporter: monitors whether business containers are alive.
(4) Service Discovery: dynamically discovers the targets to monitor. Prometheus supports several service discovery mechanisms: files, DNS, Consul, Kubernetes, and more.
Service discovery can work through an interface provided by a third party. Prometheus queries it for the list of targets to monitor, then polls those targets for monitoring data. This component is currently built into Prometheus Server.
(5) Alertmanager: an independent alerting module. After receiving alert notifications from Prometheus server, it deduplicates and groups them, routes them to the matching receivers, and sends the alerts. Common receivers include email, DingTalk, and WeCom (enterprise WeChat).
Prometheus Server only generates alert notifications; the actual alerting is handled by the separate Alertmanager application. Prometheus Server periodically evaluates the alert rules provided by the user to generate alert notifications. After receiving them, Alertmanager sends alert messages to the receivers according to the user-defined alert routes.
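The deduplication and grouping that Alertmanager performs can be pictured with the sketch below. It is a simplified illustration under assumptions, not Alertmanager's actual implementation: the `dedup_and_group` function and the sample alerts are hypothetical, and timing, inhibition, and silences are left out. The `group_by` idea mirrors the field of the same name in Alertmanager routes.

```python
from collections import defaultdict


def dedup_and_group(alerts, group_by):
    """Drop alerts with an identical label set, then bucket the rest by the group_by labels."""
    seen = set()
    groups = defaultdict(list)
    for labels in alerts:
        # Two alerts with exactly the same labels count as the same alert.
        fingerprint = tuple(sorted(labels.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


ALERTS = [
    {"alertname": "NodeDown", "instance": "node01", "severity": "critical"},
    {"alertname": "NodeDown", "instance": "node01", "severity": "critical"},  # repeat
    {"alertname": "NodeDown", "instance": "node02", "severity": "critical"},
    {"alertname": "DiskFull", "instance": "node01", "severity": "warning"},
]

groups = dedup_and_group(ALERTS, group_by=["alertname"])
for key, members in groups.items():
    print(key, len(members))
```

With grouping by `alertname`, the receiver would get one notification covering both NodeDown instances rather than one message per host.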
(6) Pushgateway: works like a relay station. Prometheus server can only pull data, but some nodes can, for various reasons, only push. Pushgateway receives the pushed data and exposes it for Prometheus server to pull. In practice, short-lived jobs on a target host report their data to Pushgateway, and Prometheus server then pulls everything from Pushgateway.
(7) Grafana: a cross-platform, open-source metric analysis and visualization tool. It visualizes the collected data and can notify alert receivers in time. The official repository offers a rich set of dashboard plugins.
1.4 How Prometheus works:
- Prometheus Server obtains its monitoring targets (Target) through service discovery (Service Discovery) or static configuration, and scrapes (Scrape) metric data through the exporter on each target;
- Prometheus Server has a built-in file-based time-series store to persist the metric data. Users can retrieve data through the PromQL interface, and alerts can be sent to Alertmanager on demand to deliver the alert content;
- Some short-running jobs have too short a lifecycle to reliably supply the necessary metric data to the server, so they usually output metric data by push (Push). Prometheus relies on Pushgateway to receive the pushed data, which the server then scrapes
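As a rough sketch of the push path for short-lived jobs, the code below builds the request a job could send to Pushgateway. The host name `pushgateway.example`, the job name, and the metric are hypothetical; the `/metrics/job/<job>/<label>/<value>` path layout and the default port 9091 follow the Pushgateway documentation. The `push` function is defined but not called here, since it needs a reachable Pushgateway.

```python
from urllib import request
from urllib.parse import quote


def build_push(base_url, job, labels, metrics):
    """Return (url, body) for pushing metrics under a job and grouping labels."""
    path = "/metrics/job/" + quote(job, safe="")
    for name, value in labels.items():
        path += "/" + quote(name, safe="") + "/" + quote(value, safe="")
    # One 'name value' line per metric; the body must end with a newline.
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    return base_url.rstrip("/") + path, body.encode("utf-8")


def push(base_url, job, labels, metrics):
    """Send the metrics with HTTP PUT, replacing earlier metrics of the same group."""
    url, body = build_push(base_url, job, labels, metrics)
    req = request.Request(url, data=body, method="PUT")
    with request.urlopen(req, timeout=5) as resp:
        return resp.status


url, body = build_push(
    "http://pushgateway.example:9091",   # hypothetical Pushgateway address
    "nightly_backup",                    # hypothetical short-lived job
    {"instance": "node01"},
    {"backup_duration_seconds": 42.5},
)
print(url)
print(body.decode("utf-8"), end="")
```

Prometheus server would then scrape Pushgateway like any other target and pick up the last pushed value.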
1.5 Prometheus workflow
(1) Prometheus is built around Prometheus Server, which collects and stores time-series data. Prometheus Server pulls data from the monitoring targets, or pulls the data gathered by Pushgateway into Prometheus server.
(2) Prometheus server stores the collected metric data on the local HDD/SSD through its TSDB.
(3) The metric data collected by Prometheus is stored as time series. Alert rules are configured so that triggered alerts are sent to Alertmanager.
(4) Alertmanager, configured with alert receivers, sends alerts by email, DingTalk, WeCom, and so on.
(5) The Web UI that ships with Prometheus provides the PromQL query language for querying monitoring data.
(6) Grafana can use Prometheus as a data source and display the monitoring data graphically.
1.6 Limitations of Prometheus
Prometheus is a metrics-based monitoring system and is not suited to storing events, logs, and the like. It is geared toward showing trends rather than exact data;
Prometheus assumes that only recent monitoring data needs to be queried. Its local storage was designed to hold short-term data (for example one month), so it does not support storing large amounts of historical data. If long-term history is needed, use the remote storage mechanism to keep the data in a system such as InfluxDB or OpenTSDB;
Prometheus clustering is not very mature. Thanos (the same word as the name of the Marvel villain) can be used to build highly available and federated Prometheus clusters.
For example, MySQL, nginx, and k8s can each be scraped by a separate Prometheus instance, and together these form the cluster
2.1 Environment preparation
Server role | IP address | Components |
Prometheus server | | Prometheus, node_exporter |
Grafana server | | Grafana |
Monitored server | | node_exporter |
2.2 Deploying Prometheus
(1) Upload prometheus-2.35.0.linux-amd64.tar.gz to the /opt directory and extract it
#Extract the uploaded package
cd /opt
tar xf prometheus-2.35.0.linux-amd64.tar.gz
#Move and rename the directory
mv prometheus-2.35.0.linux-amd64 /usr/local/prometheus
cd /usr/local/prometheus
ls
console_libraries consoles LICENSE NOTICE prometheus prometheus.yml promtool
cat /usr/local/prometheus/prometheus.yml | grep -v "^#"
global:                      #global settings for Prometheus, such as the scrape interval and scrape timeout
  scrape_interval: 15s       #how often targets are scraped; default 1m
  evaluation_interval: 15s   #how often alert rules are evaluated; default 1m
  # scrape_timeout is set to the global default (10s).
  scrape_timeout: 10s        #scrape timeout; default 10s
alerting:                    #Alertmanager instances; supports static configuration and dynamic service discovery
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
rule_files:                  #paths of the alert rule files to load; filename wildcards are supported
  # - "first_rules.yml"
  # - "second_rules.yml"
scrape_configs:              #sources of time-series data to scrape
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"   #each group of monitored instances is named by job_name; supports static configuration (static_configs) and dynamic service discovery (*_sd_configs)
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:          #static targets: pull data from a fixed list of targets
      - targets: ["localhost:9090"]
(2) Configure the systemd unit file and start Prometheus
cat > /usr/lib/systemd/system/prometheus.service <<'EOF'
[Unit]
Description=Prometheus Server
Documentation=https://prometheus.io
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/prometheus/prometheus \
--config.file=/usr/local/prometheus/prometheus.yml \
--storage.tsdb.path=/usr/local/prometheus/data/ \
--storage.tsdb.retention=15d \
--web.enable-lifecycle
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
Notes on the unit file (the trailing annotations are explanatory only and do not belong in the unit file itself):
[Unit] #unit section
Description=Prometheus Server #description
After=network.target #ordering dependency
ExecStart=/usr/local/prometheus/prometheus \
--config.file=/usr/local/prometheus/prometheus.yml \ #configuration file
--storage.tsdb.path=/usr/local/prometheus/data/ \ #data directory
--storage.tsdb.retention=15d \ #retention period
--web.enable-lifecycle #enable hot reload
ExecReload=/bin/kill -HUP $MAINPID #reload
systemctl start prometheus
systemctl enable prometheus
netstat -natp | grep :9090
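In a browser, open http://<Prometheus-server-IP>:9090 to reach the Prometheus Web UI
Click Status -> Targets on the page. If every Target shows the state UP, Prometheus is collecting data normally, and you can see the metrics Prometheus collects about itself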
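The same UP/DOWN information shown on the Targets page can also be read through the Prometheus HTTP API by running the `up` query. The sketch below only builds the request URL and parses a response. The sample response and the `node01.example` instance are made up for illustration, while the `/api/v1/query` endpoint and the response layout follow the Prometheus HTTP API documentation.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, expr):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": expr})


def targets_up(response_text):
    """Map each instance label to True (up) or False (down) from an 'up' query response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("query failed")
    return {
        result["metric"].get("instance", ""): result["value"][1] == "1"
        for result in payload["data"]["result"]
    }


# Hypothetical response body, shaped like a real instant-vector result.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "instance": "localhost:9090", "job": "prometheus"},
             "value": [1659233000.0, "1"]},
            {"metric": {"__name__": "up", "instance": "node01.example:9100", "job": "nodes"},
             "value": [1659233000.0, "0"]},
        ],
    },
})

query_url = instant_query_url("http://localhost:9090", "up")
status = targets_up(SAMPLE_RESPONSE)
print(query_url)
print(status)
```

Fetching `query_url` with any HTTP client on the Prometheus server should return JSON in this shape.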
III. Deploying Exporters
Deploy Node Exporter to monitor system-level metrics
(1) Upload node_exporter-1.3.1.linux-amd64.tar.gz to the /opt directory and extract it
cd /opt/
tar xf node_exporter-1.3.1.linux-amd64.tar.gz
mv node_exporter-1.3.1.linux-amd64/node_exporter /usr/local/bin
cat > /usr/lib/systemd/system/node_exporter.service <<'EOF'
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter \
--collector.ntp \
--collector.mountstats \
--collector.systemd \
--collector.tcpstat
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
systemctl start node_exporter
systemctl enable node_exporter
netstat -natp | grep :9100
In a browser, open http://<node-IP>:9100/metrics to see the data collected by Node Exporter
Commonly used metrics:
- node_cpu_seconds_total
- node_memory_MemTotal_bytes
- node_filesystem_size_bytes{mountpoint=PATH}
- node_systemd_unit_state{name=}
- node_vmstat_pswpin: pages swapped in from disk to memory (a cumulative counter; apply rate() for a per-second value)
- node_vmstat_pswpout: pages swapped out from memory to disk (a cumulative counter; apply rate() for a per-second value)
More metrics are described at: https://github.com/prometheus/node_exporter
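Counters such as node_cpu_seconds_total only ever increase, so they become useful once turned into a rate. The following sketch shows the arithmetic behind a CPU usage figure. The function names and sample numbers are hypothetical, and counter resets, which PromQL's rate() handles, are ignored here.

```python
def counter_rate(value_start, value_end, seconds):
    """Average per-second increase of a counter between two samples."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (value_end - value_start) / seconds


def cpu_busy_fraction(idle_start, idle_end, seconds, cores):
    """Fraction of CPU time not spent idle, given idle seconds summed over all cores."""
    idle_per_second = counter_rate(idle_start, idle_end, seconds)
    return 1.0 - idle_per_second / cores


# Two samples of the summed mode="idle" counter on a 2-core host, taken 60 s apart.
busy = cpu_busy_fraction(idle_start=1000.0, idle_end=1090.0, seconds=60.0, cores=2)
print(f"CPU busy: {busy:.0%}")
```

In PromQL the equivalent would be along the lines of `1 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m]))`.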
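(4) Edit the Prometheus configuration file to add the node to Prometheus monitoring
vim /usr/local/prometheus/prometheus.yml
  - job_name: nodes
    metrics_path: "/metrics"
    static_configs:
      - targets:            #list each node_exporter endpoint here as host:9100
        labels:
          service: kubernetes
curl -X POST http://<Prometheus-server-IP>:9090/-/reload    #hot reload (requires --web.enable-lifecycle)
or: systemctl reload prometheus
In the browser, check Status -> Targets on the Prometheus page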
#Let yum resolve the dependencies; here the package was uploaded directly to /opt
yum install -y grafana-7.4.0-1.x86_64.rpm
systemctl start grafana-server
systemctl enable grafana-server
netstat -natp | grep :3000
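In a browser, open http://<Grafana-server-IP>:3000. The default account and password are admin/admin
Configuration -> Data Sources -> Add data source -> select Prometheus
HTTP -> URL: enter the Prometheus server address, http://<Prometheus-server-IP>:9090
Click Save & Test
Click Dashboards in the top menu and import all the default templates
Dashboards -> Manage, select Prometheus 2.0 Stats or Prometheus Stats to see monitoring graphs for the Prometheus job instance
(3) Import Grafana dashboards
In a browser, open https://grafana.com/grafana/dashboards, search the page for node exporter, pick a suitable dashboard, and click Copy ID or Download JSON
In the Grafana page, go to + Create -> Import, enter the dashboard ID or upload the JSON file, and click Load to import the dashboard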
5.1 File-based service discovery
File-based service discovery is only slightly more capable than static configuration. It does not depend on any platform or third-party service, which also makes it the simplest and most general approach.
Prometheus Server periodically loads Target information from files. The files can be in YAML or JSON format and contain the list of defined Targets plus optional labels.
(1) Create the file used for service discovery and define the required targets in it
cd /usr/local/prometheus
mkdir targets
vim targets/node-exporter.yaml
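- targets:              #list each node_exporter endpoint here as host:9100
  labels:
    app: node-exporter
    job: node
#Edit the Prometheus configuration file; target discovery is defined inside a job in the configuration file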
vim /usr/local/prometheus/prometheus.yml
  - job_name: nodes
    file_sd_configs:                  #use file-based service discovery
      - files:                        #list of files to load
          - targets/node*.yaml        #wildcards are supported in file names
        refresh_interval: 2m          #reload the Targets defined in the files every 2 minutes; default 5m
systemctl reload prometheus
In the browser, check Status -> Targets on the Prometheus page
This assumes the node-exporter component is already installed on that node. The step is not repeated here; you can copy it over from the Prometheus machine with scp
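Because the target file is just data that Prometheus re-reads, it can be generated by a script instead of edited by hand. Below is a minimal sketch that writes a JSON target file. The host names `node01.example` and `node02.example` and the `write_file_sd` helper are hypothetical, and the file pattern in prometheus.yml would have to match JSON files (for example `targets/node*.json`) for it to be picked up. The file is written to a temporary name and then renamed, so a reader should not observe a half-written file.

```python
import json
import os
import tempfile


def write_file_sd(path, groups):
    """Write target groups as JSON, swapping the file into place in one step."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(groups, handle, indent=2)
    os.replace(tmp_path, path)


# Same shape as the YAML target file above: a list of {targets, labels} groups.
groups = [
    {
        "targets": ["node01.example:9100", "node02.example:9100"],  # hypothetical hosts
        "labels": {"app": "node-exporter", "job": "node"},
    }
]

out_dir = tempfile.mkdtemp()  # stand-in for /usr/local/prometheus/targets
out_path = os.path.join(out_dir, "node-exporter.json")
write_file_sd(out_path, groups)

with open(out_path, encoding="utf-8") as handle:
    loaded = json.load(handle)
print(json.dumps(loaded, indent=2))
```

A cron job or deployment script could regenerate this file whenever hosts are added or removed, and Prometheus would pick up the change at the next refresh.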
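5.2 Consul-based service discovery
Consul is an open-source tool written in Go. It targets distributed, service-oriented systems and provides service registration, service discovery, and configuration management.
(1) Deploy the Consul service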
cd /opt/
unzip consul_1.9.2_linux_amd64.zip
mv consul /usr/local/bin/
#Create the Consul data and configuration directories
mkdir /var/lib/consul-data
mkdir /etc/consul/
#Start the Consul service in server mode
consul agent \
-server \
-bootstrap \
-ui \
-data-dir=/var/lib/consul-data \
-config-dir=/etc/consul/ \
-bind= \
-client= \
-node=consul-server01 &> /var/log/consul.log &
#List the Consul cluster members
consul members
(2) Register Services in Consul
#Add a file in the configuration directory
vim /etc/consul/nodes.json
{
  "services": [
    {
      "id": "node_exporter-node01",
      "name": "node01",
      "address": "",
      "port": 9100,
      "tags": ["nodes"],
      "checks": [{
        "http": "",
        "interval": "5s"
      }]
    },
    {
      "id": "node_exporter-node02",
      "name": "node02",
      "address": "",
      "port": 9100,
      "tags": ["nodes"],
      "checks": [{
        "http": "",
        "interval": "5s"
      }]
    }
  ]
}
#Have Consul reload its configuration
consul reload
Likewise, the machine whose address ends in 134 also needs node-exporter installed; that step is not shown here
(3) Edit the Prometheus configuration file
vim /usr/local/prometheus/prometheus.yml
  - job_name: nodes
    consul_sd_configs:                #use Consul service discovery
      - server:                       #address of the Consul server endpoint
        tags:                         #only services in Consul that carry these tags are added to Prometheus monitoring
          - nodes
        refresh_interval: 2m
systemctl reload prometheus
In the browser, check Status -> Targets on the Prometheus page
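The effect of the `tags` setting can be sketched as a simple filter: only services that carry every listed tag become targets. This is an illustration of the idea and not Prometheus code; the `filter_by_tags` helper and the `mysql-01` entry are hypothetical.

```python
def filter_by_tags(services, required_tags):
    """Keep the services whose tag list contains every required tag."""
    required = set(required_tags)
    return [svc for svc in services if required.issubset(svc.get("tags", []))]


SERVICES = [
    {"id": "node_exporter-node01", "name": "node01", "port": 9100, "tags": ["nodes"]},
    {"id": "node_exporter-node02", "name": "node02", "port": 9100, "tags": ["nodes"]},
    {"id": "mysql-01", "name": "mysql", "port": 3306, "tags": ["db"]},  # hypothetical extra service
    {"id": "consul", "name": "consul", "port": 8300},                   # the Consul server itself, no tags
]

selected = [svc["id"] for svc in filter_by_tags(SERVICES, ["nodes"])]
print(selected)
```

This is why the registrations above carry the `nodes` tag: without a tag filter, every service registered in Consul, including Consul itself, would show up as a scrape target for the job.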
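#Deregister a Service from Consul
consul services deregister -id="node_exporter-node02"
#Register the services again
consul services register /etc/consul/nodes.json
5.3 Kubernetes API-based service discovery
There is a lot to cover here, so it will be written up in detail in a later article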
Prometheus definition: a monitoring system and time-series database
- Prometheus server (collects data by HTTP pull, stores it in the TSDB, generates alerts)
- client library (lets applications and services natively support Prometheus metric collection)
- exporter (metric exposer; collects data from systems and applications that do not natively support Prometheus and exposes it to Prometheus)
- Alertmanager (receives the alert notifications pushed by Prometheus server and routes them to the receivers)
- pushgateway (receives monitoring data pushed by short-lived jobs, stores it temporarily, and lets Prometheus server pull it in one place)
- grafana (external platform for displaying monitoring data; queries the Prometheus data source with PromQL)
- service discovery (dynamic service discovery; supports files, Consul, K8S, DNS, and other methods)
Prometheus remote storage: InfluxDB, OpenTSDB