[Prometheus] An optimization record of the Prometheus federation [continued]
2022-07-30 18:33:00 【Meepoljd】
Preface
I previously wrote up an optimization of a Prometheus federation cluster: dropping unneeded metrics relieved, to some extent, the data-pull pressure on the query node. But once the metric volume is large enough, or enough endpoints are being collected, that approach falls short. Pulling the metrics in groups became the next optimization, which I record here.
For dropping non-essential metrics, see the previous article: [Prometheus] An optimization record of the Prometheus federation
Main text
Server planning
First, the layout of the three Prometheus nodes in the current environment (IP addresses have been masked):
| Server IP | Server type | CPU | Memory |
|---|---|---|---|
| 10.0.0.69 | Collection Prometheus | 64 | 256 |
| 10.0.0.70 | Collection Prometheus | 64 | 256 |
| 10.0.0.71 | Aggregation Prometheus | 64 | 256 |
The collection Prometheus nodes pull data from specific collection endpoints, such as node_exporter; the aggregation Prometheus node periodically aggregates the metrics collected by each collection Prometheus node.
Analysis
After the last optimization, the number of monitored endpoints kept growing. A few days ago, gaps in metric ingestion from one collection Prometheus node, caused by overly long response times, began to occur frequently again.
On the corresponding server, host resource usage was not high and the Prometheus process was not consuming many resources, which rules out a resource bottleneck on the collection Prometheus node as the cause. That node had also collected the required metrics from the hosts normally. So the suspicion again fell on the query Prometheus node timing out while aggregating metrics.
The configuration after the last optimization was as follows:
- job_name: 'federate'
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{__name__=~"node_.*|up.*"}'
  static_configs:
    - targets:
        - '10.0.0.69:9090'
        - '10.0.0.70:9090'
      labels:
        cluster: XXXX系统
  tls_config:
    insecure_skip_verify: true
The required metrics have already been filtered, so the total volume of collected metrics cannot be reduced further. What can be done is to split the metrics that are transferred in one large batch into several smaller aggregation pulls, instead of pulling the full set of monitoring metrics from each of the two collection Prometheus nodes in a single request (in this example, of course, only metrics starting with node_ and up are ingested).
Grouped ingestion
Because all collected host metrics carry an instance label, the pulls can be grouped by network segment. Each pull then no longer fetches a huge number of metrics but is broken into smaller pulls. The configuration is as follows:
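To get a feel for how heavy one federation pull is, the request that the scrape job issues can be reproduced by hand. The following is a minimal sketch, not part of the original setup: the helper names `build_federate_url` and `time_federate` are hypothetical, and plain `http` is assumed because the job above sets no `scheme`.

```python
import time
import urllib.parse
import urllib.request


def build_federate_url(target, selectors, scheme="http"):
    """Build the /federate URL requested by a job with these match[] params.

    The scheme is an assumption; the article's job does not set one.
    """
    query = urllib.parse.urlencode({"match[]": selectors}, doseq=True)
    return f"{scheme}://{target}/federate?{query}"


def time_federate(url, timeout=60.0):
    """Fetch one federation response; return (seconds elapsed, sample line count)."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    elapsed = time.monotonic() - start
    # Lines starting with '#' are HELP/TYPE comments, not samples.
    samples = sum(1 for line in body.splitlines() if line and not line.startswith("#"))
    return elapsed, samples


url = build_federate_url("10.0.0.69:9090", ['{__name__=~"node_.*|up.*"}'])
print(url)
# Against a reachable collection node one could then run:
#   elapsed, samples = time_federate(url)
```

Comparing the elapsed time and sample count for the full selector against a narrower one gives a rough idea of how much a split would help before touching the scrape configuration.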
# Group 1
- job_name: 'federate_0'
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      # Pulls metrics for servers whose IP starts with 10.0.4
      - '{__name__=~"node_.*|up.*",instance=~"10.0.4.*9100"}'
  static_configs:
    - targets:
        - '10.0.0.69:9090'
        - '10.0.0.70:9090'
      labels:
        cluster: XXXX系统
  tls_config:
    insecure_skip_verify: true
# Group 2
- job_name: 'federate_1'
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      # Pulls metrics for servers whose IP starts with 10.0.6
      - '{__name__=~"node_.*|up.*",instance=~"10.0.6.*9100"}'
  static_configs:
    - targets:
        - '10.0.0.69:9090'
        - '10.0.0.70:9090'
      labels:
        cluster: XXXX系统
  tls_config:
    insecure_skip_verify: true
# Group 3
- job_name: 'federate_2'
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      # Pulls metrics for servers whose IP starts with 10.0.7 or 10.0.8
      - '{__name__=~"node_.*|up.*",instance=~"10.0.7.*9100|10.0.8.*9100"}'
  static_configs:
    - targets:
        - '10.0.0.69:9090'
        - '10.0.0.70:9090'
      labels:
        cluster: XXXX系统
  tls_config:
    insecure_skip_verify: true
# Group 4
- job_name: 'federate_3'
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      # Pulls metrics for servers whose IP starts with 10.30
      - '{__name__=~"node_.*|up.*",instance=~"10.30.*9100"}'
  static_configs:
    - targets:
        - '10.0.0.69:9090'
        - '10.0.0.70:9090'
      labels:
        cluster: XXXX系统
  tls_config:
    insecure_skip_verify: true
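When metrics are split by `instance` pattern, every host should fall into exactly one group; a host matched by no group silently disappears from the aggregation node. The sketch below checks this for the four patterns above. It relies on the fact that Prometheus anchors label regexes at both ends, which `re.fullmatch` imitates here. The sample addresses and the helper names `matching_groups` and `strict_pattern` are made up for illustration.

```python
import re

# The instance patterns from the four federate_N jobs, copied as written.
GROUPS = {
    "federate_0": r"10.0.4.*9100",
    "federate_1": r"10.0.6.*9100",
    "federate_2": r"10.0.7.*9100|10.0.8.*9100",
    "federate_3": r"10.30.*9100",
}


def matching_groups(instance, groups):
    """Return the jobs whose pattern matches the whole instance label."""
    return [job for job, pattern in groups.items() if re.fullmatch(pattern, instance)]


def strict_pattern(prefixes, port=9100):
    """Build a stricter pattern: dots escaped, prefix ends at an octet boundary."""
    alternatives = "|".join(re.escape(prefix) for prefix in prefixes)
    return rf"(?:{alternatives})\..*:{port}"


# Hypothetical instance labels, not taken from the real environment.
SAMPLES = [
    "10.0.4.15:9100",
    "10.0.6.3:9100",
    "10.0.8.9:9100",
    "10.30.1.2:9100",
    "10.0.40.15:9100",  # also matched by federate_0, because "." matches any character
    "10.0.5.1:9100",    # matched by no group, so it would not be federated
]

for instance in SAMPLES:
    print(instance, matching_groups(instance, GROUPS))
```

With the unescaped patterns, a host in a segment such as 10.0.40 would land in the 10.0.4 group. That is harmless if no such segment exists, but a pattern like the one `strict_pattern(["10.0.4"])` returns avoids the ambiguity. Note that backslashes inside a double-quoted selector generally need to be doubled when written into the `match[]` string.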
Then save the configuration and reload the Prometheus service. Observing metric ingestion again, the gaps no longer appear.
Looking at the ingestion time, the improvement from this change is very large.
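One way to compare ingestion time per group is the `scrape_duration_seconds` series that Prometheus records for every scrape target, queried through the HTTP API at `/api/v1/query`. The following is a minimal sketch under stated assumptions: the aggregation node is assumed to listen on `10.0.0.71:9090` over plain http, the helper names are hypothetical, and the sample response is hand-written with made-up values purely to show the parsing.

```python
import json
import urllib.parse


def query_url(base, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def slowest_by_job(response_text):
    """Map each job to its slowest scrape duration (seconds) in an instant-query response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    slowest = {}
    for series in payload["data"]["result"]:
        job = series["metric"].get("job", "")
        seconds = float(series["value"][1])  # value is [timestamp, "number as string"]
        slowest[job] = max(seconds, slowest.get(job, 0.0))
    return slowest


# Assumed address of the aggregation node; adjust to the real environment.
print(query_url("http://10.0.0.71:9090", 'scrape_duration_seconds{job=~"federate.*"}'))

# Hand-written response with invented values, for illustration only.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"job": "federate_0", "instance": "10.0.0.69:9090"},
             "value": [1659177180, "1.5"]},
            {"metric": {"job": "federate_0", "instance": "10.0.0.70:9090"},
             "value": [1659177180, "2.25"]},
            {"metric": {"job": "federate_1", "instance": "10.0.0.69:9090"},
             "value": [1659177180, "0.75"]},
        ],
    },
})
print(slowest_by_job(SAMPLE))
```

If one group stays noticeably slower than the others, that group is a natural candidate for a further split.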
Summary
When Prometheus collects metrics, whether in a federated or a single-node setup, the amount of data ingested per scrape of each endpoint should be kept as small as possible so that latency requirements can be met. Otherwise, network transfer consumes much of the pull time and causes gaps in the monitored metrics.