[Prometheus] An optimization record of the Prometheus federation [continued]
2022-07-30 18:33:00 【Meepoljd】
Preface
I previously wrote up an optimization of a Prometheus federated cluster: dropping unneeded metrics, which relieved some of the data-pull pressure on the query node. Once the metric volume is large enough, or enough endpoints are being collected, that method falls short. Pulling the metrics in groups is the next optimization, and I am recording it here.
For filtering out non-essential metrics, see the previous article: [Prometheus] An optimization record of the Prometheus federation
Main text
Server planning
First, the current plan for the three Prometheus nodes (IPs have been anonymized):
Server IP | Server role | CPU | Memory |
---|---|---|---|
10.0.0.69 | Collector Prometheus | 64 | 256 |
10.0.0.70 | Collector Prometheus | 64 | 256 |
10.0.0.71 | Aggregator Prometheus | 64 | 256 |
A collector Prometheus pulls data from specific collection endpoints such as node_exporter. The aggregator Prometheus periodically aggregates the metrics collected by each collector Prometheus node.
Analysis
After the last optimization, the number of monitored collection endpoints kept growing. A few days ago, gaps in metric ingestion caused by overly long response times began to appear frequently again for one collector Prometheus.
Checking the corresponding server showed that host resource usage was not high and the Prometheus process was not consuming many resources, which rules out a resource bottleneck on the collector as the cause. That node had itself collected the necessary metrics from the hosts, so the suspicion remained that the gaps came from timeouts when the query (aggregator) Prometheus node pulled the metrics.
The configuration after the last optimization was as follows:
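One way to test that suspicion is to time a single request to the collector's /federate endpoint with the same selector the aggregator uses. The following is only a minimal sketch: the helper names `federate_url` and `time_federate` are made up for illustration, and plain HTTP on port 9090 is an assumption based on the targets in the scrape configuration.

```python
import time
import urllib.parse
import urllib.request


def federate_url(base_url, matchers):
    """Build a /federate URL with one match[] parameter per selector."""
    query = urllib.parse.urlencode([("match[]", m) for m in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"


def time_federate(base_url, matchers, timeout=30.0):
    """Fetch /federate once and return (elapsed seconds, response size in bytes)."""
    start = time.monotonic()
    with urllib.request.urlopen(federate_url(base_url, matchers), timeout=timeout) as resp:
        body = resp.read()
    return time.monotonic() - start, len(body)
```

Running `time_federate("http://10.0.0.69:9090", ['{__name__=~"node_.*|up.*"}'])` from the aggregator host and comparing the elapsed time with the job's scrape timeout gives a rough idea of whether the full pull can finish in time.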
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"node_.*|up.*"}'
    static_configs:
      - targets:
          - '10.0.0.69:9090'
          - '10.0.0.70:9090'
        labels:
          cluster: XXXX系统
    tls_config:
      insecure_skip_verify: true
The metrics have already been filtered down to the required ones, so the total volume collected cannot be reduced any further. What can be done is to split the single large transfer into several smaller ones, instead of pulling each collector Prometheus node's full set of monitoring metrics in one scrape (in this example, only metrics whose names start with node_ or up).
Grouped ingestion
All of the collected host metrics carry an instance label, so the pulls can be grouped by network segment. Each pull then fetches a smaller set of metrics rather than one huge batch. The configuration is as follows:
  # Group 1
  - job_name: 'federate_0'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Pulls metrics for servers whose IP starts with 10.0.4
        - '{__name__=~"node_.*|up.*",instance=~"10.0.4.*9100"}'
    static_configs:
      - targets:
          - '10.0.0.69:9090'
          - '10.0.0.70:9090'
        labels:
          cluster: XXXX系统
    tls_config:
      insecure_skip_verify: true
  # Group 2
  - job_name: 'federate_1'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Pulls metrics for servers whose IP starts with 10.0.6
        - '{__name__=~"node_.*|up.*",instance=~"10.0.6.*9100"}'
    static_configs:
      - targets:
          - '10.0.0.69:9090'
          - '10.0.0.70:9090'
        labels:
          cluster: XXXX系统
    tls_config:
      insecure_skip_verify: true
  # Group 3
  - job_name: 'federate_2'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Pulls metrics for servers whose IP starts with 10.0.7 or 10.0.8
        - '{__name__=~"node_.*|up.*",instance=~"10.0.7.*9100|10.0.8.*9100"}'
    static_configs:
      - targets:
          - '10.0.0.69:9090'
          - '10.0.0.70:9090'
        labels:
          cluster: XXXX系统
    tls_config:
      insecure_skip_verify: true
  # Group 4
  - job_name: 'federate_3'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Pulls metrics for servers whose IP starts with 10.30
        - '{__name__=~"node_.*|up.*",instance=~"10.30.*9100"}'
    static_configs:
      - targets:
          - '10.0.0.69:9090'
          - '10.0.0.70:9090'
        labels:
          cluster: XXXX系统
    tls_config:
      insecure_skip_verify: true
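Before reloading, it can be worth checking that the instance patterns split the hosts the way you expect, since any host matched by no job silently drops out of the aggregator. The sketch below is not part of the original setup: the patterns are copied from the four jobs above, the function names are hypothetical, and Python's `re.fullmatch` is used as a stand-in for Prometheus's fully anchored regex matching, which should behave the same for patterns this simple.

```python
import re

# instance patterns copied from the four federate jobs
GROUPS = {
    "federate_0": r"10.0.4.*9100",
    "federate_1": r"10.0.6.*9100",
    "federate_2": r"10.0.7.*9100|10.0.8.*9100",
    "federate_3": r"10.30.*9100",
}


def jobs_for(instance):
    """Return the jobs whose pattern matches the whole instance label value."""
    # Prometheus anchors =~ patterns at both ends, so fullmatch is the closest analogue.
    return [job for job, pattern in GROUPS.items() if re.fullmatch(pattern, instance)]


def check_partition(instances):
    """Return the instances that are matched by no job or by more than one job."""
    problems = {}
    for instance in instances:
        jobs = jobs_for(instance)
        if len(jobs) != 1:
            problems[instance] = jobs
    return problems
```

Feeding `check_partition` the real list of instance values from a collector shows which hosts fall outside every group. Note that the dots in these patterns are unescaped, so `10.0.4.*9100` also matches a hypothetical host such as `10.0.40.7:9100`. If that matters in your address plan, escaping the dots and anchoring on the following dot would tighten the match.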
Then save the configuration and reload the Prometheus service. Watching metric ingestion again, the collection gaps no longer appear.
The scrape duration also shows a very large improvement.
Summary
When Prometheus collects metrics, whether in a federated or a single-node setup, the amount of data ingested per scrape at each endpoint should be kept as small as possible so that the latency requirements can be met. Otherwise, network transfer takes up much of the pull time and causes gaps in the monitored metrics.
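To compare scrape times per job, one option is to query the `scrape_duration_seconds` series that Prometheus records for every scrape target, using the instant-query endpoint of the HTTP API on the aggregator. This is a rough sketch only: the function names are made up for illustration, and the aggregator address passed in (for example `http://10.0.0.71:9090`) assumes plain HTTP on the default port.

```python
import json
import urllib.parse
import urllib.request


def scrape_durations(response):
    """Map (job, instance) to the last scrape duration in seconds."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response}")
    durations = {}
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        # An instant-vector sample is [unix_timestamp, "value as a string"].
        durations[(labels.get("job"), labels.get("instance"))] = float(sample["value"][1])
    return durations


def query_scrape_durations(base_url, timeout=10.0):
    """Run the instant query against the aggregator's HTTP API."""
    params = urllib.parse.urlencode(
        {"query": 'scrape_duration_seconds{job=~"federate.*"}'}
    )
    url = f"{base_url.rstrip('/')}/api/v1/query?{params}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return scrape_durations(json.load(resp))
```

Each federate job and collector pair gets its own entry, so a group that is still too large stands out and can be split further.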