[Prometheus] Notes on optimizing a Prometheus federation (continued)
2022-07-30 18:33:00 【Meepoljd】
Preface
I previously wrote up an optimization of a Prometheus federation cluster. Dropping useless metrics relieved, to some extent, the data-pull pressure on the query node. But once the metric volume is large enough, or enough endpoints are being collected, that approach starts to look clumsy. Pulling the metrics in groups is therefore the next optimization, which I record here.
For how non-essential metrics were filtered out, see the previous article: 【Prometheus】Prometheus联邦的一次优化记录
Main text
Server planning
First, the layout of the three Prometheus nodes in the current environment (the IPs have been anonymized):
| Server IP | Server type | CPU | Memory |
|---|---|---|---|
| 10.0.0.69 | Collection Prometheus | 64 | 256 |
| 10.0.0.70 | Collection Prometheus | 64 | 256 |
| 10.0.0.71 | Aggregation Prometheus | 64 | 256 |
A collection Prometheus pulls data from specific collection endpoints, such as node_exporter. The aggregation Prometheus periodically aggregates the metrics gathered by each collection Prometheus node.
Analysis
After the last optimization, the number of monitored endpoints kept growing. A few days ago, gaps in metric ingestion caused by overly long response times began to appear frequently again for one of the collection Prometheus nodes.
Checking the corresponding server showed that host resource usage was not high and that the Prometheus process was not consuming many resources, which rules out a resource bottleneck on the collection Prometheus as the cause. The node itself had collected the necessary metrics from the hosts, so the suspicion again falls on the query (aggregation) Prometheus node timing out while it aggregates metrics.
The configuration after the previous optimization is as follows:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"node_.*|up.*"}'
    static_configs:
      - targets:
          - '10.0.0.69:9090'
          - '10.0.0.70:9090'
        labels:
          cluster: XXXX系统
    tls_config:
      insecure_skip_verify: true
The required metrics have already been filtered, so the total volume of collected metrics cannot be reduced any further. What can still be done is to split the large batch transfer into smaller ones. Until now, each aggregation pass pulled the full set of monitoring metrics from each of the two collection Prometheus nodes (in this example, only metrics whose names start with node_ or up).
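To make concrete what the aggregation node asks for on every scrape, here is a minimal sketch of the request that the `federate` job above corresponds to. The helper name `federate_url` is hypothetical, and the `http` scheme is an assumption based on the job not setting `scheme`.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def federate_url(target, matchers, scheme="http"):
    """Build the /federate URL for one target (hypothetical helper).

    Each selector is sent as its own match[] query parameter.
    """
    query = urlencode([("match[]", m) for m in matchers])
    return f"{scheme}://{target}/federate?{query}"


# Selector copied from the job above; target is one of the collection nodes.
url = federate_url("10.0.0.69:9090", ['{__name__=~"node_.*|up.*"}'])
print(url)

# Decoding the query string again shows the selector survives the round trip.
print(parse_qs(urlsplit(url).query)["match[]"])
```

Every series matching that one selector comes back in a single response, which is why the response grows with the number of monitored hosts.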
Grouped ingestion
Because all collected host metrics carry the instance label, the pulls can be grouped by network segment. Each pull then fetches a moderate number of metrics instead of one huge batch. The configuration is as follows:
  # Group 1
  - job_name: 'federate_0'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Pulls metrics for servers whose IP starts with 10.0.4
        - '{__name__=~"node_.*|up.*",instance=~"10.0.4.*9100"}'
    static_configs:
      - targets:
          - '10.0.0.69:9090'
          - '10.0.0.70:9090'
        labels:
          cluster: XXXX系统
    tls_config:
      insecure_skip_verify: true
  # Group 2
  - job_name: 'federate_1'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Pulls metrics for servers whose IP starts with 10.0.6
        - '{__name__=~"node_.*|up.*",instance=~"10.0.6.*9100"}'
    static_configs:
      - targets:
          - '10.0.0.69:9090'
          - '10.0.0.70:9090'
        labels:
          cluster: XXXX系统
    tls_config:
      insecure_skip_verify: true
  # Group 3
  - job_name: 'federate_2'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Pulls metrics for servers whose IP starts with 10.0.7 or 10.0.8
        - '{__name__=~"node_.*|up.*",instance=~"10.0.7.*9100|10.0.8.*9100"}'
    static_configs:
      - targets:
          - '10.0.0.69:9090'
          - '10.0.0.70:9090'
        labels:
          cluster: XXXX系统
    tls_config:
      insecure_skip_verify: true
  # Group 4
  - job_name: 'federate_3'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        # Pulls metrics for servers whose IP starts with 10.30
        - '{__name__=~"node_.*|up.*",instance=~"10.30.*9100"}'
    static_configs:
      - targets:
          - '10.0.0.69:9090'
          - '10.0.0.70:9090'
        labels:
          cluster: XXXX系统
    tls_config:
      insecure_skip_verify: true
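Before relying on a grouping like this, it can help to check which job each instance value lands in. The sketch below copies the four instance regexes from the jobs above and tests them against a few made-up instance values (the sample addresses are hypothetical). It assumes Python's `re.fullmatch` is a fair stand-in for Prometheus's fully anchored regex matching, which holds for patterns this simple.

```python
import re

# instance regexes copied verbatim from the four federate_N jobs above
GROUPS = {
    "federate_0": r"10.0.4.*9100",
    "federate_1": r"10.0.6.*9100",
    "federate_2": r"10.0.7.*9100|10.0.8.*9100",
    "federate_3": r"10.30.*9100",
}


def matching_jobs(instance):
    """Return the jobs whose instance regex matches the whole label value.

    Prometheus anchors label regexes at both ends, so fullmatch is used
    instead of search.
    """
    return [job for job, pattern in GROUPS.items() if re.fullmatch(pattern, instance)]


# Hypothetical instance values, for illustration only.
samples = [
    "10.0.4.15:9100",   # intended member of group 1
    "10.0.8.40:9100",   # intended member of group 3
    "10.30.1.9:9100",   # intended member of group 4
    "10.0.40.7:9100",   # also matches group 1, because "." matches any character
    "10.0.5.1:9100",    # matches no group, so it would not be federated at all
]
for instance in samples:
    print(instance, matching_jobs(instance))
```

Two things are worth checking in a real environment: that every instance matches exactly one job, and that the unescaped dots do not pull in neighbouring segments such as 10.0.40.x. If stricter matching is wanted, the dots can be escaped; inside a double-quoted PromQL string that means writing `\\.` rather than `\.`.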
Then save the configuration and reload the Prometheus service. Observing metric ingestion again, the gaps no longer appear.
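The post does not say how the reload was performed. One option, sketched here, is Prometheus's HTTP reload endpoint, which is only available when the server is started with `--web.enable-lifecycle`; sending SIGHUP to the process is the other common route. The function names are hypothetical, and port 9090 on the aggregation node is an assumption.

```python
import urllib.request


def build_reload_request(base_url):
    """Create (but do not send) the POST request for the /-/reload endpoint."""
    return urllib.request.Request(base_url.rstrip("/") + "/-/reload", method="POST")


def reload_prometheus(base_url, timeout=10.0):
    """Send the reload request and return the HTTP status code.

    Raises urllib.error.HTTPError if the server rejects the request,
    for example when the lifecycle API is not enabled.
    """
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status


# Inspect the request without touching the network.
request = build_reload_request("http://10.0.0.71:9090")
print(request.get_method(), request.full_url)
```

Calling `reload_prometheus("http://10.0.0.71:9090")` would then trigger the reload on the aggregation node.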
The scrape duration also improved dramatically.
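One way to check the scrape duration yourself is the `scrape_duration_seconds` series that Prometheus records for every target it scrapes. The sketch below builds an instant-query URL for the federation jobs and extracts the per-target values from a decoded response. The helper names and the aggregation node address with port 9090 are assumptions, and no measured values from the original environment are reproduced here.

```python
from urllib.parse import urlencode

# Duration of each federation scrape, as seen by the aggregation node.
QUERY = 'scrape_duration_seconds{job=~"federate.*"}'


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against the HTTP API."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


def durations(response):
    """Map (job, instance) to seconds from a decoded instant-vector response."""
    result = {}
    for series in response["data"]["result"]:
        labels = series["metric"]
        # Each sample is [unix_timestamp, "value_as_string"].
        result[(labels.get("job"), labels.get("instance"))] = float(series["value"][1])
    return result


print(instant_query_url("http://10.0.0.71:9090", QUERY))
```

Comparing these values with the job's `scrape_timeout` shows how much headroom each group has left.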
Summary
Whether Prometheus collects metrics through federation or on a single node, the amount of data ingested from each scrape endpoint per scrape should be kept as small as possible. That keeps each scrape within its latency budget. Otherwise network transfer takes up most of the pull time and causes gaps in the monitored metrics.