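I didn't finish the previous post because I was rushing to get off work that day, haha, so let's pick up where we left off.
Earlier we edited the Prometheus yml file, added a disk alert rule under the rules directory, and added the labels for the monitored machines under monitor-config. The complete layout should look like this.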
[root@bigdata01 prometheus]# ll    # layout of the prometheus directory
total 161708
drwxr-xr-x  2 root root       38 Dec  4 22:02 console_libraries
drwxr-xr-x  2 root root      173 Dec  4 22:02 consoles
drwxr-xr-x 11 root root      308 Dec 10 09:00 data
-rw-r--r--  1 root root    11357 Dec  4 22:02 LICENSE
drwxr-xr-x  3 root root       22 Dec  8 15:00 monitor-config    # holds the target and label files for monitored machines
-rw-r--r--  1 root root     3420 Dec  4 22:02 NOTICE
-rwxr-xr-x  1 root root 87758460 Dec  4 22:02 prometheus
-rw-r--r--  1 root root     1447 Dec  9 15:48 prometheus.yml
-rwxr-xr-x  1 root root 77805320 Dec  4 22:02 promtool
drwxr-xr-x  3 root root       18 Dec  9 15:45 rules    # holds the alert rules
[root@bigdata01 prometheus]# pwd
/opt/monitor/prometheus
[root@bigdata01 prometheus]# ll monitor-config/A-getway/
total 8
-rw-r--r-- 1 root root 82 Dec  9 09:36 192.168.1.1.yml
-rw-r--r-- 1 root root 82 Dec  8 15:03 192.168.1.2.yml
[root@bigdata01 prometheus]# ll rules/host/
total 4
-rw-r--r-- 1 root root 649 Dec  9 15:43 disk_use.yml
[root@bigdata01 prometheus]# cat prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['192.168.1.1:9093']
          # - alertmanager:9093
######################################################################################################
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "/opt/monitor/prometheus/rules/host/*.yml"
#####################################################################################################
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['192.168.1.1:9090']
#####################################################################################################
  - job_name: 'A-getway'
    file_sd_configs:
      - files: ['/opt/monitor/prometheus/monitor-config/A-getway/*.yml']
        refresh_interval: 5s
Every change to the Prometheus configuration has to be hot-reloaded, and that requires the startup flag --web.enable-lifecycle.
This is the start command I normally use:
/opt/monitor/prometheus/prometheus --config.file="/opt/monitor/prometheus/prometheus.yml" --web.enable-lifecycle > /opt/monitor/prometheus/prometheus.log 2>&1 &
After changing the configuration file, reload it:
curl -XPOST http://192.168.1.1:9090/-/reload
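Before reloading, it can help to validate the configuration so that a typo shows up as a readable error. A minimal sketch, assuming the promtool binary that ships in the same directory as prometheus and the paths and address used in this post; the function name safe_reload and the PROMTOOL, CONFIG and PROM_URL variables are my own naming for illustration:

```shell
#!/usr/bin/env bash
# Validate prometheus.yml (and the rule files it references) before reloading.
# PROMTOOL, CONFIG and PROM_URL are illustrative variables; the defaults are
# the paths and address used in this post.
safe_reload() {
  local promtool="${PROMTOOL:-/opt/monitor/prometheus/promtool}"
  local config="${CONFIG:-/opt/monitor/prometheus/prometheus.yml}"
  local url="${PROM_URL:-http://192.168.1.1:9090}"

  # promtool exits non-zero when the configuration is invalid.
  if ! "$promtool" check config "$config"; then
    echo "config check failed, not reloading" >&2
    return 1
  fi

  # -f makes curl return non-zero on an HTTP error response.
  curl -fsS -XPOST "$url/-/reload"
}
```

After sourcing the function, calling safe_reload in place of the bare curl command should only trigger the reload when the check passes.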
Look at the contents of a file under monitor-config:
[root@bigdata01 prometheus]# cat monitor-config/A-getway/192.168.1.2.yml
- targets: [ "192.168.1.2:9275" ]
  labels:
    group: "A"
    kind: "A-cloudera"
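Since every monitored machine gets its own small file in this layout, the files can be generated rather than typed by hand. A sketch under that assumption; write_target_file is a hypothetical helper, and its arguments simply mirror the fields of the file shown above:

```shell
#!/usr/bin/env bash
# Write a file_sd target file named <ip>.yml into a job directory.
# write_target_file is a hypothetical helper, not part of Prometheus.
write_target_file() {
  local dir="$1" ip="$2" port="$3" group="$4" kind="$5"
  mkdir -p "$dir"
  cat > "$dir/$ip.yml" <<EOF
- targets: [ "$ip:$port" ]
  labels:
    group: "$group"
    kind: "$kind"
EOF
}
```

For example, `write_target_file /opt/monitor/prometheus/monitor-config/A-getway 192.168.1.2 9275 A A-cloudera` would recreate the file above. Prometheus re-reads file_sd files on its own (here at the latest every 5s, per refresh_interval), so adding a target file to an existing job should not need a reload.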
Look at the disk alert rules under rules:
[root@bigdata01 prometheus]# cat rules/host/disk_use.yml
groups:
- name: host_disk
  rules:
  - alert: NodediskUsage
    expr: round(disk_used_percent{kind="jkj"}) >= 89
    for: 1m
    labels:
      sort: host_disk
      level: severity
    annotations:
      summary: "{{$labels.instance}}: High disk usage"
      description: "disk {{$labels.path}} already use {{ $value }}%,please check it"
  - alert: NodediskUsagea_gx
    expr: round(disk_used_percent{kind="gx"}) > 85
    for: 1m
    labels:
      sort: host_disk
      level: severity
    annotations:
      summary: "{{$labels.instance}}: High disk usage"
      description: "disk {{ $labels.path }} already use {{ $value }}%,please check it"
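One detail worth spelling out: PromQL's round() rounds to the nearest integer and resolves ties by rounding up, so the first rule's `>= 89` already matches a raw value of 88.5. The sketch below mimics that comparison in shell with awk for non-negative values; would_fire is a hypothetical helper for illustration, not a Prometheus tool:

```shell
#!/usr/bin/env bash
# Mimic `round(value) >= threshold` for non-negative values.
# would_fire is a hypothetical helper, not a Prometheus tool.
would_fire() {
  local value="$1" threshold="$2"
  # int(v + 0.5) rounds half up for v >= 0, like PromQL round().
  awk -v v="$value" -v t="$threshold" 'BEGIN { exit !(int(v + 0.5) >= t) }'
}

would_fire 88.4 89 && echo "88.4 fires" || echo "88.4 does not fire"
would_fire 88.5 89 && echo "88.5 fires" || echo "88.5 does not fire"
```

By the same logic the second rule, `> 85`, starts matching at a raw value of 85.5, because 85.4 rounds to 85, which is not greater than 85.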
To verify in Prometheus, click Status > Rules to see the disk alert rules we configured.
Click Status > Targets to see the monitored machines we configured.
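The same check can be scripted against the HTTP API instead of the web UI. A rough sketch, assuming the /api/v1/targets endpoint returns compact JSON with a "health" field per target; count_down_targets is a hypothetical helper, and the grep-based parsing is deliberately crude (a real JSON parser such as jq would be more robust):

```shell
#!/usr/bin/env bash
# Count targets reported as down in a /api/v1/targets response read from stdin.
# count_down_targets is a hypothetical helper; it relies on the response
# containing the literal text "health":"down" for each unhealthy target.
count_down_targets() {
  grep -o '"health":"down"' | wc -l | tr -d ' '
}
```

Usage would look like `curl -s http://192.168.1.1:9090/api/v1/targets | count_down_targets`, with the address taken from this post.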
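On the Targets page, 192.168.1.1 currently reports a connection refused error, because I configured it but have not started telegraf on that machine yet.
Next I create a B-getway directory under monitor-config with a 192.168.1.1.yml file in it, then add a job to the Prometheus configuration file prometheus.yml: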
#####################################################################################################
  - job_name: 'A-getway'
    file_sd_configs:
      - files: ['/opt/monitor/prometheus/monitor-config/A-getway/*.yml']
        refresh_interval: 5s
  - job_name: 'B-getway'
    file_sd_configs:
      - files: ['/opt/monitor/prometheus/monitor-config/B-getway/*.yml']
        refresh_interval: 5s
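Each new group follows the same pattern, so the stanza can be printed from a template and pasted under scrape_configs. A small sketch; print_job_stanza and the BASE variable are hypothetical names, and the default path is the one used in this post:

```shell
#!/usr/bin/env bash
# Print a file_sd scrape job stanza for one group directory.
# print_job_stanza is a hypothetical helper; BASE defaults to the
# monitor-config path used in this post.
print_job_stanza() {
  local name="$1"
  local base="${BASE:-/opt/monitor/prometheus/monitor-config}"
  cat <<EOF
  - job_name: '$name'
    file_sd_configs:
      - files: ['$base/$name/*.yml']
        refresh_interval: 5s
EOF
}

print_job_stanza B-getway
```

The output keeps the two-space indentation that entries under scrape_configs need. Unlike adding a target file to an existing job, adding a new job changes prometheus.yml itself, so it does require a reload.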
Hot-reload again:
curl -XPOST http://192.168.1.1:9090/-/reload
Check the page again: OK, the new job is there.
All that is left is to configure and start telegraf on 192.168.1.1.
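Original content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
In case of infringement, contact cloudcommunity@tencent.com to have it removed.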