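1. Deploying Alertmanager
Prometheus evaluates alerting rules and, when a rule fires, sends the alert to Alertmanager. Alertmanager is then responsible for delivering the notification to the monitoring administrators.
Create the docker-compose.yml file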
version: "3"
services:
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - 9093:9093
    networks:
      - monitor_overlay
    volumes:
      - "/etc/localtime:/etc/localtime:ro"
      - "$WORK_HOME_ALERTMANAGER/config/alertmanager.yml:/etc/alertmanager/alertmanager.yml"
    deploy:
      placement:
        constraints: [node.hostname==node150]
      restart_policy:
        condition: any
        delay: 5s
        max_attempts: 3
networks:
  monitor_overlay:
    external: true
Start the service
docker stack deploy -c docker-compose.yml 150
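Once the stack is deployed and alertmanager.yml is in place, it helps to confirm that Alertmanager is actually answering before moving on. The following is a minimal sketch, not part of the original setup: AM_URL is an assumed variable (its default reuses the address from the testing section), and the service name 150_alertmanager is inferred from the stack name 150.

```shell
#!/usr/bin/env bash
# Sketch: check that Alertmanager responds on its health endpoint.
# AM_URL is an assumption; adjust it to wherever port 9093 is published.
AM_URL="${AM_URL:-http://192.168.0.150:9093}"

alertmanager_healthy() {
  # /-/healthy returns HTTP 200 while the process is up.
  # -f makes curl exit non-zero on HTTP errors, so the exit code is the result.
  curl -fsS --max-time 5 "${AM_URL}/-/healthy" >/dev/null
}

# Usage:
#   docker service ls --filter name=150_alertmanager
#   alertmanager_healthy && echo "alertmanager is up"
```

If the check fails, `docker service ps 150_alertmanager` usually shows whether the task was rejected by the placement constraint or exited because of a bad configuration file.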
Configure email alerts (alertmanager.yml)
global:
  resolve_timeout: 5m
  smtp_from: '439757183@qq.com'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: '439757183@qq.com'
  smtp_auth_password: 'kkzmprcswatubjab'
  smtp_require_tls: false
  smtp_hello: 'qq.com'
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'wt439757183@126.com'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
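2. Configuring alerting in Prometheus
For alerts produced by Prometheus rules to reach Alertmanager, Prometheus must be told where Alertmanager is and which rule files to load.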
cat prometheus.yml
global:
  scrape_interval: 15s
  external_labels:
    monitor: 'microservice-monitor'
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090','cnode150:8080','cnode151:8080','cnode152:8080','enode150:9100','enode151:9100','enode152:9100']
  - job_name: 'nacos'
    scrape_interval: 5s
    metrics_path: '/nacos/actuator/prometheus'
    static_configs:
      - targets: ['nacos:8848']
  - job_name: 'wrapper-provider'
    scrape_interval: 5s
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['provider:58080']
  - job_name: 'wrapper-hello'
    scrape_interval: 5s
    static_configs:
      - targets: ['hello:58080']
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
rule_files:
  - "/etc/prometheus/rules/*.rules"
The Prometheus compose file also needs a volume mapping for the rules directory
- "$WORK_HOME_PROMETHEUS/rules:/etc/prometheus/rules"
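The configuration above sets up the connection to Alertmanager. The next step is to define alerting rules for a specific application; the example here monitors the availability of nacos. Because rule_files matches *.rules, the rule file name must end in .rules.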
cat node-up.rules
groups:
  - name: node-up
    rules:
      - alert: node-up
        expr: up{job="nacos"} == 0
        for: 15s
        labels:
          severity: "1"
          team: node
        annotations:
          summary: "{{ $labels.instance }} has been down for more than 15s!"
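After Prometheus has been restarted with the new configuration and rule file, its HTTP API can confirm that both were picked up. This is a rough sketch under assumptions: PROM_URL is a hypothetical variable, and its default assumes Prometheus is published on port 9090 of the same host, which the configuration above does not state. The checks use plain grep on the JSON responses of /api/v1/alertmanagers and /api/v1/rules rather than a JSON parser.

```shell
#!/usr/bin/env bash
# Sketch: confirm Prometheus knows about Alertmanager and has loaded a rule.
# PROM_URL is an assumption; point it at your Prometheus instance.
PROM_URL="${PROM_URL:-http://192.168.0.150:9090}"

alertmanager_registered() {
  # /api/v1/alertmanagers lists the Alertmanager endpoints Prometheus sends to.
  curl -fsS --max-time 5 "${PROM_URL}/api/v1/alertmanagers" \
    | grep -q 'alertmanager:9093'
}

rule_loaded() {
  # $1 = rule (or group) name, e.g. node-up.
  # /api/v1/rules returns every loaded rule group as JSON.
  curl -fsS --max-time 5 "${PROM_URL}/api/v1/rules" \
    | grep -q "\"name\":\"$1\""
}

# Usage:
#   alertmanager_registered && echo "alertmanager target registered"
#   rule_loaded node-up && echo "rule node-up loaded"
```

If rule_loaded fails, the usual causes are a file name that does not match *.rules or a missing volume mapping for the rules directory.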
3. Testing
First check the state of nacos under normal conditions
Open http://192.168.0.150:9093 to see whether an alert has been received by Alertmanager
Stop the nacos service
docker service rm 151_nacos
An alert email arrives
Start the nacos service again
docker stack deploy -c docker-compose.yml 151
An alert-resolved email arrives
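The email path can also be exercised without taking nacos down, by posting a synthetic alert straight to Alertmanager's v2 API. The sketch below is an illustration only: AM_URL, build_test_alert, and send_test_alert are hypothetical names, and the labels simply mirror the node-up rule. Because no endsAt is sent, Alertmanager should treat the alert as resolved once resolve_timeout (5m in this setup) passes without it being re-sent.

```shell
#!/usr/bin/env bash
# Sketch: push a hand-made alert into Alertmanager to test notifications.
# AM_URL is an assumption; adjust it to your Alertmanager address.
AM_URL="${AM_URL:-http://192.168.0.150:9093}"

build_test_alert() {
  # $1 = value for the instance label (defaults to the nacos target).
  local instance="${1:-nacos:8848}"
  cat <<EOF
[{"labels":{"alertname":"node-up","severity":"1","team":"node","instance":"${instance}"},
  "annotations":{"summary":"${instance} test alert","description":"synthetic alert for testing"}}]
EOF
}

send_test_alert() {
  # POST the JSON array to the v2 alerts endpoint.
  build_test_alert "$@" \
    | curl -fsS -X POST -H 'Content-Type: application/json' \
        --data-binary @- "${AM_URL}/api/v2/alerts"
}

# Usage:
#   send_test_alert            # uses instance nacos:8848
#   send_test_alert demo:1234  # custom instance label
```

The alert should appear in the Alertmanager UI almost immediately, and the email follows after group_wait.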
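4. Custom email template
Sending alert emails with a custom template requires changes to both the Alertmanager and Prometheus configuration, plus a new template file.
Define the email template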
cat email.tmpl
{{ define "email.from" }}439757183@qq.com{{ end }}
{{ define "email.to" }}wt439757183@126.com{{ end }}
{{ define "email.to.html" }}
{{ range .Alerts }}
=========start==========<br>
Alert source: prometheus_alert <br>
Severity: level {{ .Labels.severity }} <br>
Alert name: {{ .Labels.alertname }} <br>
Affected host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Details: {{ .Annotations.description }} <br>
Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
=========end==========<br>
{{ end }}
{{ end }}
Update the Alertmanager configuration
cat alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: '439757183@qq.com'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: '439757183@qq.com'
  smtp_auth_password: 'kkzmprcswatubjab'
  smtp_require_tls: false
  smtp_hello: 'qq.com'
templates:
  - '/etc/alertmanager/email.tmpl'
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: '{{ template "email.to" . }}'
        html: '{{ template "email.to.html" . }}'
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
Add a volume mapping for the template to the Alertmanager compose file
- "$WORK_HOME_ALERTMANAGER/config/email.tmpl:/etc/alertmanager/email.tmpl"
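Restart Alertmanager
The template references a description annotation, so the Prometheus rules file must define it as well. Add the following under annotations:
description: "{{ $labels.instance }} stopped unexpectedly. Please investigate!"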
Test
Alert email
Resolved email
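5. Points to note
1. The time zone and clock inside the containers must be accurate
2. The time format in the custom email template must be written with Go's reference time (2006-01-02 15:04:05); a layout string built from any other date produces wrong output
3. Alert emails use the status firing for an active alert and resolved for an alert that has cleared
4. Within one alert lifecycle (from the first firing until resolution), the time shown in the email is the time the alert first fired. The timestamp is rendered in UTC, so add 8 hours to get China Standard Time: the trigger time of 02:48 in the example corresponds to 10:48 in China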
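The 8-hour adjustment in the last point can be checked from a shell. This is a small sketch assuming GNU date (the -d option behaves differently on BSD and macOS); utc_to_cst is a hypothetical helper name, and the date in the usage line is arbitrary since the example only gives the time of day.

```shell
#!/usr/bin/env bash
# Sketch: convert a UTC timestamp from an alert email to China Standard Time.
# Requires GNU date.

utc_to_cst() {
  # $1 = timestamp in the template's format, e.g. "2020-01-01 02:48:00".
  # The POSIX TZ string CST-8 means "8 hours east of UTC" (the sign is inverted
  # by convention), so no time zone database is needed.
  TZ='CST-8' date -d "$1 UTC" '+%Y-%m-%d %H:%M:%S'
}

# Usage:
#   utc_to_cst '2020-01-01 02:48:00'   # → 2020-01-01 10:48:00
```

On systems with a time zone database installed, TZ=Asia/Shanghai gives the same result and reads more clearly.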