Integrating third-party services into microservices with DevOps: an Alertmanager email alerting configuration example

1. Deploying Alertmanager

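Prometheus evaluates the alerting rules you configure and, when one fires, sends an alert request to Alertmanager. Alertmanager then delivers the relevant information to the monitoring administrators.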
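To make the "alert request" concrete, here is a minimal Python sketch of the kind of JSON body that Alertmanager's v2 API accepts at `/api/v2/alerts`. It is not what Prometheus itself runs. The helper names `build_alert` and `post_alerts` are hypothetical, and the label values are only examples.

```python
import json
from datetime import datetime, timezone
from urllib.request import Request, urlopen


def build_alert(alertname, instance, severity, summary, starts_at=None):
    """Build one alert object in the shape used by POST /api/v2/alerts."""
    starts_at = starts_at or datetime.now(timezone.utc)
    return {
        "labels": {
            "alertname": alertname,
            "instance": instance,
            "severity": severity,
        },
        "annotations": {"summary": summary},
        # Timezone-aware isoformat() yields an RFC 3339 timestamp.
        "startsAt": starts_at.isoformat(),
    }


def post_alerts(base_url, alerts):
    """Send a list of alerts to Alertmanager and return the HTTP status code."""
    body = json.dumps(alerts).encode("utf-8")
    request = Request(
        f"{base_url}/api/v2/alerts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request, timeout=5) as response:
        return response.status
```

Calling `post_alerts("http://<alertmanager-host>:9093", [build_alert(...)])` by hand can be a convenient way to exercise the email route before any Prometheus rule exists.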

Create the docker-compose.yml file

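# Top-level keys needed by docker stack deploy (the version value is an assumption)
version: "3"
services:
  alertmanager: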
    image: prom/alertmanager:latest
    ports:
      - 9093:9093
    networks:
      - monitor_overlay
    volumes:
      - "/etc/localtime:/etc/localtime:ro"
      - "$WORK_HOME_ALERTMANAGER/config/alertmanager.yml:/etc/alertmanager/alertmanager.yml"
    deploy:
      placement:
        constraints: [node.hostname==node150]
      restart_policy:
        condition: any
        delay: 5s
        max_attempts: 3
networks:
  monitor_overlay:
    external: true

Start the service
docker stack deploy -c docker-compose.yml 150

Configure email alerting in alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_from: '439757183@qq.com'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: '439757183@qq.com'
  smtp_auth_password: '<your-smtp-auth-password>'
  smtp_require_tls: false
  smtp_hello: 'qq.com'
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'
receivers:
- name: 'email'
  email_configs:
  - to: 'wt439757183@126.com'
    send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
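The `inhibit_rules` block is easy to misread, so here is a small Python sketch of the matching logic as I understand it: a target alert is muted when some other firing alert matches `source_match`, the target matches `target_match`, and both carry the same values for every label listed in `equal` (a missing label is treated as an empty value). The function names are hypothetical and this is a simplification, not Alertmanager's actual code.

```python
def matches(labels, matchers):
    """True if every matcher key has exactly the expected value."""
    return all(labels.get(key, "") == value for key, value in matchers.items())


def is_inhibited(target, firing_alerts, rule):
    """Return True if `target` would be muted by `rule` given the firing alerts."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing_alerts:
        if source is target:
            continue  # an alert is not compared against itself in this sketch
        if not matches(source, rule["source_match"]):
            continue
        # Labels in `equal` must agree; absent labels count as empty strings.
        if all(source.get(name, "") == target.get(name, "") for name in rule["equal"]):
            return True
    return False


RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}
```

Note that this rule only acts on alerts labeled `severity: critical` or `severity: warning`. An alert carrying a different severity value, such as `1`, is neither a source nor a target, so it is never inhibited by this rule.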

2. Prometheus alerting configuration

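For the alerting rules configured in Prometheus to reach Alertmanager, Prometheus must be told where Alertmanager is and where the rule files live.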
cat prometheus.yml

global:
  scrape_interval: 15s
  external_labels:
    monitor: 'microservice-monitor'
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090','cnode150:8080','cnode151:8080','cnode152:8080','enode150:9100','enode151:9100','enode152:9100']
  - job_name: 'nacos'
    scrape_interval: 5s
    metrics_path: '/nacos/actuator/prometheus'
    static_configs:
      - targets: ['nacos:8848']
  - job_name: 'wrapper-provider'
    scrape_interval: 5s
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['provider:58080']
  - job_name: 'wrapper-hello'
    scrape_interval: 5s
    static_configs:
      - targets: ['hello:58080']

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093
      
rule_files:
  - "/etc/prometheus/rules/*.rules"

The rules directory also has to be mounted in the Prometheus compose file
- "$WORK_HOME_PROMETHEUS/rules:/etc/prometheus/rules"

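The configuration above sets up the connection to Alertmanager. Next, alerting rules have to be defined for the specific application. Here the availability of nacos is monitored as an example.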
cat node-up.rules

groups:
- name: node-up
  rules:
  - alert: node-up
    expr: up{job="nacos"} == 0
    for: 15s
    labels:
      severity: 1
      team: node
    annotations:
      summary: "{{ $labels.instance }} has been down for more than 15s!"
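The `for: 15s` clause means the expression has to stay true across rule evaluations for at least 15 seconds before the alert moves from pending to firing. The Python sketch below simulates that state machine under simplified assumptions. The function name `alert_states` and the sample timestamps are hypothetical.

```python
def alert_states(samples, for_seconds):
    """Simulate inactive / pending / firing for one alert.

    samples: list of (timestamp_in_seconds, expr_is_true), one per rule evaluation.
    """
    states = []
    active_since = None  # timestamp of the first evaluation in the current true streak
    for timestamp, expr_is_true in samples:
        if not expr_is_true:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_long_enough = timestamp - active_since >= for_seconds
        states.append("firing" if held_long_enough else "pending")
    return states
```

Because the state only changes when the rule is evaluated, the real delay depends on the evaluation interval. The prometheus.yml shown above does not set `evaluation_interval`, and Prometheus defaults it to 1m, so with that default the alert would typically start firing at the evaluation after the one that first saw nacos down, rather than exactly 15 seconds later.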

3. Testing

First, check the status of nacos while it is running normally

Open http://192.168.0.150:9093 to check whether Alertmanager has received the alert
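The same check can be scripted against Alertmanager's v2 API instead of the web UI. The sketch below fetches `/api/v2/alerts` and reduces each alert to its name, instance, and state. The helper names are hypothetical, and only the fields actually read here are assumed to be present in the response.

```python
import json
from urllib.request import urlopen


def fetch_alerts(base_url):
    """Return the raw JSON text of GET /api/v2/alerts."""
    with urlopen(f"{base_url}/api/v2/alerts", timeout=5) as response:
        return response.read().decode("utf-8")


def summarize_alerts(json_text):
    """Reduce the alert list to (alertname, instance, state) tuples."""
    rows = []
    for alert in json.loads(json_text):
        labels = alert.get("labels", {})
        state = alert.get("status", {}).get("state", "unknown")
        rows.append((labels.get("alertname", ""), labels.get("instance", ""), state))
    return rows
```

For example, `summarize_alerts(fetch_alerts("http://192.168.0.150:9093"))` should return an empty list while nacos is healthy and one entry for the `node-up` alert once it fires.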

Stop the nacos service
docker service rm 151_nacos
An alert email is received

Start the nacos service again
docker stack deploy -c docker-compose.yml 151
An email announcing that the alert is resolved is received

4. Custom email template

Sending alert emails with a custom template requires changes to both the Alertmanager and the Prometheus configuration, plus a new template file.

Define the email template
cat email.tmpl

{{ define "email.from" }}439757183@qq.com{{ end }}
{{ define "email.to" }}wt439757183@126.com{{ end }}
{{ define "email.to.html" }}
{{ range .Alerts }}
=========start==========<br>
Alert source: prometheus_alert <br>
Severity: level {{ .Labels.severity }} <br>
Alert type: {{ .Labels.alertname }} <br>
Affected host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Details: {{ .Annotations.description }} <br>
Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }} <br>
=========end==========<br>
{{ end }}
{{ end }}

Modify the Alertmanager configuration
cat alertmanager.yml

global:
  resolve_timeout: 5m
  smtp_from: '439757183@qq.com'
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: '439757183@qq.com'
  smtp_auth_password: '<your-smtp-auth-password>'
  smtp_require_tls: false
  smtp_hello: 'qq.com'
templates:
  - '/etc/alertmanager/email.tmpl'
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 5s
  repeat_interval: 5m
  receiver: 'email'
receivers:
- name: 'email'
  email_configs:
  - to: '{{ template "email.to" . }}'
    html: '{{ template "email.to.html" . }}'
    send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

Add the template mount to the Alertmanager compose file
- "$WORK_HOME_ALERTMANAGER/config/email.tmpl:/etc/alertmanager/email.tmpl"

Restart Alertmanager

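The template reads a description annotation, so the Prometheus rules file also has to provide it. Add the following under annotations:
description: "{{ $labels.instance }} has stopped unexpectedly! Please pay close attention!!!"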

Test
An alert email and a recovery email are received, both rendered with the custom template.

5. Things to watch out for

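1. The time zone and clock inside the containers must be correct.
2. The time format in the custom email template must be written with Go's reference time, exactly as 2006-01-02 15:04:05. Any other date in the layout string will not format correctly.
3. Alert emails use firing and resolved to indicate an alert and its resolution.
4. For one alert lifecycle (from the time the alert starts until it is resolved), the time in the email is the time the alert first fired. The template prints this timestamp as is, in UTC in this example, so add 8 hours to get China time. The trigger time in the example is 2:48 a.m., which is 10:48 a.m. in China.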
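The 8-hour adjustment in the last point can be checked with a few lines of Python. This sketch assumes the time printed in the email is UTC and uses a fixed UTC+8 offset for China. The `strftime` pattern is the Python counterpart of the Go layout `2006-01-02 15:04:05`. The function name and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# China Standard Time is a fixed UTC+8 offset with no daylight saving time.
CST = timezone(timedelta(hours=8), "CST")

# Python equivalent of the Go layout "2006-01-02 15:04:05".
LAYOUT = "%Y-%m-%d %H:%M:%S"


def utc_text_to_cst(text):
    """Parse a UTC timestamp from the email and return it as China time text."""
    utc_time = datetime.strptime(text, LAYOUT).replace(tzinfo=timezone.utc)
    return utc_time.astimezone(CST).strftime(LAYOUT)
```

A trigger time of 02:48 in the email therefore corresponds to 10:48 local time, and times from 16:00 UTC onward roll over to the next calendar day in China.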