Bringing third-party services into a DevOps microservice stack: an example of Alertmanager alerting through a WeCom (企业微信) group robot

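1. Deploy the open-source alerting system PrometheusAlert

The previous article described how Alertmanager receives alerts from Prometheus and forwards them to the administrator through QQ Mail's SMTP service. Sending alerts through WeChat or a similar app requires a sender that plays the same role as SMTP. Normally that means creating an application on the WeCom (企业微信) developer platform and obtaining its access key and related settings, which involves real-name verification of the company. For personal testing I instead picked PrometheusAlert, an open-source alerting project on GitHub, and used it to get a first working version of alerting through a WeCom group robot.

Pull the image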
docker pull feiyu563/prometheus-alert:latest

Create the compose file
cat docker-compose.yml

version: "3.6"
services:
  prometheusalert-center:
    image: feiyu563/prometheus-alert:latest
    ports:
       - "8080:8080"
    networks:
      - monitor_overlay
    volumes:
      - "/etc/localtime:/etc/localtime:ro"
      - ./config:/app/conf
    deploy:
      placement:
        constraints: [node.hostname==node150]
      restart_policy:
        condition: any
        delay: 5s
        max_attempts: 3
networks:
  monitor_overlay:
    external: true

app.conf is the project's configuration file and works out of the box. The configuration below uses the WeCom robot as the example.
cat app.conf

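#---------------------↓Global settings-----------------------
appname = PrometheusAlert
#Listening port
httpport = 8080
runmode = dev
#Proxy, for example proxy = http://123.123.123.123:8080
proxy = 
#Enable JSON request bodies
copyrequestbody = true
#Alert message title
title=PrometheusAlert
#Link to the alerting platform
GraylogAlerturl=http://graylog.org
#DingTalk alerts: logo URL for firing alerts
logourl=https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png
#DingTalk alerts: logo URL for recovery messages
rlogourl=https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png
#SMS alert level (an SMS is sent when the level equals 3). Levels: 0 info, 1 warning, 2 moderately severe, 3 severe, 4 disaster
messagelevel=3
#Phone alert level (a voice call is made when the level equals 4). Levels: 0 info, 1 warning, 2 moderately severe, 3 severe, 4 disaster
phonecalllevel=4
#Default number to call (required for testing SMS and phone calls from the web page)
defaultphone=18327018707
#Phone notification on recovery: 0 = off, 1 = on
phonecallresolved=0
#Automatic alert suppression: for a given alert source only the first alert of the highest level is sent and the rest are muted, which reduces the number of messages from the same source and prevents alert storms. 0 = off, 1 = on
silent=0
#Log output target: file or console
logtype=file
#Log file path
logpath=logs/prometheusalertcenter.log
#Convert the time zone of Prometheus and Graylog alert messages to CST (do not enable if they are already in CST)
prometheus_cst_time=1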

#---------------------↓webhook-----------------------
#DingTalk alert channel; several channels can be enabled at once. 0 = off, 1 = on
open-dingding=1
#Default DingTalk robot URL
ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxx
#Mention everyone with @all: 0 = off, 1 = on
dd_isatall=1

#WeCom (企业微信) alert channel; several channels can be enabled at once. 0 = off, 1 = on
open-weixin=1
#Default WeCom robot URL
wxurl=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=9c26f445-4959-4bd3-8baf-fd1700ff4503

#Feishu alert channel; several channels can be enabled at once. 0 = off, 1 = on
open-feishu=0
#Default Feishu robot URL
fsurl=https://open.feishu.cn/open-apis/bot/hook/xxxxxxxxx


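#---------------------↓Tencent Cloud API-----------------------
#Tencent Cloud SMS alert channel; several channels can be enabled at once. 0 = off, 1 = on
open-txdx=0
#Tencent Cloud SMS API key
TXY_DX_appkey=xxxxx
#Tencent Cloud SMS template ID; the SMS template text can follow this example: prometheus告警:{1}
TXY_DX_tpl_id=xxxxx
#Tencent Cloud SMS sdk app id
TXY_DX_sdkappid=xxxxx
#Tencent Cloud SMS signature; fill in the signature that was approved for your account
TXY_DX_sign=腾讯云

#Tencent Cloud phone alert channel; several channels can be enabled at once. 0 = off, 1 = on
TXY_DH_open-txdh=0
#Tencent Cloud phone API key
TXY_DH_phonecallappkey=xxxxx
#Tencent Cloud phone template ID
TXY_DH_phonecalltpl_id=xxxxx
#Tencent Cloud phone sdk app id
TXY_DH_phonecallsdkappid=xxxxx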

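#---------------------↓Huawei Cloud API-----------------------
#Huawei Cloud SMS alert channel; several channels can be enabled at once. 0 = off, 1 = on
open-hwdx=0
#Huawei Cloud SMS API key
HWY_DX_APP_Key=xxxxxxxxxxxxxxxxxxxxxx
#Huawei Cloud SMS API secret
HWY_DX_APP_Secret=xxxxxxxxxxxxxxxxxxxxxx
#Huawei Cloud APP access address (endpoint address including the port)
HWY_DX_APP_Url=https://rtcsms.cn-north-1.myhuaweicloud.com:10743
#Huawei Cloud SMS template ID
HWY_DX_Templateid=xxxxxxxxxxxxxxxxxxxxxx
#Huawei Cloud signature name; it must be an approved signature whose type matches the template, so fill in your own
HWY_DX_Signature=华为云
#Huawei Cloud signature channel number
HWY_DX_Sender=xxxxxxxxxx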

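#---------------------↓Alibaba Cloud API-----------------------
#Alibaba Cloud SMS alert channel; several channels can be enabled at once. 0 = off, 1 = on
open-alydx=1
#AccessKey ID of the Alibaba Cloud SMS main account
ALY_DX_AccessKeyId=LTAI4GAbh6ankr8B95qzDHWL
#Alibaba Cloud SMS API secret
ALY_DX_AccessSecret=Ux1hLVZvkXSQRJGEStkDfsQdVdXIoO
#Alibaba Cloud SMS signature name
ALY_DX_SignName=prometheus
#Alibaba Cloud SMS template ID
ALY_DX_Template=SMS_202560580

#Alibaba Cloud phone alert channel; several channels can be enabled at once. 0 = off, 1 = on
open-alydh=0
#AccessKey ID of the Alibaba Cloud phone main account
ALY_DH_AccessKeyId=xxxxxxxxxxxxxxxxxxxxxx
#Alibaba Cloud phone API secret
ALY_DH_AccessSecret=xxxxxxxxxxxxxxxxxxxxxx
#Caller number displayed to the callee for Alibaba Cloud phone calls; it must be a number you have purchased
ALY_DX_CalledShowNumber=xxxxxxxxx
#Alibaba Cloud phone text-to-speech (TTS) template ID
ALY_DH_TtsCode=xxxxxxxx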

#---------------------↓Ronglian Cloud (容联云) API-----------------------
#Ronglian Cloud phone alert channel; several channels can be enabled at once. 0 = off, 1 = on
RLY_DH_open-rlydh=0
#Ronglian Cloud base API URL
RLY_URL=https://app.cloopen.com:8883/2013-12-26/Accounts/
#Ronglian Cloud console SID
RLY_ACCOUNT_SID=xxxxxxxxxxx
#Ronglian Cloud api-token
RLY_ACCOUNT_TOKEN=xxxxxxxxxx
#Ronglian Cloud app_id
RLY_APP_ID=xxxxxxxxxxxxx

#---------------------↓Email settings-----------------------
#Enable email: 0 = off, 1 = on
open-email=1
#Outgoing mail server address
Email_host=smtp.qq.com
#Outgoing mail server port
Email_port=465
#Email account
Email_user=xxxxxxx@qq.com
#Email password
Email_password=xxxxxx
#Email subject
Email_title=运维告警
#Default recipients
Default_emails=xxxxx@qq.com,xxxxx@qq.com

2. Prometheus alert rule configuration

cat rules/nacos.rules

groups:
- name: node_alert
  rules:
  - alert: nacos状态
    expr: up{job="nacos"} == 0
    labels:
      name: prometheusalertcenter
      level: 3
    annotations:
      description: "{{ $labels.instance }}无响应"
      wxurl: "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=9c26f445-4959-4bd3-8baf-fd1700ff4503" 

3. Alertmanager configuration

cat alertmanager.yml

global:
  resolve_timeout: 5m
route:
  group_by: ['instance']
  group_wait: 10m
  group_interval: 10s
  repeat_interval: 10m
  receiver: 'web.hook.prometheusalert'
receivers:
- name: 'web.hook.prometheusalert'
  webhook_configs:
  - url: 'http://prometheusalert-center:8080/prometheus/alert'
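The timing values in the route above decide when the first notification leaves Alertmanager. The sketch below is a simplified model, not Alertmanager's implementation: it assumes the alert group is created at roughly the alert's start time and only handles single-unit durations such as `10m`. The helper names `parse_duration`, `first_notification`, and `earliest_repeat` are made up for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical helper: only single-unit durations like "10m" or "10s" are handled.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_duration(text):
    """Turn '10m' into timedelta(minutes=10); raise on anything more complex."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


def first_notification(group_created, group_wait):
    """Approximate time of the first notification for a new alert group."""
    return group_created + parse_duration(group_wait)


def earliest_repeat(last_sent, repeat_interval):
    """Earliest time an unchanged, still-firing group is notified again."""
    return last_sent + parse_duration(repeat_interval)


# Assumption: the group is created when the alert starts firing.
created = datetime(2020, 9, 12, 7, 45, 51, tzinfo=timezone.utc)
first = first_notification(created, "10m")
print("first notification:", first.isoformat())
print("earliest repeat:   ", earliest_repeat(first, "10m").isoformat())
```

With `group_wait: 10m` the first message is held back for about ten minutes, which matches the delay observed in the test section of this article.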

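4. Testing

Once all the components above are configured, restart PrometheusAlert, Alertmanager, and Prometheus in that order,
then open the Prometheus and PrometheusAlert logs to confirm that everything is running normally: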

docker service logs -f --tail 100 monitor_prometheus

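Remove the nacos service.

After 10 minutes the alert arrives, here through the WeCom robot.

Start the nacos service again.

After 10 minutes the recovery message arrives.

See the screenshot below.
[Screenshot: alert and recovery messages for nacos in the WeCom group]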

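Note: the alert is re-sent every 10m, and the start time in each message is always the time the alert first fired, whereas the end time is the latest time, and in a recovery message it is the actual time of recovery. Do not confuse the two. In the screenshot above, the nacos service stopped responding at 15:45 and recovered at 16:09.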
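The times in the webhook payload are in UTC, while the message shows CST (UTC+8). As a rough illustration of that conversion, the sketch below takes the `startsAt` value that appears in the PrometheusAlert log in this section and converts it. `to_cst_text` is a hypothetical helper written for this example; it is not PrometheusAlert's `GetCSTtime`, and treating the all-zero `endsAt` as "not ended yet" is an interpretation based on the firing payload shown in the log.

```python
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))   # China Standard Time, UTC+8
ZERO_TIME = "0001-01-01T00:00:00Z"   # endsAt value seen while the alert is still firing


def parse_utc(ts):
    """Parse a UTC timestamp such as 2020-09-12T07:45:51.394923708Z.

    Fractional seconds are dropped because strptime's %f accepts at most
    six digits and the payload carries nine.
    """
    if not ts.endswith("Z"):
        raise ValueError("expected a UTC timestamp ending in 'Z'")
    seconds_part = ts[:-1].split(".")[0]
    parsed = datetime.strptime(seconds_part, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def to_cst_text(ts):
    """Render a payload timestamp in CST, or a marker if it is the zero time."""
    if ts == ZERO_TIME:
        return "(not ended yet)"
    return parse_utc(ts).astimezone(CST).strftime("%Y-%m-%d %H:%M:%S")


print(to_cst_text("2020-09-12T07:45:51.394923708Z"))  # 2020-09-12 15:45:51
print(to_cst_text(ZERO_TIME))                          # (not ended yet)
```

The converted start time, 15:45:51, agrees with the 15:45 mentioned in the note above.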

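The alert travels from Prometheus to Alertmanager, Alertmanager forwards it to PrometheusAlert through a webhook, and PrometheusAlert sends it to the WeCom robot, producing the message shown in the screenshot above. The payload that PrometheusAlert received can be seen in its log: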
docker service logs -f --tail 50 150_prometheusalert-center

{"receiver":"web\\.hook\\.prometheusalert","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"nacos状态","instance":"container-nacos","job":"nacos","level":"3","monitor":"microservice-monitor","name":"prometheusalertcenter","service":"nacos-service"},"annotations":{"description":"container-nacos无响应","wxurl":"https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=9c26f445-4959-4bd3-8baf-fd1700ff4503"},"startsAt":"2020-09-12T07:45:51.394923708Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://05a84f367c7f:9090/graph?g0.expr=up%7Bjob%3D%22nacos%22%7D+%3D%3D+0\u0026g0.tab=1","fingerprint":"923114ba8fc2ff40"}],"groupLabels":{"instance":"container-nacos"},"commonLabels":{"alertname":"nacos状态","instance":"container-nacos","job":"nacos","level":"3","monitor":"microservice-monitor","name":"prometheusalertcenter","service":"nacos-service"},"commonAnnotations":{"description":"container-nacos无响应","wxurl":"https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=9c26f445-4959-4bd3-8baf-fd1700ff4503"},"externalURL":"http://90e7f728fcb7:9093","version":"4","groupKey":"{}:{instance=\"container-nacos\"}","truncatedAlerts":0}

The fields in the alert message that gets sent are taken from the payload above.
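To make the structure of that payload easier to see, here is a small sketch that reads a trimmed copy of it and pulls out the fields most useful for a message. `SAMPLE` keeps only a subset of the keys from the log above, and `summarize` is a hypothetical helper, not part of PrometheusAlert.

```python
import json

# Trimmed copy of the webhook payload from the log (a subset of its keys).
SAMPLE = """
{"status": "firing",
 "alerts": [{"status": "firing",
             "labels": {"alertname": "nacos状态", "instance": "container-nacos",
                        "job": "nacos", "level": "3"},
             "annotations": {"description": "container-nacos无响应"},
             "startsAt": "2020-09-12T07:45:51.394923708Z",
             "endsAt": "0001-01-01T00:00:00Z"}],
 "groupLabels": {"instance": "container-nacos"}}
"""


def summarize(payload_text):
    """Return one flat dict per alert with the commonly displayed fields."""
    payload = json.loads(payload_text)
    rows = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        rows.append({
            "status": alert.get("status"),
            "alertname": labels.get("alertname"),
            "level": labels.get("level"),
            "instance": labels.get("instance"),
            "description": annotations.get("description"),
            "startsAt": alert.get("startsAt"),
        })
    return rows


for row in summarize(SAMPLE):
    print(row)
```

Each entry under `alerts` carries its own `labels` and `annotations`, which is why a template has to loop over `alerts` rather than read the top-level fields only.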

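5. Sending alerts with a custom template

As shown above, the payload carries a lot of information. To choose which parts appear in the alert, you need a custom alert template. This is not a custom template in the usual Alertmanager sense, because the alerts in this article are sent by a third-party alerting system rather than by Alertmanager itself. PrometheusAlert provides its own custom-template alerting, and using it requires changing the Alertmanager webhook configuration.
Before making that change, test the custom template first; see the PrometheusAlert document "PrometheusAlert高级自定义消息模版" (advanced custom message templates).

5.1. Test the template and obtain the webhook

Before starting, temporarily change your Alertmanager configuration so that all alerts are forwarded to the PrometheusAlert custom-template endpoint.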
cat alertmanager.yml

global:
  resolve_timeout: 5m
route:
  group_by: ['instance']
  group_wait: 10m
  group_interval: 10s
  repeat_interval: 10m
  receiver: 'PrometheusAlert'
receivers:
- name: 'PrometheusAlert'
  webhook_configs:
  - url: 'http://prometheusalert-center:8080/prometheusalert'

Restart Alertmanager.
Check the PrometheusAlert log:
docker service logs -f --tail 100 150_prometheusalert-center

{"receiver":"PrometheusAlert","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"nacos状态","instance":"container-nacos","job":"nacos","level":"3","monitor":"microservice-monitor","name":"prometheusalertcenter","service":"nacos-service"},"annotations":{"description":"container-nacos无响应","wxurl":"https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=9c26f445-4959-4bd3-8baf-fd1700ff4503"},"startsAt":"2020-09-15T02:51:51.394923708Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://5bf9f650cb08:9090/graph?g0.expr=up%7Bjob%3D%22nacos%22%7D+%3D%3D+0\u0026g0.tab=1","fingerprint":"923114ba8fc2ff40"}],"groupLabels":{"instance":"container-nacos"},"commonLabels":{"alertname":"nacos状态","instance":"container-nacos","job":"nacos","level":"3","monitor":"microservice-monitor","name":"prometheusalertcenter","service":"nacos-service"},"commonAnnotations":{"description":"container-nacos无响应","wxurl":"https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=9c26f445-4959-4bd3-8baf-fd1700ff4503"},"externalURL":"http://43b023bc24da:9093","version":"4","groupKey":"{}:{instance=\"container-nacos\"}","truncatedAlerts":0}

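The payload above is the default alert output. Write the template against this JSON and add it in the Dashboard.

The following is one of the recommended templates: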

{{ $var := .externalURL}}{{ range $k,$v:=.alerts }}
{{if eq $v.status "resolved"}}
[Prometheus recovery]({{$v.generatorURL}})
>**[{{$v.labels.alertname}}]({{$var}})**
>Alert level: {{$v.labels.level}}
Receiver: {{$.receiver}}
Start time: {{GetCSTtime $v.startsAt}}
End time: {{GetCSTtime $v.endsAt}}
Faulty host IP: {{$v.labels.instance}}
**{{$v.annotations.description}}**
{{else}}
[Prometheus alert]({{$v.generatorURL}})
>**[{{$v.labels.alertname}}]({{$var}})**
>Alert level: {{$v.labels.level}}
Receiver: {{$.receiver}}
Start time: {{GetCSTtime $v.startsAt}}
End time: {{GetCSTtime $v.endsAt}}
Faulty host IP: {{$v.labels.instance}}
**{{$v.annotations.description}}**
{{end}}
{{ end }}
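For readers less familiar with Go templates, the following sketch mirrors the template's logic in Python and wraps the result in the markdown message shape accepted by the WeCom group-robot webhook (`msgtype` of `markdown` with a `content` string). It is an illustration only: `render_alert` and `wecom_markdown_body` are hypothetical names, the English field labels are illustrative, the timestamps are left in their raw UTC form instead of going through `GetCSTtime`, and whether PrometheusAlert builds exactly this request body is an assumption.

```python
import json

# Values taken from the PrometheusAlert log earlier in this section.
PAYLOAD = {
    "receiver": "PrometheusAlert",
    "externalURL": "http://43b023bc24da:9093",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "nacos状态", "instance": "container-nacos", "level": "3"},
        "annotations": {"description": "container-nacos无响应"},
        "startsAt": "2020-09-15T02:51:51.394923708Z",
        "endsAt": "0001-01-01T00:00:00Z",
        "generatorURL": "http://5bf9f650cb08:9090/graph?g0.expr=up%7Bjob%3D%22nacos%22%7D+%3D%3D+0&g0.tab=1",
    }],
}


def render_alert(alert, receiver, external_url):
    """Python counterpart of one pass through the template's range loop."""
    # Same branch as {{if eq $v.status "resolved"}} ... {{else}} ... {{end}}
    heading = "Prometheus recovery" if alert["status"] == "resolved" else "Prometheus alert"
    labels = alert["labels"]
    lines = [
        f"[{heading}]({alert['generatorURL']})",
        f">**[{labels['alertname']}]({external_url})**",
        f">Alert level: {labels['level']}",
        f"Receiver: {receiver}",
        f"Start time: {alert['startsAt']}",
        f"End time: {alert['endsAt']}",
        f"Faulty host IP: {labels['instance']}",
        f"**{alert['annotations']['description']}**",
    ]
    return "\n".join(lines)


def wecom_markdown_body(payload):
    """Build the JSON body for a WeCom group-robot markdown message."""
    text = "\n\n".join(
        render_alert(alert, payload["receiver"], payload["externalURL"])
        for alert in payload["alerts"]
    )
    return json.dumps({"msgtype": "markdown", "markdown": {"content": text}},
                      ensure_ascii=False)


print(wecom_markdown_body(PAYLOAD))
```

The only difference between the two branches of the template is the heading line, so the sketch picks the heading first and shares the remaining lines.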

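Open the PrometheusAlert web interface at http://192.168.0.150:8080
[Screenshot: PrometheusAlert web interface]
Click template, choose WeCom (企业微信), and run a template test.
[Screenshot: template test page]

Click the template test button. If the alert message is sent successfully, the configuration is correct. Note that template data is stored in db/PrometheusAlertDB.db, so it has to be persisted by adding a volume mount to the compose file:
- ./db:/app/db
Restart PrometheusAlert and repeat the test once so that the template data is persisted locally.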

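5.2. Using the custom template for real alerts

Open the template management page, AlertTemplate, in the PrometheusAlert Dashboard.
[Screenshot: AlertTemplate page]
Copy the WeCom address and fill it in as follows:
http://192.168.0.150:8080/prometheusalert?type=wx&tpl=prometheus-wx&wxurl=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=9c26f445-4959-4bd3-8baf-fd1700ff4503
Then set it as the Alertmanager webhook URL, as shown below: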

global:
  resolve_timeout: 5m
route:
  group_by: ['instance']
  group_wait: 10m
  group_interval: 10s
  repeat_interval: 10m
  receiver: 'PrometheusAlert'
receivers:
- name: 'PrometheusAlert'
  webhook_configs:
  - url: 'http://prometheusalert-center:8080/prometheusalert?type=wx&tpl=prometheus-wx&wxurl=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=9c26f445-4959-4bd3-8baf-fd1700ff4503'
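The webhook URL above nests the robot URL, including its own `?key=...`, inside the `wxurl` query parameter without encoding it. The sketch below uses Python's standard URL parser to check that the three parameters can still be recovered and compares the result with a percent-encoded variant. `build_webhook_url` and `query_params` are hypothetical helpers, and how PrometheusAlert itself parses the query string is not verified here; this only shows how one common parser reads it.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ROBOT_URL = ("https://qyapi.weixin.qq.com/cgi-bin/webhook/send"
             "?key=9c26f445-4959-4bd3-8baf-fd1700ff4503")
BASE = "http://prometheusalert-center:8080/prometheusalert"


def build_webhook_url(base, alert_type, template, robot_url, encode=False):
    """Assemble the webhook URL, optionally percent-encoding the values."""
    params = {"type": alert_type, "tpl": template, "wxurl": robot_url}
    if encode:
        return f"{base}?{urlencode(params)}"
    # Unencoded form, as written in the Alertmanager configuration above.
    return base + "?" + "&".join(f"{key}={value}" for key, value in params.items())


def query_params(url):
    """Return the query parameters of a URL as a plain dict."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


plain = build_webhook_url(BASE, "wx", "prometheus-wx", ROBOT_URL)
encoded = build_webhook_url(BASE, "wx", "prometheus-wx", ROBOT_URL, encode=True)

print(plain)
print(query_params(plain)["wxurl"])
print(query_params(plain) == query_params(encoded))
```

With this parser the unencoded form works because the robot URL contains no `&`. If a nested URL ever carried more than one query parameter, the encoded form would be the safer choice.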

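Restart Alertmanager.

Remove the nacos service.
The alert message is sent.
Start the nacos service.
After 10m the recovery message arrives.
The customized alert message looks like this:
[Screenshot: customized alert message in WeCom]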