Prometheus Documentation Study Notes

1. Introduction

1.1.Overview

Prometheus is an open-source monitoring and alerting system. Since its release in 2012, many companies have adopted it, and the project has a very active developer community.
Prometheus features:

  • a multi-dimensional data model, with time series data identified by metric name and key/value pairs
  • PromQL, a flexible query language to leverage this dimensionality
  • no reliance on distributed storage; single server nodes are autonomous
  • time series collection happens via a pull model over HTTP
  • pushing time series is supported via an intermediary gateway
  • targets are discovered via service discovery or static configuration
  • multiple modes of graphing and dashboarding support

Prometheus components:

  • the main Prometheus server, which scrapes and stores time series data
  • client libraries for instrumenting application code
  • a push gateway for supporting short-lived jobs
  • special-purpose exporters for services like HAProxy, StatsD, Graphite, etc.
  • an Alertmanager to handle alerts
  • various support tools

Most components are written in Go, making them easy to deploy as static binaries.
Prometheus architecture:
The diagram below illustrates the architecture of Prometheus and some of its ecosystem components
[architecture diagram]
Prometheus use cases:
It works well for recording purely numeric time series. It fits both machine-centric monitoring and the monitoring of highly dynamic service-oriented architectures; in the microservices world in particular, its support for multi-dimensional data collection and querying is a real strength.
It is designed for reliability: it lets you quickly diagnose faults, and each server is standalone, not depending on network storage or other remote services. You can keep using it even when parts of the infrastructure are broken, and you do not need to install any extensive supporting infrastructure.
Regarding how it values reliability: you can always view statistics about your system, even under failure conditions. But if you need 100% accuracy, such as accounting for every single request, Prometheus is not a good fit, as the collected data will not be detailed and complete enough.

1.2.First steps with Prometheus

Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.7.2/prometheus-2.7.2.linux-amd64.tar.gz
tar zvxf prometheus-2.7.2.linux-amd64.tar.gz
cd prometheus-2.7.2.linux-amd64
Inside you will find a binary named prometheus (the Prometheus monitoring server); run ./prometheus --help to see its options
Before starting the monitoring server, we need to configure it

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

scrape_interval: how frequently to scrape targets
evaluation_interval: how frequently to evaluate rules
rule_files: the location of rule files for the Prometheus monitoring server to load; we have not set any here
scrape_configs: controls which resources Prometheus monitors
Start Prometheus
./prometheus --config.file=prometheus.yml
It can then be accessed at http://ip:9090
However, when querying, no data was returned. Our first suspicion was a mismatch between the application time and the browser's system time, but the system time and BIOS time had long since been set correctly. Recording the reported error here:

Error on ingesting samples that are too old or are too far into the future

1.3.Glossary

Alert: the output produced when an alerting rule fires, i.e. when a monitored metric crosses a configured threshold
Alertmanager: receives alerts and sends out notifications, e.g. by email
Bridge: a component that takes samples from a client library and exposes them to a non-Prometheus monitoring system
Client library: a library that lets you instrument your own code, or pull metrics from other systems, and expose them to the Prometheus server
Collector: the part of an exporter that represents a set of metrics
Direct instrumentation: instrumentation added inline as part of the source code of a program
Endpoint: a source of metrics that can be scraped
Exporter: a binary that exposes Prometheus metrics, typically by converting metrics that are not in the Prometheus format into the Prometheus format
Instance: a label that uniquely identifies a target within a job
Job: a collection of targets with the same purpose
Notification: represents a group of one or more alerts and is sent by the Alertmanager to email or other receivers
Prometheus: either the core binary of the Prometheus system, or the Prometheus monitoring system as a whole
PromQL: the query language built into Prometheus
Pushgateway: holds the most recently pushed metrics from batch jobs so that Prometheus can scrape them
Remote Read: a Prometheus feature that allows transparently reading time series from other systems as part of queries
Remote Read Adapter: sits between Prometheus and the system being read from, converting time series requests and responses between the two
Remote Read Endpoint: the thing Prometheus talks to when performing a remote read
Remote Write: a Prometheus feature that allows sending ingested samples to other systems
Remote Write Adapter: sits between Prometheus and the system being written to, converting samples into a format the other system understands
Remote Write Endpoint: the thing Prometheus talks to when performing a remote write
Sample: a single value at a point in time in a time series
Silence: in the Alertmanager, a silence prevents alerts whose labels match it from being included in notifications
Target: the definition of an object to scrape

2. Concepts

2.1.Data Model

A metric name specifies the general feature of a system that is measured, e.g. the total number of HTTP requests received
Labels are what give Prometheus its multi-dimensional data model

Samples form the actual time series data; each sample consists of a float64 value and a millisecond-precision timestamp

Time series notation

<metric name>{<label name>=<label value>, ...}  
For example:  api_http_requests_total{method="POST", handler="/messages"}
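As a rough, stdlib-only illustration of this notation (not part of any Prometheus library), the helper below renders a metric name and a label map into the same form; note that Prometheus prints label names in sorted order:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// formatSeries renders a metric name and its labels in the
// <metric name>{<label name>="<label value>", ...} notation.
func formatSeries(name string, labels map[string]string) string {
	if len(labels) == 0 {
		return name
	}
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // label names are printed in sorted order
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return name + "{" + strings.Join(parts, ", ") + "}"
}

func main() {
	fmt.Println(formatSeries("api_http_requests_total",
		map[string]string{"method": "POST", "handler": "/messages"}))
	// → api_http_requests_total{handler="/messages", method="POST"}
}
```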

2.2.Metric Types

The Prometheus client libraries offer four core metric types:
Counter
Gauge
Histogram
Summary
This section involves development knowledge in Go, Python, Ruby, and Java. I am not familiar with these yet, which makes me keenly aware of the urgency of learning a systems programming language
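Even without a client library, the semantics of the two simplest types can be sketched in plain Go: a counter may only go up (resetting to zero on process restart), while a gauge may go up and down; histograms and summaries build on such primitives by tracking observations in buckets or quantiles. The types below are illustrative only, not the client_golang API:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Counter only ever increases (it resets to zero when the process restarts).
type Counter struct{ v uint64 }

func (c *Counter) Inc()          { atomic.AddUint64(&c.v, 1) }
func (c *Counter) Value() uint64 { return atomic.LoadUint64(&c.v) }

// Gauge can go up and down, e.g. current memory usage or temperature.
type Gauge struct{ v int64 }

func (g *Gauge) Add(d int64)  { atomic.AddInt64(&g.v, d) }
func (g *Gauge) Set(x int64)  { atomic.StoreInt64(&g.v, x) }
func (g *Gauge) Value() int64 { return atomic.LoadInt64(&g.v) }

func main() {
	var requests Counter // e.g. total requests served: only increases
	var inflight Gauge   // e.g. requests in flight: rises and falls

	requests.Inc()
	inflight.Add(1)
	inflight.Add(-1) // the request finished

	fmt.Println(requests.Value(), inflight.Value()) // → 1 0
}
```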

2.3.Jobs And Instances

In Prometheus, an endpoint you can scrape is called an instance, usually corresponding to a single process, while a job is a collection of instances with the same purpose, for example
an API server job with four replicated instances:

* job: api-server
     instance 1: 1.2.3.4:5670
     instance 2: 1.2.3.4:5671
     instance 3: 5.6.7.8:5670
     instance 4: 5.6.7.8:5671

When Prometheus scrapes a target, it automatically attaches the following labels to the scraped time series
job: The configured job name that the target belongs to
instance: The <host>:<port> part of the target's URL that was scraped
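A small sketch of how the instance label value relates to the target URL, using only the standard library (the URL is the first example address above with an assumed /metrics path):

```go
package main

import (
	"fmt"
	"net/url"
)

// instanceLabel returns the <host>:<port> part of a target URL,
// which is what Prometheus attaches as the instance label.
func instanceLabel(target string) (string, error) {
	u, err := url.Parse(target)
	if err != nil {
		return "", err
	}
	return u.Host, nil
}

func main() {
	inst, err := instanceLabel("http://1.2.3.4:5670/metrics")
	if err != nil {
		panic(err)
	}
	fmt.Println(inst) // → 1.2.3.4:5670
}
```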

3. Prometheus

3.1.Getting started

This section is a hello-world style guide: we learn how to install, configure, and use Prometheus in a simple way. You download and run Prometheus locally, configure it to scrape your local machine, and then work with the collected time series data through queries, rules, and graphs.
Downloading was covered in the first section, so it is skipped here. We configure Prometheus to monitor itself; while a Prometheus server monitoring itself is not of much practical use, it is a good starting example. Let's configure it now; below is a minimal configuration
cat prometheus.yml

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']

Start Prometheus
./prometheus --config.file=prometheus.yml
Use the expression browser to inspect the scraped data: under Graph, select Console, then enter a metric in the expression field, e.g. prometheus_target_interval_length_seconds, which tracks the actual intervals between scrapes. Press Execute to get the result below
[expression browser screenshot]
The figure above shows the time series we obtained. To count how many time series there are, we can use
count(prometheus_target_interval_length_seconds); here it returns 5, meaning 5 time series were returned
The graphing interface can also be used to visualize metric data, e.g.
rate(prometheus_tsdb_head_chunks_created_total[1m])
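Conceptually, rate() turns a counter into a per-second increase over the given window. The sketch below shows only the core idea; real PromQL additionally extrapolates to the window boundaries and compensates for counter resets:

```go
package main

import "fmt"

// sample is one (timestamp, value) point of a counter series.
type sample struct {
	ts  float64 // seconds
	val float64
}

// simpleRate computes (last - first) / (tLast - tFirst) over a window.
// PromQL's rate() also extrapolates to the window edges and handles
// counter resets; this sketch captures only the core idea.
func simpleRate(window []sample) float64 {
	if len(window) < 2 {
		return 0
	}
	first, last := window[0], window[len(window)-1]
	return (last.val - first.val) / (last.ts - first.ts)
}

func main() {
	// A counter scraped every 15s, growing by 30 per scrape.
	window := []sample{{0, 0}, {15, 30}, {30, 60}, {45, 90}, {60, 120}}
	fmt.Println(simpleRate(window)) // → 2 (per second)
}
```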
Now let's do something more interesting: download the Go client and run three example processes
Before running them, we need to configure the Go paths. We installed Go 1.11, with the paths configured as follows

cat <<EOF>> /etc/profile
export GOROOT=/usr/local/go       # installation path
export GOPATH=/root/go            # workspace path
export PATH=\$PATH:\$GOROOT/bin
export GO111MODULE=off            # no module support: Go finds packages in GOPATH and vendor directories
EOF

source /etc/profile
mkdir -p $GOPATH/src/
cd $GOPATH/src/
git clone https://github.com/prometheus/client_golang.git
git clone https://github.com/linuxwt/golang.org.git
cd client_golang/examples/random
go get -d
go build
# run each in a separate terminal, or background them:
./random -listen-address=:8080 &
./random -listen-address=:8081 &
./random -listen-address=:8082 &
Add the following job definition to the scrape_configs section in your prometheus.yml and restart your Prometheus instance
cat prometheus.yml

scrape_configs:
  - job_name:       'example-random'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'

      - targets: ['localhost:8082']
        labels:
          group: 'canary'

Note that the above is added to the existing configuration
Restart the service
In the expression browser, execute rpc_durations_seconds; this shows all the matching time series. If instead we only want an aggregate average, we can use avg(rate(rpc_durations_seconds_count[5m])) by (job, service)
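The by (job, service) clause keeps only those two labels and averages within each resulting group. A hedged sketch of that aggregation step over already-computed per-series rates (the label values here are made up for illustration):

```go
package main

import "fmt"

// key holds the grouping labels kept by "by (job, service)".
type key struct{ job, service string }

// avgBy averages per-series values within each (job, service) group,
// mimicking avg(...) by (job, service) over precomputed rates.
func avgBy(rates map[key][]float64) map[key]float64 {
	out := make(map[key]float64)
	for k, vals := range rates {
		sum := 0.0
		for _, v := range vals {
			sum += v
		}
		out[k] = sum / float64(len(vals))
	}
	return out
}

func main() {
	// Two groups, each with per-instance rates (illustrative values).
	rates := map[key][]float64{
		{"example-random", "exponential"}: {0.9, 1.1},
		{"example-random", "uniform"}:     {2.0, 4.0},
	}
	fmt.Println(avgBy(rates)[key{"example-random", "uniform"}]) // → 3
}
```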
We can also add a recording rule
cat prometheus.rules.yml

groups:
- name: example
  rules:
  - record: job_service:rpc_durations_seconds_count:avg_rate5m
    expr: avg(rate(rpc_durations_seconds_count[5m])) by (job, service)

The complete prometheus.yml should then look like this
cat prometheus.yml

global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

rule_files:
  - 'prometheus.rules.yml'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9090']
  - job_name:       'example-random'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:8080', 'localhost:8081']
        labels:
          group: 'production'

      - targets: ['localhost:8082']
        labels:
          group: 'canary'

3.2.Installation

There are three installation methods:

  • precompiled binaries (covered earlier)
  • building from source
  • Docker

Here we focus on deploying with Docker
cat docker-compose.yml
prometheus:
    restart: always
    image: prom/prometheus
    container_name: prometheus
    volumes:
        - /etc/localtime:/etc/localtime
        - /etc/timezone:/etc/timezone
        - $PWD/prometheus.yml:/etc/prometheus/prometheus.yml
    privileged: true
    ports:
        - 9090:9090

cat prometheus.yml

global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

Run docker-compose up -d to start the container; Prometheus can then be accessed

3.3.Configuration

cat prometheus.yml

global:
  # How frequently to scrape targets by default.
  [ scrape_interval: <duration> | default = 1m ]

  # How long until a scrape request times out.
  [ scrape_timeout: <duration> | default = 10s ]

  # How frequently to evaluate rules.
  [ evaluation_interval: <duration> | default = 1m ]

  # The labels to add to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    [ <labelname>: <labelvalue> ... ]

# Rule files specifies a list of globs. Rules and alerts are read from
# all matching files.
rule_files:
  [ - <filepath_glob> ... ]

# A list of scrape configurations.
scrape_configs:
  [ - <scrape_config> ... ]

# Alerting specifies settings related to the Alertmanager.
alerting:
  alert_relabel_configs:
    [ - <relabel_config> ... ]
  alertmanagers:
    [ - <alertmanager_config> ... ]

# Settings related to the remote write feature.
remote_write:
  [ - <remote_write> ... ]

# Settings related to the remote read feature.
remote_read:
  [ - <remote_read> ... ]

Refer to the documentation for more detailed configuration options

3.3.1.Recording Rules

Prometheus supports two kinds of rules: recording rules and alerting rules. Rule files can be checked for correctness with the following tool
go get github.com/prometheus/prometheus/cmd/promtool
promtool check rules /path/to/example.rules.yml

3.3.2.Alerting rules

3.3.3.Template examples

3.3.4.Template reference

3.3.5.Unit testing for rules

3.4.Querying

3.5.Storage

3.6.Federation

3.7.Migration

3.8.API Stability

4. Visualization

This chapter is mainly about displaying Prometheus data with Grafana, so Grafana needs to be installed first. There are several ways to install it; here we choose Docker, following the simple deployment from my GitHub repo
git clone https://github.com/linuxwt/grafana.git
After starting the container, visit http://ip:3000; the first login forces a password change
Add Prometheus as a data source

  • Click on the Grafana logo to open the sidebar menu
  • Click on "Data Sources" in the sidebar
  • Click on "Add New"
  • Select "Prometheus" as the type
  • Set the appropriate Prometheus server URL (note: with a Docker deployment, use the host machine's IP plus the port)
  • save

[data source configuration screenshot]

Create graphs