Neatly summarizing test results with Grafana annotations and Prometheus

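Hello, I'm @haeena.
This article is day 14 of the NTT Communications Advent Calendar 2017.

Grafana 4.6, released in October 2017, made it possible to store annotations in Grafana itself. This time we use Grafana's annotation feature to mass-produce graphs of test results, and we also export PNG images and dashboard snapshots.

In this article we add annotations through the API; if you add them directly from the dashboard instead, it looks like the figure above.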

Motivation

As long as the graphs survive as the result, that's good enough.
If anything, what matters is that the test procedure is reproducible.

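Architecture

This time we try the configuration shown above.

We prepare a target server and a measurement server (this time, the Mac I have on hand) and run some tests on the target server. For metric collection (Grafana's data source) we use Prometheus/node_exporter. Prometheus is so easy to set up that it is tempting to use it for small measurements as well, but it quietly deletes old data. Prometheus also has no feature for recording when an event occurred, so we complement it with Grafana annotations plus snapshot/export.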

The software versions used are as follows:
Grafana 4.6.2
Prometheus 2.0
node_exporter 0.15.2

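The laptop runs macOS High Sierra with Docker for Mac and docker-compose installed. The target server runs Ubuntu 16.04 (x86_64).

Setup

Measurement server setup

Bringing up Grafana/Prometheus

Bring up Grafana and Prometheus quickly with Docker Compose, using the three files below: docker-compose.yml, grafana.env, and prometheus.yml.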

version: '3'
services:
  grafana:
    image: grafana/grafana
    container_name: grafana
    ports:
      - 3000:3000
    env_file:
      - grafana.env
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - 9090:9090

GF_PATHS_DATA=/var/lib/grafana/data
GF_SECURITY_ADMIN_PASSWORD=secret
GF_SERVER_ROOT_URL=http://localhost:3000

global:
  scrape_interval:     5s
  evaluation_interval: 5s
  external_labels:

rule_files:

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets:
        - 'localhost:9090'
  - job_name: 'node'
    static_configs:
      - targets:
        - 'TARGET_SERVER_IP:9100'

Set TARGET_SERVER_IP to the IP address of the server to be measured.

> pip install docker-compose
> docker-compose up -d

Obtaining a Grafana API token and choosing tags

> curl -X POST -H "Content-Type: application/json" -d '{"name":"apikey", "role": "Admin"}' http://admin:secret@localhost:3000/api/auth/keys
{"name":"apikey","key":"API_TOKEN"}

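Issue an API token with the Admin role and record it in a file (.env) for later use.
We also define the other environment variables used later in the same file.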

GRAFANA_API_TOKEN=API_TOKEN
GRAFANA_BASE_URL=http://localhost:3000
TEST_NAME=nttcomadvent2017
TEST_TAG=test_tag
SERIES_TAG=series_tag
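
Typing the token into the file by hand is fine for a one-off. As a minimal sketch, the .env contents above could also be generated from the JSON body returned by /api/auth/keys. The function name `render_env` and its parameters are hypothetical; only the variable names and the response shape come from this article.

```python
import json


def render_env(key_response, base_url, test_name, test_tag, series_tag):
    """Build .env contents from the JSON body returned by POST /api/auth/keys."""
    # The response shown above carries the token in its "key" field.
    token = json.loads(key_response)["key"]
    pairs = [
        ("GRAFANA_API_TOKEN", token),
        ("GRAFANA_BASE_URL", base_url),
        ("TEST_NAME", test_name),
        ("TEST_TAG", test_tag),
        ("SERIES_TAG", series_tag),
    ]
    return "".join("{}={}\n".format(key, value) for key, value in pairs)


if __name__ == "__main__":
    sample = '{"name":"apikey","key":"API_TOKEN"}'
    print(render_env(sample, "http://localhost:3000",
                     "nttcomadvent2017", "test_tag", "series_tag"), end="")
```

Redirecting the printed text into .env gives the same file as above; values containing spaces or quotes would need extra escaping that this sketch does not attempt.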

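Dashboard setup (associating tags)

First, set up dashboards to your liking, and tag each dashboard you want captured with the value of TEST_TAG, since the capture script later looks dashboards up by that tag.
For now, since we are using node_exporter, we [import and use Node Exporter Server Metrics](https://grafana.com/dashboards/405), the most downloaded dashboard for it.

After a quick look at the data I was a little unsatisfied, so I also imported Node Exporter Full.

Of course, the setup above can also be done through the API.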
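
As a rough sketch of the API route, the snippet below adds a tag to a dashboard model and prepares the request that would save it back. It assumes the dashboard JSON was already fetched (for example from /api/dashboards/db/<slug>) and that POST /api/dashboards/db with "overwrite" is the save endpoint for this Grafana version; the helper names `with_tag` and `build_save_request` are made up for illustration.

```python
import copy
import json
import urllib.request


def with_tag(dashboard, tag):
    """Return a copy of a dashboard model with `tag` added once to its tags."""
    updated = copy.deepcopy(dashboard)
    tags = updated.setdefault("tags", [])
    if tag not in tags:
        tags.append(tag)
    return updated


def build_save_request(base_url, token, dashboard):
    """Prepare, but do not send, a POST that overwrites the stored dashboard."""
    body = json.dumps({"dashboard": dashboard, "overwrite": True}).encode("utf-8")
    return urllib.request.Request(
        "{}/api/dashboards/db".format(base_url),
        data=body,
        headers={
            "Authorization": "Bearer {}".format(token),
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the prepared request to `urllib.request.urlopen` would send it. Keeping the tagging step a pure function makes it easy to check before touching a live Grafana.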

Target server setup

Downloading and starting node_exporter

This is not the main topic, so for now we simply launch it with nohup.

curl -L -O https://github.com/prometheus/node_exporter/releases/download/v0.15.2/node_exporter-0.15.2.linux-amd64.tar.gz
tar zxvf node_exporter-0.15.2.linux-amd64.tar.gz
cd node_exporter-0.15.2.linux-amd64
nohup ./node_exporter &

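Bringing over the environment variable file and placing the test script

Copy the variable definitions made on the measurement server to the target server. When doing so, replace the hostname in GRAFANA_BASE_URL with the measurement server's IP address.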

GRAFANA_API_TOKEN=API_TOKEN
GRAFANA_BASE_URL=http://METRICS_SERVER_IP:3000
TEST_NAME=nttcomadvent2017
TEST_TAG=test_tag
SERIES_TAG=series_tag
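
The hostname swap is a one-line edit by hand. If the file is copied to several target servers, it could be scripted along these lines. This is only a sketch: `with_host` and `rewrite_env` are hypothetical helpers, and any user:password part of the URL is dropped.

```python
from urllib.parse import urlsplit, urlunsplit


def with_host(url, host):
    """Replace the hostname of `url` with `host`, keeping scheme, port and path."""
    parts = urlsplit(url)
    netloc = host if parts.port is None else "{}:{}".format(host, parts.port)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def rewrite_env(env_text, host):
    """Rewrite only the GRAFANA_BASE_URL line of .env-style text."""
    out = []
    for line in env_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key == "GRAFANA_BASE_URL":
            line = "{}={}".format(key, with_host(value, host))
        out.append(line)
    return "\n".join(out) + "\n"
```

For example, `rewrite_env(open(".env").read(), "METRICS_SERVER_IP")` would return the text of the file shown above.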

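Place the test script. Anything can be tested; this time we run fio.
The whole period from the start to the end of the test is annotated with TEST_TAG, and each fio run is annotated with SERIES_TAG after it finishes.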

#!/bin/bash
set -eu

. .env

## Create an annotation through the Grafana Web API
annotation_create() {
  local api_ep=${GRAFANA_BASE_URL}/api/annotations
  local method=POST

  local time_from=$1
  local time_to=$2
  local tag=$3
  local description=$4

  # time and timeEnd are sent as JSON numbers (epoch milliseconds),
  # tag and description as JSON strings
  curl -v -X ${method} -H "Authorization: Bearer ${GRAFANA_API_TOKEN}" -H "Content-Type: application/json" ${api_ep} \
  -d '
  {
    "time": '"${time_from}"',
    "isRegion": true,
    "timeEnd": '"${time_to}"',
    "tags": ["'"${tag}"'"],
    "text": "'"${description}"'"
  }
  '
}

## Annotate the whole test period with `TEST_TAG`
## During the test, create an annotation with `SERIES_TAG` at each convenient break point
## (named run_test so that it does not shadow the shell builtin `test`)
run_test() {
  test_time_from="$(date +%s%3N)"
  sleep 30

  for i in {1..6}; do
    jobs=$((2**$i))
    fio_time_from="$(date +%s%3N)"
    fio -filename=/tmp/test2g -direct=1 -bs=4k -size=2G -numjobs=${jobs} -runtime=300 -name=test$i
    fio_time_to="$(date +%s%3N)"
    annotation_create "$fio_time_from" "$fio_time_to" "$SERIES_TAG" "fio,-direct=1,-bs=4k,-size=2G,-numjobs=${jobs},-runtime=300,-test=$i"
  done
  sleep 30
  test_time_to="$(date +%s%3N)"
  annotation_create "$test_time_from" "$test_time_to" "$TEST_TAG" ""
}

run_test
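
Assembling JSON with shell quoting is easy to get wrong, and a description containing a double quote would break the body. For comparison, here is a small sketch of the same region annotation body built with Python's json module. The field names are the ones used in the script above; `now_ms` and `region_annotation` are hypothetical helper names.

```python
import json
import time


def now_ms():
    """Current Unix time in milliseconds, comparable to `date +%s%3N`."""
    return int(time.time() * 1000)


def region_annotation(time_from_ms, time_to_ms, tag, text=""):
    """Return the JSON body for a region annotation spanning the given times."""
    if time_to_ms < time_from_ms:
        raise ValueError("time_to_ms must not precede time_from_ms")
    return json.dumps({
        "time": time_from_ms,      # start of the region, epoch milliseconds
        "isRegion": True,
        "timeEnd": time_to_ms,     # end of the region, epoch milliseconds
        "tags": [tag],
        "text": text,              # json.dumps escapes quotes and newlines
    })
```

The returned string could be sent with any HTTP client as the body of a POST to /api/annotations, with the same Authorization header as in the shell script.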

Now let's run the test script.

> bash test_runner.sh
(snip)

*   Trying ::1...
* Connected to TARGET_SERVER_IP port 3000 (#0)
> POST /api/annotations HTTP/1.1
> Host: TARGET_SERVER_IP:3000
> User-Agent: curl/7.47.0
> Accept: */*
> Authorization: Bearer API_TOKEN
> Content-Type: application/json
> Content-Length: 143
>
* upload completely sent off: 143 out of 143 bytes
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Thu, 14 Dec 2017 15:02:01 GMT
< Content-Length: 30
<
* Connection #0 to host localhost left intact
{"message":"Annotation added"}

The annotations appear to have been added without any trouble.

Checking the annotations

The added annotations look like this.

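Capture

We capture PNG images and, at the same time, leave snapshots in Grafana.

Let's grab both a screenshot of the whole dashboard and an individual screenshot of each panel.
The PNG images should be handy for pasting into slides.
Incidentally, Grafana reportedly uses PhantomJS to generate the images.

A snapshot, on the other hand, extracts the data needed to draw the dashboard from the data source and stores it in Grafana's database, so nothing is lost even if the data disappears on the Prometheus side. Unlike a saved image, it also keeps the benefit of Grafana's interactive UI.

For now I prepared the following script.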

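What it does is simple:
1) Find the dashboards tagged with TEST_TAG
2) Find the annotations (time from, time to) tagged with TEST_TAG
3) For every combination of dashboard and annotation (time range) found, save a dashboard image and a snapshot, and generate an image for each panel while at it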
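
Step 2 is the least obvious part. The script below assumes that a region annotation comes back from the API as two entries sharing one regionId, one for the start and one for the end. A stand-alone sketch of that pairing, with made-up sample data, might look like this. Treating a regionId of 0 or a missing regionId as a point annotation is an assumption of this sketch.

```python
from collections import defaultdict
from datetime import datetime, timezone


def iso_utc(unix_ms):
    """Epoch milliseconds -> ISO 8601 UTC string with second precision."""
    moment = datetime.fromtimestamp(unix_ms // 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%S.000Z")


def pair_regions(annotations):
    """Group annotations by regionId and return {"start_end": (start_ms, end_ms)}."""
    times = defaultdict(list)
    for annotation in annotations:
        region_id = annotation.get("regionId")
        if region_id:  # assumption: 0 or missing means "not part of a region"
            times[region_id].append(annotation["time"])

    regions = {}
    for stamps in times.values():
        if len(stamps) == 2:  # keep only complete start/end pairs
            start, end = sorted(stamps)
            regions["{}_{}".format(iso_utc(start), iso_utc(end))] = (start, end)
    return regions


if __name__ == "__main__":
    sample = [
        {"regionId": 7, "time": 1513264978000},  # end of a region
        {"regionId": 7, "time": 1513264428000},  # start of the same region
        {"regionId": 0, "time": 1513264500000},  # a point annotation
    ]
    print(pair_regions(sample))
```

The resulting key doubles as a file name suffix, which is how the capture script names its images and snapshots.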

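I wondered whether there was a good library for calling the Grafana API, but I could not find one, so I wrote everything by hand with requests and it ended up a bit long. (Maybe I should put it on gist or somewhere similar?)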

import os
import re
import datetime
import requests
import shutil

GRAFANA_API_TOKEN = os.environ.get("GRAFANA_API_TOKEN")
GRAFANA_BASE_URL = os.environ.get("GRAFANA_BASE_URL") or "http://localhost:3000"

TEST_NAME = os.environ.get("TEST_NAME") or "TEST"
TEST_TAG = os.environ.get("TEST_TAG") or "tag"

CAPTURE_DIR = os.environ.get("CAPTURE_DIR") or ""

def get_annotations(time_from=None, time_to=None, alertId=None, dashboardId=None, panelId=None, tags=None, limit=None):
    api_ep = "{}/api/annotations".format(GRAFANA_BASE_URL)
    method = "GET"
    headers = {
        "Authorization": "Bearer {}".format(GRAFANA_API_TOKEN),
        "Content-Type": "application/json"
    }

    params = {}
    # the annotations API takes the time range as `from` / `to` (epoch milliseconds)
    if time_from:
        params["from"] = time_from
    if time_to:
        params["to"] = time_to
    if alertId:
        params["alertId"] = alertId
    if dashboardId:
        params["dashboardId"] = dashboardId
    if panelId:
        params["panelId"] = panelId
    if tags:
        params["tags"] = tags
    if limit:
        params["limit"] = limit

    response = requests.request(
        method,
        api_ep,
        params=params,
        headers=headers)

    return response.json()

def get_dashboard(slug):
    api_ep = "{}/api/dashboards/db/{}".format(GRAFANA_BASE_URL, slug)
    method = "GET"
    headers = {
        "Authorization": "Bearer {}".format(GRAFANA_API_TOKEN),
        "Content-Type": "application/json"
    }

    response = requests.request(
        method,
        api_ep,
        headers=headers)

    return response.json()

def search_dashboards(query=None, tag=None, starred=None, tagcloud=None):
    api_ep = "{}/api/search".format(GRAFANA_BASE_URL)
    method = "GET"
    headers = {
        "Authorization": "Bearer {}".format(GRAFANA_API_TOKEN),
        "Content-Type": "application/json"
    }

    params = {}
    if query:
        params["query"] = query
    if tag:
        params["tag"] = tag
    if starred:
        params["starred"] = starred
    if tagcloud:
        params["tagcloud"] = tagcloud

    response = requests.request(
        method,
        api_ep,
        params=params,
        headers=headers)

    return response.json()

def create_snapshot(dashboard, name=None, expire=None, external=None, key=None, deleteKey=None, time_from=None, time_to=None):
    api_ep = "{}/api/snapshots".format(GRAFANA_BASE_URL)
    method = "POST"
    headers = {
        "Authorization": "Bearer {}".format(GRAFANA_API_TOKEN),
        "Content-Type": "application/json"
    }

    dashboard = dashboard["dashboard"] if "dashboard" in dashboard else dashboard

    # tolerate dashboards that have no "time" section yet
    if time_from:
        dashboard.setdefault("time", {})["from"] = timestr_from_unix_ms(time_from)
    if time_to:
        dashboard.setdefault("time", {})["to"] = timestr_from_unix_ms(time_to)

    post_json = {
        "dashboard": dashboard
    }
    if name:
        post_json["name"] = name
    if expire:
        post_json["expire"] = expire
    if external:
        post_json["external"] = external
    if key:
        post_json["key"] = key
    if deleteKey:
        post_json["deleteKey"] = deleteKey

    response = requests.request(
        method,
        api_ep,
        json=post_json,
        headers=headers)

    return response.json()

def save_rendered_dashboard_to_file(slug, filename, vars=None, panelId=None, width=None, height=None, tz=None, timeout=None, time_from=None, time_to=None):
    # render the whole dashboard, or a single panel when panelId is given, and save it as PNG
    api_ep = "{}/render/dashboard/db/{}".format(GRAFANA_BASE_URL, slug)
    method = "GET"
    headers = {
        "Authorization": "Bearer {}".format(GRAFANA_API_TOKEN),
        "Content-Type": "application/json"
    }

    params = {}
    if panelId:
        params["panelId"] = panelId
        api_ep = api_ep.replace("/render/dashboard", "/render/dashboard-solo")
    if width:
        params["width"] = width
    if height:
        params["height"] = height
    if tz:
        params["tz"] = tz
    if timeout:
        params["timeout"] = timeout
    if time_from:
        params["from"] = time_from
    if time_to:
        params["to"] = time_to
    if vars:
        for var, value in vars.items():
            var_name = "var-{}".format(var)
            params[var_name] = value

    path = os.path.join(CAPTURE_DIR, filename)

    response = requests.request(
        method,
        api_ep,
        params=params,
        headers=headers,
        stream=True)
    if response.status_code == 200:
        with open(path, 'wb') as f:
            response.raw.decode_content = True
            shutil.copyfileobj(response.raw, f)
    else:
        # report failed renders instead of skipping them silently
        print("render failed for {}: HTTP {}".format(filename, response.status_code))
    return

def timestr_from_unix_ms(unix_ms):
    # epoch milliseconds -> ISO 8601 UTC string with second precision
    return datetime.datetime.fromtimestamp(int(unix_ms / 1000), tz=datetime.timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")

def extract_panels_from_dashboard(dashboard):
    dashboard = dashboard["dashboard"] if "dashboard" in dashboard else dashboard

    panels = []
    for row_id, row in enumerate(dashboard["rows"]):
        # row height is either a number or a string such as "250px"
        height = int(re.match(r"\d*", row["height"]).group(0)) if isinstance(row["height"], str) else row["height"]
        for panel in row["panels"]:
            panel_id = panel["id"]
            title = panel["title"]
            panels.append({"row_id": row_id, "panel_id": panel_id, "title": title, "height": height})

    return panels

def main():
    test_tag = TEST_TAG

    # retrieve the list of dashboard info matching the test tag
    dashboards_info = search_dashboards(tag=test_tag)

    # create a dict of dashboard json keyed by slug
    dashboards = {}
    for dashboard_info in dashboards_info:
        slug = os.path.basename(dashboard_info["uri"])
        dashboards[slug] = get_dashboard(slug)

    # search the annotations matching the test tag
    annotations = get_annotations(tags=[test_tag])

    # create time ranges by pairing the annotations that share a regionId
    time_regions = {}
    for regionId in set(a["regionId"] for a in annotations):
        time_pair = sorted(a["time"] for a in annotations if a["regionId"] == regionId)
        if len(time_pair) != 2:
            # not a start/end pair (for example point annotations), so skip it
            continue
        region_str = "{0[0]}_{0[1]}".format(tuple(map(timestr_from_unix_ms, time_pair)))
        time_regions[region_str] = time_pair

    # for all dashboards and all time regions matching the test tag
    for slug, dashboard in dashboards.items():
        for region_str, v in time_regions.items():
            snapshot_name = "{}_{}_{}".format(TEST_NAME, slug, region_str)
            capture_name = snapshot_name + ".png"
            time_from = v[0]
            time_to = v[1]

            # create snapshot w/ name
            create_snapshot(dashboard, name=snapshot_name, time_from=time_from, time_to=time_to)

            # capture whole dashboard
            save_rendered_dashboard_to_file(slug, capture_name, timeout=3000, time_from=time_from, time_to=time_to)

            # capture each panel of the dashboard
            panels = extract_panels_from_dashboard(dashboard)
            for panel in panels:
                row_id = panel["row_id"]
                panel_id = panel["panel_id"]
                height = panel["height"]
                panel_capture_name = "{}_{}_{}_{}_{}.png".format(TEST_NAME, slug, row_id, panel_id, region_str)
                save_rendered_dashboard_to_file(slug, panel_capture_name, panelId=panel_id, height=height, timeout=60, time_from=time_from, time_to=time_to)

if __name__ == "__main__":
    main()

Running it

> set -a; . .env; set +a   # export the variables so that the python process can read them
> pip install requests
> python capture.py

...

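Depending on how much dashboard data is exported, generating the captures can take a while.
PhantomJS occasionally misbehaves; when that happens, it is best to restart Grafana as a whole.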
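
Short of restarting Grafana, transient render failures can sometimes be ridden out by simply trying again. The following is a generic retry sketch, not part of the original script: `retry` is a hypothetical helper, and it assumes the wrapped render call raises an exception when it fails.

```python
import time


def retry(action, attempts=3, delay_seconds=5.0, sleep=time.sleep):
    """Call `action` until it succeeds, waiting a little longer after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # broad on purpose: any failure triggers a retry
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds * attempt)  # linear backoff: 1x, 2x, ...
    raise last_error
```

A render call would be wrapped as `retry(lambda: render_panel(...))`, where `render_panel` stands for whatever function performs the request and raises on a non-200 response. If every attempt fails, the last error is re-raised, and that is the point at which restarting Grafana becomes the next step.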

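Results

Images

Because an image is rendered for every panel, the output gets quite spectacular when this is applied to a dashboard with many panels.
Picking out a single panel, the result looks like the following.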

nttcomadvent2017_node-exporter-full_8_8_2017-12-14T15:13:48.000Z_2017-12-14T15:22:58.000Z.png

Overall, a lot of images were produced.

> ls *.png
...
nttcomadvent2017_node-exporter-full_9_33_2017-12-14T15:13:48.000Z_2017-12-14T15:22:58.000Z.png
nttcomadvent2017_node-exporter-full_9_34_2017-12-14T15:09:13.000Z_2017-12-14T15:11:34.000Z.png
nttcomadvent2017_node-exporter-full_9_34_2017-12-14T15:13:48.000Z_2017-12-14T15:22:58.000Z.png
nttcomadvent2017_node-exporter-full_9_35_2017-12-14T15:09:13.000Z_2017-12-14T15:11:34.000Z.png
nttcomadvent2017_node-exporter-full_9_35_2017-12-14T15:13:48.000Z_2017-12-14T15:22:58.000Z.png
nttcomadvent2017_node-exporter-full_9_36_2017-12-14T15:09:13.000Z_2017-12-14T15:11:34.000Z.png
nttcomadvent2017_node-exporter-full_9_36_2017-12-14T15:13:48.000Z_2017-12-14T15:22:58.000Z.png
nttcomadvent2017_node-exporter-full_9_37_2017-12-14T15:09:13.000Z_2017-12-14T15:11:34.000Z.png
nttcomadvent2017_node-exporter-full_9_37_2017-12-14T15:13:48.000Z_2017-12-14T15:22:58.000Z.png
nttcomadvent2017_node-exporter-full_9_66_2017-12-14T15:09:13.000Z_2017-12-14T15:11:34.000Z.png
nttcomadvent2017_node-exporter-full_9_66_2017-12-14T15:13:48.000Z_2017-12-14T15:22:58.000Z.png
nttcomadvent2017_node-exporter-full_9_9_2017-12-14T15:09:13.000Z_2017-12-14T15:11:34.000Z.png
nttcomadvent2017_node-exporter-full_9_9_2017-12-14T15:13:48.000Z_2017-12-14T15:22:58.000Z.png
nttcomadvent2017_node-exporter-server-metrics_0_11_2017-12-14T15:09:13.000Z_2017-12-14T15:11:34.000Z.png
nttcomadvent2017_node-exporter-server-metrics_0_11_2017-12-14T15:13:48.000Z_2017-12-14T15:22:58.000Z.png
nttcomadvent2017_node-exporter-server-metrics_10_12_2017-12-14T15:09:13.000Z_2017-12-14T15:11:34.000Z.png
nttcomadvent2017_node-exporter-server-metrics_10_12_2017-12-14T15:13:48.000Z_2017-12-14T15:22:58.000Z.png
nttcomadvent2017_node-exporter-server-metrics_11_21_2017-12-14T15:09:13.000Z_2017-12-14T15:11:34.000Z.png
nttcomadvent2017_node-exporter-server-metrics_11_21_2017-12-14T15:13:48.000Z_2017-12-14T15:22:58.000Z.png
nttcomadvent2017_node-exporter-server-metrics_12_23_2017-12-14T15:09:13.000Z_2017-12-14T15:11:34.000Z.png
...
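
With this many files, it helps to sort them by time range before pasting them into slides. Below is a sketch that parses the file names produced by the capture script and groups them by region. The regular expression is inferred from the names listed above, and `parse_capture_name` and `group_by_region` are hypothetical helpers. A test name or slug that itself ends in `_<digits>_<digits>` would be misread as a panel capture.

```python
import re
from collections import defaultdict

_STAMP = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z"
_CAPTURE = re.compile(
    r"^(?P<prefix>.+?)(?:_(?P<row>\d+)_(?P<panel>\d+))?"
    r"_(?P<start>" + _STAMP + r")_(?P<end>" + _STAMP + r")\.png$"
)


def parse_capture_name(filename):
    """Split a capture file name into prefix, row, panel, start and end."""
    match = _CAPTURE.match(filename)
    if match is None:
        return None
    info = match.groupdict()
    for key in ("row", "panel"):
        # whole-dashboard captures have no row/panel part
        if info[key] is not None:
            info[key] = int(info[key])
    return info


def group_by_region(filenames):
    """Map (start, end) to the list of capture files covering that time range."""
    groups = defaultdict(list)
    for name in filenames:
        info = parse_capture_name(name)
        if info is not None:
            groups[(info["start"], info["end"])].append(name)
    return dict(groups)
```

Feeding it `os.listdir(".")` would give one list of images per annotated time range, with non-matching files ignored.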

nttcomadvent2017_node-exporter-server-metrics_2017-12-14T15:13:48.000Z_2017-12-14T15:22:58.000Z.png

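Unfortunately, the annotation text is not shown in the exported images.

Snapshots

In closing

Grafana makes generating images so easy that I got a little carried away; I did not expect nearly 370 images from the node_exporter dashboards alone.

So maybe this did not go all that well.
Maybe the results are not neatly summarized after all.

In any case, deployment at least is easy, so Grafana + Prometheus seems handy when you want to capture the results of a small test.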

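Grafana can now record event times (an instant or a period), descriptions, and so on in its own database, and annotations can also be added from the Grafana GUI. Before 4.6, it only supported importing and displaying annotations from external data sources.
Annotations can be added from the GUI with Ctrl or Cmd + click on a panel.
Think of pasting graphs into PowerPoint and adding notes to them for people who would not understand by looking at Grafana directly.
Unless the metrics are stored in an external TSDB or the retention is extended with --storage.tsdb.retention, they disappear after 15 days by default.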