Trying out Prometheus

2 years ago

雅, 悟

3 minutes

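Speaking of Prometheus, I realized I had only ever used it inside cumbersome Kubernetes/OpenShift setups, so with some free time over the Golden Week (GW) holidays I decided to try a minimal configuration. These are my notes.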

Environment

I used Amazon Linux 2 as the OS on the cheapest AWS Lightsail instance ($3.50).

Installing Prometheus

I started by skimming "First steps". Prometheus is distributed as a tgz archive, so I used that.

$ wget https://github.com/prometheus/prometheus/releases/download/v2.35.0/prometheus-2.35.0.linux-amd64.tar.gz
$ tar xvfz prometheus-*.tar.gz
$ cd prometheus-*/

Then start Prometheus with the default configuration file for now.

$ ./prometheus --config.file=prometheus.yml

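Go-style log messages scroll by, ending with "Server is ready to receive web requests."
The Web UI is reachable on 9090/tcp, so open that port in the AWS Lightsail firewall to access it. There is no authentication, which is a little scary.

Following "First steps", I ran a few simple queries and displayed a graph. Nothing particularly interesting yet.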
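The same kind of query can also be issued without the Web UI, through the Prometheus HTTP API endpoint /api/v1/query. The following is a minimal sketch, assuming the server is reachable at http://localhost:9090; the function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, expr):
    """Return the instant-query URL for a PromQL expression."""
    query = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def parse_vector(payload):
    """Turn an instant-vector API response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError("query failed: {}".format(payload.get("error")))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError("expected a vector, got {}".format(data["resultType"]))
    # Each sample's "value" is [unix_timestamp, "value as a string"].
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def instant_query(base_url, expr, timeout=5):
    """Run the query against a live server (assumed to be reachable)."""
    url = build_query_url(base_url, expr)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_vector(json.load(resp))
```

On the instance itself, something like instant_query("http://localhost:9090", "up") should return one entry per scrape target.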

Stop Prometheus with Ctrl+C, then set it up so that it is started by systemd.

$ cd ~
$ sudo mkdir /prometheus
$ sudo cp -R prometheus-2.35.0.linux-amd64/* /prometheus/
$ cat > prometheus.service << EOF
[Unit]
Description=prometheus
After=network.target

[Service]
Type=simple
ExecStart=/prometheus/prometheus --config.file=/prometheus/prometheus.yml
Restart=always

[Install]
WantedBy=multi-user.target
EOF
$ sudo cp prometheus.service /etc/systemd/system/
$ sudo systemctl daemon-reload
$ sudo systemctl start prometheus
$ sudo systemctl enable prometheus

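Collecting metrics from the local server

First, let's collect the CPU and memory usage of the local server.
Install node_exporter. It is a bit of a hassle, but I run it under systemd from the start.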
https://prometheus.io/docs/guides/node-exporter/

$ wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
$ tar xvfz node_exporter-*.*-amd64.tar.gz
$ sudo cp node_exporter-*.*-amd64/node_exporter /prometheus/
$ cat > node_exporter.service << EOF
[Unit]
Description=node_exporter
After=network.target

[Service]
Type=simple
ExecStart=/prometheus/node_exporter
Restart=always

[Install]
WantedBy=multi-user.target
EOF
$ sudo cp node_exporter.service /etc/systemd/system/
$ sudo systemctl daemon-reload
$ sudo systemctl start node_exporter
$ sudo systemctl enable node_exporter

Once node_exporter is running, configure Prometheus to scrape it.
Edit the Prometheus configuration file as follows, adding the node_exporter on localhost (port 9100) to static_configs.

(excerpt)
...
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]
      - targets: ['localhost:9100']

After changing the file, restart Prometheus.

$ sudo systemctl restart prometheus

After Prometheus restarts, check in the Prometheus UI that the node_exporter metrics are visible.
Try a query such as "node_filesystem_avail_bytes".

It looks like the metrics show up.
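To see what Prometheus is actually scraping, the text that node_exporter serves on port 9100 can be read directly. This is a rough sketch that picks one metric out of that text; it assumes the samples carry no trailing timestamp, and the helper names are hypothetical.

```python
import urllib.request


def samples_for(text, metric_name):
    """Return (series, value) pairs for one metric from exposition text."""
    found = []
    for line in text.splitlines():
        # Skip blank lines and the '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: no timestamp, so the value is the last field.
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        if name == metric_name:
            found.append((series, float(value)))
    return found


def fetch_metrics(url="http://localhost:9100/metrics", timeout=5):
    """Download the raw metrics page (assumes node_exporter is running)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, samples_for(fetch_metrics(), "node_filesystem_avail_bytes") lists one entry per mounted filesystem.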

Firing an alert

While I'm at it, let's send an alert based on disk usage.
Things get a little harder from here.

Install Alertmanager.
https://prometheus.io/docs/alerting/latest/configuration/

$ wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
$ tar xvzf alertmanager-0.24.0.linux-amd64.tar.gz
$ sudo cp alertmanager-*-amd64/alertmanager /prometheus/
$ sudo cp alertmanager-*-amd64/alertmanager.yml /prometheus/
$ cat > alertmanager.service << EOF
[Unit]
Description=alertmanager
After=network.target

[Service]
Type=simple
ExecStart=/prometheus/alertmanager --config.file=/prometheus/alertmanager.yml
Restart=always

[Install]
WantedBy=multi-user.target
EOF
$ sudo cp alertmanager.service /etc/systemd/system/
$ sudo systemctl daemon-reload
$ sudo systemctl start alertmanager
$ sudo systemctl enable alertmanager

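First, set up an alert rule in Prometheus.
Create the following rule file; the Prometheus configuration refers to it as alertrules-fs.yml.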

groups:
- name: alertrules-fs
  rules:
  - alert: HighDiskUsage-root
    expr: node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.2
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: High Disk Usage ("/")

Enable Alertmanager (target localhost:9093) and the alert rule in Prometheus.

(excerpt)
...
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
      - targets:
        - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "alertrules-fs.yml"
...

After editing the file, restart Prometheus.

$ sudo systemctl restart prometheus

After Prometheus restarts, the UI shows that the alert has been added.

Whether Alertmanager has been registered can be checked under "Status" > "Runtime & Build Information".
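The same check can be scripted against the Prometheus HTTP API, which lists the Alertmanagers it has discovered under /api/v1/alertmanagers. A small sketch, assuming Prometheus listens on http://localhost:9090; the function names are only illustrative.

```python
import json
import urllib.request


def active_alertmanager_urls(payload):
    """Extract the URLs of active Alertmanagers from the API response."""
    if payload.get("status") != "success":
        raise ValueError("request failed: {}".format(payload.get("error")))
    return [am["url"] for am in payload["data"]["activeAlertmanagers"]]


def fetch_alertmanagers(base_url="http://localhost:9090", timeout=5):
    """Ask a live Prometheus (assumed reachable) which Alertmanagers it uses."""
    url = base_url.rstrip("/") + "/api/v1/alertmanagers"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return active_alertmanager_urls(json.load(resp))
```

With the configuration from this post, the returned list should contain a URL pointing at localhost:9093.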

Before firing a real alert: Alertmanager's default configuration sends a webhook to localhost:5001, so we need an application that receives that POST request.
Well, Python's http.server will do.

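$ cat > webhook << 'EOF'
#!/usr/bin/python3
# Minimal webhook receiver: prints the body of every POST request it gets.
import http.server

class Handler(http.server.BaseHTTPRequestHandler):
  def do_POST(self):
    # Read exactly Content-Length bytes of the request body.
    length = int(self.headers.get('Content-Length', 0))
    print('body = {}'.format(self.rfile.read(length).decode('utf-8')))
    # Answer 200 with an empty body so the sender sees a successful delivery.
    self.send_response(200)
    self.send_header("Content-Type", "text/plain; charset=UTF-8")
    self.send_header("Content-Length", "0")
    self.end_headers()

# Listen only on localhost:5001, where Alertmanager's default config posts.
server = http.server.HTTPServer(('localhost', 5001), Handler)
server.serve_forever()
EOF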
$ chmod 700 webhook
$ sudo cp webhook /prometheus/
$ cat > webhook.service << EOF
[Unit]
Description=webhook
After=network.target

[Service]
Type=simple
ExecStart=/prometheus/webhook
Restart=always
Environment=PYTHONUNBUFFERED=1

[Install]
WantedBy=multi-user.target
EOF
$ sudo cp webhook.service /etc/systemd/system/
$ sudo systemctl daemon-reload
$ sudo systemctl start webhook
$ sudo systemctl enable webhook
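Before waiting for Alertmanager, it may be worth sending the receiver a test POST by hand. A minimal sketch; the helper name and the payload in the usage note are made up, and only the address localhost:5001 comes from the setup above.

```python
import json
import urllib.request


def post_json(url, payload, timeout=5):
    """POST a JSON document and return the HTTP status code."""
    body = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Calling post_json("http://localhost:5001/", {"test": "hello"}) should return 200, and the body should then appear in the log of the webhook service.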

Now let's make the alert active. According to df, the VM's "/" filesystem is about 20 GB in total, of which the OS, the Prometheus binaries and so on take up about 2 GB, so I create a 16 GB file.

$ fallocate -l 16G dummy

After running the command, df shows that "/" is now 91% used.

$ df
Filesystem     1K-blocks     Used Available Use% Mounted on
devtmpfs          237048        0    237048   0% /dev
tmpfs             244868        0    244868   0% /dev/shm
tmpfs             244868      448    244420   1% /run
tmpfs             244868        0    244868   0% /sys/fs/cgroup
/dev/xvda1      20959212 18975528   1983684  91% /
tmpfs              48976        0     48976   0% /run/user/1000

Receipt of the webhook can be confirmed via journald or in /var/log/messages.

Well, not bad.
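The logged body is a JSON document, so it can be condensed into one line per alert. The following is only a sketch: it relies on the "alerts", "labels" and "annotations" fields of the Alertmanager webhook payload, and the function name is hypothetical.

```python
import json


def summarize(body):
    """Return one summary line per alert in a webhook body."""
    payload = json.loads(body)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "{} {} severity={} summary={}".format(
                alert.get("status"),
                labels.get("alertname"),
                labels.get("severity"),
                annotations.get("summary"),
            )
        )
    return lines
```

For the rule in this post, the resulting line should mention HighDiskUsage-root with severity=critical.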
