Generating Prometheus Metrics from Logs with Fluentd (1/2)

2 years ago

雅, 悟

4 minutes

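Recently I had the chance to use the Fluentd DaemonSet for Kubernetes and noticed that it ships with the Prometheus plugin. At first I simply thought, "So Fluentd's own metrics can be viewed in Prometheus," but I later found that the plugin can also generate metrics from logs.

So I tried using Fluentd to generate metrics from WAS Liberty access logs.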

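What metrics to generate

Since the metrics are generated from the access log, I first considered which ones would actually be useful. In day-to-day operations there is some information we want to be able to check quickly, so I decided to generate the following metrics from the access log.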

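Metric | Labels attached | Description
Access count | Status code, part of the path | Access counts can be aggregated by status code and by path.
Total response size | Status code, part of the path | Response size can be aggregated by status code and by path. Dividing by the access count gives the average response size.
Total response time | Status code, part of the path | Same as for response size.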

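The number of time series produced depends on how many distinct values each label can take. Status codes are few, but paths can be numerous. To keep the number of series from growing too large, choose the labels with care.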
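As a rough illustration of why the path label deserves attention, the upper bound on the number of series for one metric is the product of the distinct values of its labels. The sketch below is only a back-of-the-envelope calculation; the function name and the label counts are hypothetical and not taken from the measurements in this article.

```python
from math import prod
from typing import Mapping


def series_upper_bound(distinct_values: Mapping[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(distinct_values.values())


# Hypothetical counts: a handful of status codes, a few path prefixes.
print(series_upper_bound({"code": 5, "path": 8}))      # 40

# Same status codes, but labelling by full path (hypothetical 2000 distinct paths).
print(series_upper_bound({"code": 5, "path": 2000}))   # 10000
```

In practice not every combination occurs, so the real number is usually lower, but the bound shows how quickly a high-cardinality label dominates.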

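Checking for overlap with Liberty's own metrics

As a brief aside: there is little point in generating metrics that already exist, so to be safe I checked whether these would duplicate the metrics WAS Liberty already provides.
WAS Liberty provides the following metrics.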

Web application metrics
Thread pool metrics
Session management metrics
Connection pool metrics
JAX-WS metrics

References

Open Liberty: Metrics reference list
WAS Liberty: MicroProfile Metrics 1.1 vendor metrics

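The part that looks closest to what I plan to implement is the web application metrics, but what they provide is RequestCount (the access count per servlet) and ResponseTimeDetails (the total response time per servlet), so there should be no overlap.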

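Preparing Liberty

To keep the Fluentd side simple, WAS Liberty is configured to write its access log in the Apache2 combined format. The definition is shown below. The second field would normally be %l, but here it is replaced with "-".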

<httpEndpoint host="*" httpPort="9080" httpsPort="9443" id="defaultHttpEndpoint">
    <accessLogging enabled="true"
                   filePath="/logs/http_access.log"
                   logFormat='%h - %u %t "%r" %s %b "%{Referer}i" "%{User-agent}i"'/>
</httpEndpoint>

The access log /logs/http_access.log is written to a volume that Fluentd can read.

Generating Prometheus metrics from the access log

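That was a long preamble; now for the main topic.
I will go through the contents of fluent.conf piece by piece, in the order of what each part implements.

The first part is the basic configuration of the Prometheus plugin.
It is taken as-is from the Fluentd DaemonSet used with Kubernetes. With this alone, Prometheus can only obtain the metrics of Fluentd itself.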

<source>
  @type prometheus
  @id in_prometheus
  bind "0.0.0.0"
  port 24231
  metrics_path "/metrics"
</source>

<source>
  @type prometheus_output_monitor
  @id in_prometheus_output_monitor
</source>

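Next is the definition that tails and parses the access log. The log is parsed with the apache2 parser that is built into Fluentd.
Fluentd is also assumed to be set up so that it can read the access log at /logs/http_access.log.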

<source>
  @type tail
  follow_inodes
  # read_from_head true
  path     /logs/http_access.log
  pos_file /logs/http_access.log.pos
  tag *
  <parse>
    @type apache2
  </parse>
</source>

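Adding the following definition makes it possible to inspect the intermediate state.
(This definition is not needed in the end. If it is left in, Fluentd will write unnecessary log output.)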

<filter logs.http_access.log>
  @type stdout
  # format single_value
</filter>

Checking the output of the apache2 parser with the definition above gives the following result:

{
  "host": "172.17.0.1",
  "user": null,
  "method": "POST",
  "path": "/ResourceEater/faces/HeapEater.xhtml",
  "code": 200,
  "size": 3815,
  "referer": "http://localhost:9080/ResourceEater/faces/HeapEater.xhtml",
  "agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0"
}
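To make the mapping from a log line to this record more concrete, here is a minimal Python sketch that parses one combined-format line into similar fields. It is not Fluentd's apache2 parser: the regular expression is a simplified stand-in written for this illustration, and the timestamp in the sample line is hypothetical (the record above does not show one).

```python
import re

# Simplified stand-in for a combined-format parser (NOT Fluentd's own regex).
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<code>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = COMBINED.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["user"] = None if rec["user"] == "-" else rec["user"]
    rec["code"] = int(rec["code"])
    rec["size"] = None if rec["size"] == "-" else int(rec["size"])
    return rec


# Sample line; the timestamp is a hypothetical value.
line = (
    '172.17.0.1 - - [01/Jan/2021:00:00:00 +0000] '
    '"POST /ResourceEater/faces/HeapEater.xhtml HTTP/1.1" 200 3815 '
    '"http://localhost:9080/ResourceEater/faces/HeapEater.xhtml" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0"'
)

rec = parse_line(line)
print(rec["method"], rec["path"], rec["code"], rec["size"], rec["user"])
```

The fields that matter for the rest of this article are path, code, and size.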

Next, add a step that splits the path into two parts.
path_head is the part of the path up to and including the last "/", and path_rest is the remainder.

<filter logs.http_access.log>
  @type parser
  key_name path
  reserve_data true
  <parse>
    @type regexp
    expression /^(?<path_head>.*\/)(?<path_rest>.*)$/
  </parse>
</filter>
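The behaviour of this expression can be checked outside Fluentd. The sketch below uses the same pattern in Python (which spells named groups as (?P<name>...) instead of Ruby's (?<name>...)). The helper name split_path and the last two sample inputs are hypothetical. The query-string case assumes the path field still contains the query string when it reaches this filter.

```python
import re

# Same pattern as the Fluentd filter, in Python's named-group syntax.
PATH_SPLIT = re.compile(r"^(?P<path_head>.*/)(?P<path_rest>.*)$")


def split_path(path):
    """Return (path_head, path_rest), or None when the path contains no "/"."""
    m = PATH_SPLIT.match(path)
    if m is None:
        return None
    return m.group("path_head"), m.group("path_rest")


for p in [
    "/ResourceEater/faces/HeapEater.xhtml",
    "/InfraTest/",
    "/",
    "/search?next=/a/b",   # hypothetical: a "/" inside a query string
    "favicon.ico",         # hypothetical: no "/" at all, so no match
]:
    print(repr(p), "->", split_path(p))
```

Because `.*` is greedy, the split happens at the last "/" anywhere in the value, so slashes in a query string would produce additional path_head values and therefore additional label values.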

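Now that the log is parsed, add the definitions that generate the Prometheus metrics.
Both metrics use the counter type, whose value only ever accumulates. The access count just needs to count records, so no key is specified. The response size total specifies size as the key, so the value of that field is added for each record.
Both metrics specify code and path_head as labels, so the values are accumulated per status code and per path.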

<filter logs.http_access.log>
  @type prometheus
  <metric>
    name liberty_access_total
    desc The total number of access
    type counter
    <labels>
      code ${code}
      path ${path_head}
    </labels>
  </metric>
</filter>

<filter logs.http_access.log>
  @type prometheus
  <metric>
    name liberty_response_size_total
    desc The total response size
    type counter
    key size
    <labels>
      code ${code}
      path ${path_head}
    </labels>
  </metric>
</filter>
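The accumulation described above can be mimicked in a few lines of Python. This is only a conceptual sketch of the two counters, not how the plugin is implemented; the observe function and the sample records are hypothetical.

```python
from collections import defaultdict

# One running total per (code, path) label combination.
access_total = defaultdict(float)
response_size_total = defaultdict(float)


def observe(record):
    """Update both counters for one parsed access-log record."""
    labels = (str(record["code"]), record["path_head"])
    access_total[labels] += 1                        # no key: add 1 per record
    response_size_total[labels] += record["size"]    # key "size": add the field value


# Hypothetical records, shaped like the output of the earlier filters.
for rec in [
    {"code": 200, "path_head": "/images/", "size": 100},
    {"code": 200, "path_head": "/images/", "size": 250},
    {"code": 404, "path_head": "/", "size": 50},
]:
    observe(rec)

for labels, value in access_total.items():
    print("liberty_access_total", labels, value)
for labels, value in response_size_total.items():
    print("liberty_response_size_total", labels, value)
```

Each distinct label combination becomes its own series, which is why the choice of labels directly determines how many series exist.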

The log records themselves are not sent anywhere, so the following directive is needed at the end.

<match logs.http_access.log>
  @type null
</match>

Verification

Start WAS Liberty and Fluentd, then access an application running on WAS Liberty.
Use curl to access the port of Fluentd's Prometheus plugin and extract the metrics.

curl -s http://localhost:24231/metrics  | grep liberty
# TYPE liberty_access_total counter
# HELP liberty_access_total The total number of access
liberty_access_total{code="200",path="/InfraTest/"} 2.0
liberty_access_total{code="404",path="/"} 2.0
liberty_access_total{code="200",path="/ResourceEater/faces/"} 2.0
liberty_access_total{code="200",path="/InfraTest/rest/"} 1.0
liberty_access_total{code="200",path="/"} 2.0
liberty_access_total{code="200",path="/js/"} 2.0
liberty_access_total{code="200",path="/images/"} 3.0
liberty_access_total{code="200",path="/nls/ja/"} 1.0
# TYPE liberty_response_size_total counter
# HELP liberty_response_size_total The total response size
liberty_response_size_total{code="200",path="/InfraTest/"} 11366.0
liberty_response_size_total{code="404",path="/"} 29429.0
liberty_response_size_total{code="200",path="/ResourceEater/faces/"} 7630.0
liberty_response_size_total{code="200",path="/InfraTest/rest/"} 1241.0
liberty_response_size_total{code="200",path="/"} 12965.0
liberty_response_size_total{code="200",path="/js/"} 12795.0
liberty_response_size_total{code="200",path="/images/"} 23568.0
liberty_response_size_total{code="200",path="/nls/ja/"} 1875.0

After making more requests and fetching the metrics again, the access counts and response sizes can be seen to have accumulated.

curl -s http://localhost:24231/metrics  | grep liberty
# TYPE liberty_access_total counter
# HELP liberty_access_total The total number of access
liberty_access_total{code="200",path="/InfraTest/"} 3.0
liberty_access_total{code="404",path="/"} 4.0
liberty_access_total{code="200",path="/ResourceEater/faces/"} 3.0
liberty_access_total{code="200",path="/InfraTest/rest/"} 2.0
liberty_access_total{code="200",path="/"} 2.0
liberty_access_total{code="200",path="/js/"} 2.0
liberty_access_total{code="200",path="/images/"} 3.0
liberty_access_total{code="200",path="/nls/ja/"} 1.0
liberty_access_total{code="200",path="/ResourceEater/"} 1.0
# TYPE liberty_response_size_total counter
# HELP liberty_response_size_total The total response size
liberty_response_size_total{code="200",path="/InfraTest/"} 17049.0
liberty_response_size_total{code="404",path="/"} 88287.0
liberty_response_size_total{code="200",path="/ResourceEater/faces/"} 13661.0
liberty_response_size_total{code="200",path="/InfraTest/rest/"} 2482.0
liberty_response_size_total{code="200",path="/"} 12965.0
liberty_response_size_total{code="200",path="/js/"} 12795.0
liberty_response_size_total{code="200",path="/images/"} 23568.0
liberty_response_size_total{code="200",path="/nls/ja/"} 1875.0
liberty_response_size_total{code="200",path="/ResourceEater/"} 274.0
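As noted in the metrics table, dividing the response size total by the access count gives the average response size. The sketch below does this for two label combinations, using values copied from the second scrape above. The parsing helper is a simplification written for this illustration and only handles lines with exactly the code and path labels in that order.

```python
import re

# Matches lines like: name{code="200",path="/x/"} 3.0
SAMPLE = re.compile(
    r'^(?P<name>\w+)\{code="(?P<code>\d+)",path="(?P<path>[^"]*)"\} (?P<value>[\d.]+)$'
)


def parse(text):
    """Map (metric name, code, path) to the sample value."""
    samples = {}
    for raw in text.splitlines():
        m = SAMPLE.match(raw.strip())
        if m:
            samples[(m["name"], m["code"], m["path"])] = float(m["value"])
    return samples


def average_size(samples, code, path):
    """Average response size = total response size / access count."""
    count = samples[("liberty_access_total", code, path)]
    total = samples[("liberty_response_size_total", code, path)]
    return total / count


scrape = """\
liberty_access_total{code="200",path="/InfraTest/"} 3.0
liberty_access_total{code="404",path="/"} 4.0
liberty_response_size_total{code="200",path="/InfraTest/"} 17049.0
liberty_response_size_total{code="404",path="/"} 88287.0
"""

samples = parse(scrape)
print(average_size(samples, "200", "/InfraTest/"))  # 5683.0
print(average_size(samples, "404", "/"))            # 22071.75
```

In a real setup this division would more likely be done in a Prometheus query over the two counters rather than in a script.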

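Conclusion

I confirmed that metrics can be generated from the logs that WAS Liberty outputs.
I did not get to the total response time metric this time, so I plan to cover it in a separate article.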

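For analyzing container logs, I think a common pattern is to ship the logs to something like Elasticsearch and analyze and visualize them with a tool such as Kibana. However, shipping and storing high-volume logs such as access logs in Elasticsearch can be constrained by capacity and performance.

If the routine analyses that are used most often are done in Fluentd and the results are collected and stored as Prometheus metrics, they can be viewed in Grafana, which may lower the barrier somewhat.

This approach looks useful in many other situations as well, such as generating metrics from the logs an application writes.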

Fluentd: Monitoring by Prometheus
GitHub: fluent / fluent-plugin-prometheus
GitHub: fluent / fluentd-kubernetes-daemonset