node-problem-detector: Changing Kubernetes node conditions with arbitrary monitoring scripts

3 years ago

雅, 悟


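Introduction

This article introduces the custom plugin feature of node-problem-detector (NPD), which lets you run arbitrary scripts to change node conditions or create events. Note that the NPD configuration files used here were verified with v0.8.2.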

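What is node-problem-detector?

node-problem-detector runs as a DaemonSet on every node. When it finds a node problem, it changes the Kubernetes node's conditions or creates an event.

NPD provides the following plugins as ways of detecting node problems.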

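filelog plugin: arbitrary log files

journald plugin: journald logs

kmsg plugin: the kernel log (/dev/kmsg)

custom plugin monitor: detects problems with arbitrary monitoring scripts

custom plugin: reports problems through the standard output and exit code of an arbitrary monitoring script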

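The system log monitor is explained in the AI tech studio article on monitoring nodes with node-problem-detector ("[Kubernetes] node-problem-detector"), which is a useful reference.

NPD also supports exporting metrics.

System stats monitor: outputs statistics for components such as CPU, disk, and memory

This article covers the custom plugin of the custom plugin monitor.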

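What you can do with a custom plugin

The custom plugin monitor runs an arbitrary monitoring script and decides whether a problem was found from the script's exit code and standard output. In principle this lets you implement anything, but the monitoring script runs inside the NPD container, so the NPD container image must contain everything the script needs to run. For that reason most scripts are written as shell scripts, although Go may also be a good fit. If you want to use something like Python, you need to build your own image that includes Python, using the NPD image as the base image.

Custom plugin configuration

A custom plugin is configured with a JSON file like the following.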

{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "30s",
    "timeout": "5s",
    "max_output_length": 80,
    "concurrency": 3,
    "enable_message_change_based_condition_update": false
  },
  "source": "upfile-monitor",
  "metricsReporting": true,
  "conditions": [
    {
      "type": "UpFile",
      "reason": "UpFileExists",
      "message": "up file exists"
    }
  ],
  "rules": [
    {
      "type": "temporary",
      "reason": "UPFileDoesNotExist",
      "path": "/custom-config/plugin/check_up_file.sh",
      "timeout": "3s"
    },
    {
      "type": "permanent",
      "condition": "UpFile",
      "reason": "UpFileDoesNotExist",
      "path": "/custom-config/plugin/check_up_file.sh",
      "timeout": "3s"
    }
  ]
}

plugin specifies the plugin type. For a custom plugin it is custom.

pluginConfig holds the plugin settings. The available settings depend on the plugin type.

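invoke_interval: The interval at which the custom plugin is invoked. Default: 30s.

timeout: How long a custom plugin invocation may run before it is terminated and treated as timed out. A timeout is handled as status Unknown. Default: 5s.

max_output_length: The size at which the custom plugin's standard output is cut off. The truncated output is used as the condition's status message. Default: 80.

concurrency: The number of plugin workers. When one custom plugin has several rules, this is how many of them are invoked at the same time. Default: 3.

enable_message_change_based_condition_update: Whether to update the condition when the message (the custom plugin's standard output) changes. Default: false.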

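source is used as the reporting component when an event is created. Specifically, it becomes the value of the Event object's `.source.component` field.

metricsReporting controls whether results are exported as Prometheus metrics. When it is set to true, they are included in NPD's Prometheus metrics.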

# HELP problem_counter Number of times a specific type of problem have occurred.
# TYPE problem_counter counter
problem_counter{reason="UPFileDoesNotExist"} 17
problem_counter{reason="UpFileDoesNotExist"} 1
# HELP problem_gauge Whether a specific type of problem is affecting the node or not.
# TYPE problem_gauge gauge
problem_gauge{reason="UpFileDoesNotExist",type="UpFile"} 1
problem_gauge{reason="UpFileExists",type="UpFile"} 0
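
To see at a glance which problems are currently affecting a node, the metrics output can be filtered for problem_gauge series whose value is 1. The following is a minimal sketch; active_problems is a hypothetical helper name, not something NPD provides.

```shell
#!/usr/bin/env bash
# active_problems: read Prometheus text-format metrics on standard input and
# print every problem_gauge series whose value is 1.
# Hypothetical helper for illustration; not part of NPD.
active_problems() {
  awk 'index($0, "problem_gauge{") == 1 && $NF == 1 { print $0 }'
}
```

Assuming the metrics endpoint is reachable locally on the port used in this article's manifest (20257), it could be used as `curl -s http://localhost:20257/metrics | active_problems`.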

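conditions sets the default conditions, that is, the condition reported in the normal state. The condition set here is used when the status is False.

rules defines the rules for running monitoring scripts.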

type: either temporary or permanent

temporary: creates an Event

permanent: changes a node Condition

reason: the string used as the Reason of the Event and of the node Condition

path: the path of the monitoring script to run

timeout: overrides the /pluginConfig/timeout setting

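Implementing the monitoring script

To work with NPD, a monitoring script must follow the rules below.

The script's exit code changes depending on whether a problem was found.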

0: OK. No problem was found.

temporary: no Event is created

permanent: the node Condition is set to False

1: NONOK. A problem was found.

temporary: an Event is created (Severity is Warning)

permanent: the node Condition is set to True

2: Unknown. Anything else.

temporary: an Event is created (Severity is Warning)

permanent: the node Condition is set to Unknown

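The string written to standard output is used as the message described below. If it is longer than the length set in /pluginConfig/max_output_length in the custom plugin configuration, it is truncated.

temporary: the string in the Event's message field

permanent: the string in the node Condition's message field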
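
Because standard output becomes the message and is cut at max_output_length, it can help to keep messages short inside the script itself. The following is a minimal sketch; truncate_message is a hypothetical helper, not part of NPD, and it assumes that trimming by character count is close enough to what NPD does.

```shell
#!/usr/bin/env bash
# truncate_message MAX: read a message on standard input and print at most
# MAX characters of it.
# Hypothetical helper, not part of NPD; NPD applies its own truncation regardless.
truncate_message() {
  local max="$1" msg
  msg="$(cat)"
  printf '%s\n' "${msg:0:$max}"
}
```

A monitoring script could then emit its message with, for example, `echo "some long diagnostic text" | truncate_message 80`.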

The following is a sample monitoring script that checks whether the file /custom-data/up exists on the node.

#!/usr/bin/env bash

set -e -o pipefail; [[ -n "$DEBUG" ]] && set -x

readonly OK=0
readonly NONOK=1
readonly UNKNOWN=2

readonly UP_FILE="/custom-data/up"

if [[ -f "$UP_FILE" ]]; then
  echo "$UP_FILE exists"
  exit $OK
fi

echo "$UP_FILE does not exist"
exit $NONOK
# vim: ai ts=2 sw=2 et sts=2 ft=sh
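
The script above only ever returns OK or NONOK. A variant that also uses the Unknown exit code, for the case where the check itself cannot be performed, might look like the following. This is a sketch under assumptions: the function name check_up_file and the rule "a missing directory means Unknown" are illustrative choices, not part of the original script.

```shell
#!/usr/bin/env bash
# Hypothetical variant of the up-file check that separates "cannot tell"
# from "problem found".
readonly OK=0
readonly NONOK=1
readonly UNKNOWN=2

# check_up_file DIR: print a one-line message and return OK, NONOK, or UNKNOWN.
check_up_file() {
  local dir="$1"
  if [[ ! -d "$dir" ]]; then
    # The directory is missing (for example, the volume is not mounted),
    # so the check cannot be performed at all.
    echo "$dir is not a directory"
    return "$UNKNOWN"
  fi
  if [[ -f "$dir/up" ]]; then
    echo "$dir/up exists"
    return "$OK"
  fi
  echo "$dir/up does not exist"
  return "$NONOK"
}
```

When used as a plugin, the script would end with `check_up_file /custom-data; exit $?` so that the return value becomes the exit code NPD sees.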

The events and node condition created by the custom plugin configuration and monitoring script shown so far look like this.

Events

$ kubectl get events
LAST SEEN   TYPE      REASON               OBJECT          MESSAGE
46m         Normal    UpFileDoesNotExist   node/minikube   Node condition UpFile is now: True, reason: UpFileDoesNotExist
116s        Warning   UPFileDoesNotExist   node/minikube   /custom-data/up does not exist

Node conditions

$ kubectl describe node minikube | grep -A 9 Conditions
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  KernelDeadlock       False   Wed, 05 Aug 2020 14:03:52 +0900   Wed, 05 Aug 2020 13:17:44 +0900   KernelHasNoDeadlock          kernel has no deadlock
  ReadonlyFilesystem   False   Wed, 05 Aug 2020 14:03:52 +0900   Wed, 05 Aug 2020 13:17:44 +0900   FilesystemIsNotReadOnly      Filesystem is not read-only
  UpFile               True    Wed, 05 Aug 2020 14:03:52 +0900   Wed, 05 Aug 2020 13:18:44 +0900   UpFileDoesNotExist           /custom-data/up does not exist
  MemoryPressure       False   Wed, 05 Aug 2020 14:01:33 +0900   Thu, 30 Jul 2020 14:40:50 +0900   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Wed, 05 Aug 2020 14:01:33 +0900   Thu, 30 Jul 2020 14:40:50 +0900   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Wed, 05 Aug 2020 14:01:33 +0900   Thu, 30 Jul 2020 14:40:50 +0900   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Wed, 05 Aug 2020 14:01:33 +0900   Thu, 30 Jul 2020 14:40:52 +0900   KubeletReady                 kubelet is posting ready status

$ kubectl get node minikube -o yaml | grep -B 5 "type: UpFile"
  - lastHeartbeatTime: "2020-08-05T05:03:52Z"
    lastTransitionTime: "2020-08-05T04:18:44Z"
    message: /custom-data/up does not exist
    reason: UpFileDoesNotExist
    status: "True"
    type: UpFile
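
Instead of grepping the YAML output, a single condition can also be read with kubectl's JSONPath output. The following is a minimal sketch; node_condition_status is a hypothetical wrapper name introduced here for illustration.

```shell
#!/usr/bin/env bash
# node_condition_status NODE TYPE: print the status (True, False, or Unknown)
# of one condition on one node, using kubectl's JSONPath filter syntax.
# Hypothetical wrapper for illustration.
node_condition_status() {
  local node="$1" type="$2"
  kubectl get node "$node" \
    -o jsonpath="{.status.conditions[?(@.type==\"$type\")].status}"
}
```

With the state shown above, `node_condition_status minikube UpFile` would be expected to print True.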

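Registering the custom plugin and running NPD

Once the custom plugin configuration and the monitoring script are ready, register them with NPD and run it. The following is an excerpt from a sample manifest.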

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
spec:
    ...
    spec:
      containers:
      - name: node-problem-detector
        image:  "k8s.gcr.io/node-problem-detector:v0.8.2"
        command:
        - /node-problem-detector
        - --logtostderr
        - --config.system-log-monitor=/config/kernel-monitor.json,/config/docker-monitor.json
        - --prometheus-address=0.0.0.0
        - --prometheus-port=20257
        - --k8s-exporter-heartbeat-period=5m0s
        # Specify the custom plugin configuration file
        - --custom-plugin-monitors=/custom-config/custom-plugin-monitor.json
        volumeMounts:
        ...
        - name: custom-data
          mountPath: /custom-data
        - name: custom-config
          mountPath: /custom-config
          readOnly: true
        - name: custom-plugin
          mountPath: /custom-config/plugin
          readOnly: true
      ...
      # This monitoring script checks whether a file exists on the host, so the directory is mounted with hostPath
      - name: custom-data
        hostPath:
          path: /custom-data
          type: Directory
      # The custom plugin configuration file is taken from a ConfigMap
      - name: custom-config
        configMap:
          name: node-problem-detector-custom-config
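      # The monitoring script here is a shell script, so it is also taken from a ConfigMap
      # The monitoring script must be executable, so it is a good idea to set the file mode to something like 555 with configMap.defaultMode
      # If the plugin is implemented in Go, use an init container and a directory shared through emptyDir so that the NPD container can reference it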
      - name: custom-plugin
        configMap:
          name: node-problem-detector-custom-plugin
          defaultMode: 0555

The complete version of the manifest above is at https://github.com/superbrothers-sandbox/try-node-problem-detector-custom-plugin-monitor/blob/master/node-problem-detector.yaml. Steps for trying it on a minikube cluster are at https://github.com/superbrothers-sandbox/try-node-problem-detector-custom-plugin-monitor, so give it a try if you are interested.
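
The manifest excerpt refers to two ConfigMaps without showing how they are created. One way to create them from local files is sketched below. The ConfigMap names and file keys are taken from the manifest and the plugin configuration in this article; the function name create_npd_configmaps and the local file paths are assumptions, and the namespace is left to the current kubectl context.

```shell
#!/usr/bin/env bash
# create_npd_configmaps CONFIG_JSON SCRIPT: create the two ConfigMaps that the
# manifest mounts at /custom-config and /custom-config/plugin.
# Hypothetical helper; the keys are chosen to match the paths used in the article.
create_npd_configmaps() {
  local config_json="$1" script="$2"
  kubectl create configmap node-problem-detector-custom-config \
    --from-file=custom-plugin-monitor.json="$config_json" &&
  kubectl create configmap node-problem-detector-custom-plugin \
    --from-file=check_up_file.sh="$script"
}
```

For example: `create_npd_configmaps ./custom-plugin-monitor.json ./check_up_file.sh`.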

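Taking action when a node condition changes

By creating events and changing node conditions with NPD, you can see the state of a node, notice problems through monitoring, and deal with them. Beyond that, once the fix for a problem is known, you will want to automate it.

With planetlabs/draino, you can run the equivalent of kubectl drain on a node when it enters a specific condition. The tool automatically drains target nodes based on their labels and conditions.

With pfnet-research/node-operation-controller, you can run arbitrary operations on nodes that are in a specific condition. For example, it can automatically reboot a node that has a problem. This controller is actually used on clusters at PFN.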

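Conclusion

This article explained how to use custom plugins in NPD. NPD offers several ways to set node conditions, which makes it very convenient. Combined with a controller, it also makes it easy to automate node operations. Please give it a try.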