I tried Progressive Delivery on OpenShift! (Argo Rollouts Advanced: Integrating with Prometheus Metrics)

科, 颖

Background

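This article is a continuation of the following articles; please refer to them for background.
I tried Progressive Delivery on OpenShift! (Argo Rollouts Basics: CLI)
I tried Progressive Delivery on OpenShift! (Argo Rollouts Basics: GUI)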

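Prerequisites

Argo Rollouts has been installed by following the steps in this article:

I tried Progressive Delivery on OpenShift! (Argo Rollouts Basics: CLI)

Prometheus Operator has been installed on the OpenShift cluster by following the steps in this article:

I introduced Prometheus Operator into OpenShift v4.6 and used it to monitor user-defined projects!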

Let's try it

Switch to the verification project.

Reuse the project that was used for verification when following the steps in "I introduced Prometheus Operator into OpenShift v4.6 and used it to monitor user-defined projects!".

PS D:\git> oc project prometheus-operator
Now using project "prometheus-operator" on server "https://c103-e.us-south.containers.cloud.ibm.com:31989".
PS D:\git>

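Create the AnalysisTemplate manifest

- Define the AnalysisTemplate resource with the following settings
    - successCondition: the result of the query below is 5 or less
    - provider: prometheus
        - address: the endpoint of the Prometheus Service exposed by following the steps in [I introduced Prometheus Operator into OpenShift v4.6 and used it to monitor user-defined projects!](https://qiita.com/strada/items/23bccc596be587fe003c)
        - query: the change over the last 5 minutes in the number of HTTP requests that match the following conditions
            - the HTTP response code is anything other than 200
            - the target is service="rollout-canary" in namespace="prometheus-operator"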

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: failure-count
spec:
  metrics:
  - name: http_error_count
    successCondition: result[0] <= 5
    provider:
      prometheus:
        address: "http://prometheus.prometheus-operator.svc:9090"
        query: |
          delta(http_requests_total{code!~"200",namespace="prometheus-operator",service="rollout-canary"}[5m])
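
To make the successCondition above concrete, here is a minimal Python sketch of the comparison it expresses. It assumes the standard response shape of the Prometheus instant-query HTTP API; the function names and the sample responses are hypothetical and written by hand for illustration, and this is not how Argo Rollouts itself is implemented.

```python
from typing import Any, Dict, List


def sample_values(response: Dict[str, Any]) -> List[float]:
    """Extract sample values from a Prometheus instant-query (vector) response."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    # Each vector element looks like {"metric": {...}, "value": [<ts>, "<float as string>"]}.
    return [float(item["value"][1]) for item in response["data"]["result"]]


def meets_success_condition(response: Dict[str, Any], limit: float = 5.0) -> bool:
    """Hypothetical stand-in for the condition `result[0] <= 5`."""
    result = sample_values(response)
    if not result:
        # No matching series (for example, no non-200 response recorded yet):
        # there is no result[0] to compare against the limit.
        raise LookupError("query returned an empty vector")
    return result[0] <= limit


# Hand-written example responses (assumptions, not captured from a real cluster).
two_errors = {
    "status": "success",
    "data": {"resultType": "vector",
             "result": [{"metric": {"code": "404"}, "value": [1626160000, "2"]}]},
}
seven_errors = {
    "status": "success",
    "data": {"resultType": "vector",
             "result": [{"metric": {"code": "404"}, "value": [1626160000, "7"]}]},
}

print(meets_success_condition(two_errors))    # True
print(meets_success_condition(seven_errors))  # False
```

The empty-vector branch is worth noticing: if no series matches the label selector, there is no first element to compare, so the condition cannot be evaluated as written.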

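Create the manifest for the Rollout, Service, and ServiceMonitor

strategy: canary

steps:

- Shift 20% to the canary
- Pause
- Once promoted, run the analysis defined in the failure-count AnalysisTemplate
- If it succeeds, shift 40% to the canary
- Wait 40 seconds
- Shift 60% to the canary
- Wait 20 seconds
- Shift 80% to the canary
- Wait 20 seconds
- Shift 100% to the canary; the Rollout completes and the canary becomes stable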

# This example demonstrates a Rollout using the canary update strategy with a customized rollout
# plan. The prescribed steps initially set a canary weight of 20%, then pause indefinitely. Once
# resumed, the rollout runs the failure-count analysis, then increases the weight in automated
# 20% increments until it reaches 100%.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: rollout-canary
spec:
  replicas: 5
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: rollout-canary
  template:
    metadata:
      labels:
        app: rollout-canary
    spec:
      containers:
      - name: rollouts-demo
        image: quay.io/brancz/prometheus-example-app:v0.2.0
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
  strategy:
    canary:
      steps:
      - setWeight: 20
      # The following pause step will pause the rollout indefinitely until manually resumed.
      # Rollouts can be manually resumed by running `kubectl argo rollouts promote ROLLOUT`
      - pause: {}
      - analysis:
          templates:
          - templateName: failure-count
      - setWeight: 40
      - pause: {duration: 40s}
      - setWeight: 60
      - pause: {duration: 20s}
      - setWeight: 80
      - pause: {duration: 20s}
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: rollout-canary
  name: rollout-canary
spec:
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
    name: web
  selector:
    app: rollout-canary
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    team: frontend
    k8s-app: rollout-canary
  name: rollout-canary
spec:
  endpoints:
  - interval: 30s
    port: web
    scheme: http
  selector:
    matchLabels:
      app: rollout-canary
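
The canary steps in the Rollout above can be read as a small state machine. The following Python sketch walks the same step list and shows how the outcome of the analysis step decides whether the weight keeps increasing. The `STEPS` list and the `walk` function are hypothetical names used only for illustration; the sketch ignores real time and models an aborted rollout as the canary weight returning to 0.

```python
from typing import Callable, List, Optional, Tuple

# Mirrors the steps in the Rollout manifest as (kind, value) pairs.
STEPS: List[Tuple[str, Optional[object]]] = [
    ("setWeight", 20),
    ("pause", None),               # indefinite pause until promoted
    ("analysis", "failure-count"),
    ("setWeight", 40),
    ("pause", 40),                 # seconds
    ("setWeight", 60),
    ("pause", 20),
    ("setWeight", 80),
    ("pause", 20),
]


def walk(steps: List[Tuple[str, Optional[object]]],
         analysis_passes: Callable[[str], bool]) -> Tuple[str, List[int]]:
    """Return the final status and the sequence of canary weights that were set."""
    weights: List[int] = []
    for kind, value in steps:
        if kind == "setWeight":
            weights.append(int(value))
        elif kind == "analysis" and not analysis_passes(str(value)):
            # A failed analysis aborts the rollout; the canary weight goes back to 0.
            weights.append(0)
            return "aborted", weights
        # Pause steps only delay progress, so they do not change the weight.
    # After the last step the canary is fully promoted and becomes stable.
    weights.append(100)
    return "completed", weights


print(walk(STEPS, lambda name: True))   # ('completed', [20, 40, 60, 80, 100])
print(walk(STEPS, lambda name: False))  # ('aborted', [20, 0])
```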

Apply the manifests

PS D:\git> oc apply -f .\prometheus-anlysis-template.yaml
analysistemplate.argoproj.io/failure-count created
PS D:\git> oc apply -f .\rollout-canary.yaml
rollout.argoproj.io/rollout-canary created
service/rollout-canary created
servicemonitor.monitoring.coreos.com/rollout-canary created
PS D:\git>

Start the UI dashboard

PS D:\git> docker run -p 3100:3100 -v C:\Users\YASUYUKIKUBOTA\.kube\config:/.kube/config quay.io/argoproj/kubectl-argo-rollouts:master dashboard --insecure-skip-tls-verify
time="2021-07-13T07:14:38Z" level=info msg="Argo Rollouts Dashboard is now available at localhost 3100"

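Open the dashboard and view the Rollout

Access localhost:3100 and, in the NS (namespace) selector, choose the verification project (prometheus-operator in this walkthrough).

The initial rollout completes without running the analysis.

Access the sample application

Run the following command to forward port 8080 of the rollout-canary Service to localhost.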

PS D:\git> oc port-forward svc/rollout-canary 8080:8080
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080

Access localhost:8080/err about twice.

Note: it is fine if nothing is displayed on the page.

Access the Prometheus UI

Run the following command to forward port 9090 of the Prometheus Service to localhost.

PS D:\git> oc port-forward svc/prometheus 9090:9090 -n prometheus-operator
Forwarding from 127.0.0.1:9090 -> 9090
Forwarding from [::1]:9090 -> 9090

Access localhost:9090 and run the following query.

http_requests_total{code!~"200",namespace="prometheus-operator",service="rollout-canary"}

If the result shows the number of accesses to /err, the metrics are being collected correctly.

Run the actual query used in the AnalysisTemplate.

delta(http_requests_total{code!~"200",namespace="prometheus-operator",service="rollout-canary"}[5m])

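Because the count of 404 errors has not increased within the last 5 minutes, the result should be 0.
Note: if the access happened just before or after a Prometheus scrape, the result may be 1.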
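
The 0-versus-1 behavior described above comes from which scraped samples fall inside the 5-minute window. The Python sketch below is a deliberately simplified model: it returns the last sample minus the first sample inside the window. The real `delta()` in Prometheus additionally extrapolates to cover the full window, so it can return non-integer values. The function name and the sample data are hypothetical.

```python
from typing import List, Tuple


def simple_delta(samples: List[Tuple[float, float]], now: float,
                 window: float = 300.0) -> float:
    """Last minus first sample value inside [now - window, now], without extrapolation.

    `samples` is a time-ordered list of (timestamp_seconds, counter_value) pairs.
    """
    in_window = [value for ts, value in samples if now - window <= ts <= now]
    if len(in_window) < 2:
        raise ValueError("need at least two samples inside the window")
    return in_window[-1] - in_window[0]


# Scrapes every 30 s (the ServiceMonitor interval); the counter stays at 2 throughout.
steady = [(float(t), 2.0) for t in range(0, 301, 30)]

# Same scrape times, but the second /err access landed after the first scrape in the window.
one_late = [(0.0, 1.0)] + [(float(t), 2.0) for t in range(30, 301, 30)]

print(simple_delta(steady, now=300.0))    # 0.0
print(simple_delta(one_late, now=300.0))  # 1.0
```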

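Run the Rollout (normal case)

Access the Dashboard (localhost:3100), change the tag of the rollouts-demo container under Containers to v0.3.0, and start the Rollout.

For detailed steps, see "I tried Progressive Delivery on OpenShift! (Argo Rollouts Basics: GUI)".

When the rollout pauses with 20% shifted to the canary, click PROMOTE.

The Analysis Run succeeds, and after a while the Rollout completes.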

Run the Rollout (analysis fails and the rollout is rolled back automatically)

delta(http_requests_total{code!~"200",namespace="prometheus-operator",service="rollout-canary"}[5m])

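Because the Analysis Run succeeds when the result of this query is 5 or less, first confirm that the result is greater than 5 (for example, after accessing localhost:8080/err several more times).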
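
One way to push the error count over the threshold without clicking in a browser is a small script. This is a minimal sketch using only the Python standard library; the `hit` function name is hypothetical, and it assumes the `oc port-forward svc/rollout-canary 8080:8080` session from earlier is still running.

```python
import urllib.error
import urllib.request
from typing import Callable, List


def hit(url: str, times: int,
        opener: Callable = urllib.request.urlopen) -> List[int]:
    """Request `url` `times` times and return the HTTP status code of each attempt."""
    codes: List[int] = []
    for _ in range(times):
        try:
            with opener(url, timeout=5) as resp:
                codes.append(resp.status)
        except urllib.error.HTTPError as err:
            # urlopen raises HTTPError for 4xx/5xx responses; the status is in err.code.
            codes.append(err.code)
    return codes


# Example usage (requires the port-forward to be active):
#   print(hit("http://localhost:8080/err", 6))
```

Because the ServiceMonitor scrapes every 30 seconds, it may take a short while before the new requests are reflected in the query result.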

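Access the Dashboard (localhost:3100) and click PROMOTE; the Analysis Run fails.

Summary

I confirmed that Argo Rollouts can run a Prometheus query specified in an AnalysisTemplate and, based on the result, decide whether to automatically advance the Rollout or abort it. However, this trial also revealed the following issue.

Would using a PodMonitor instead of a ServiceMonitor solve it?

Issues like this must be resolved before real-world use, but at least the basic behavior has been confirmed, so I will actively propose this approach to any customers interested in this area.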