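Problems I ran into with Datadog on EKS Fargate

Introduction

I want to write down the points I got stuck on when introducing Datadog, where infrastructure data turned out to be hard to collect.

Datadog architecture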

This is the architecture when Datadog is deployed on Kubernetes.

[Figure: Datadog architecture on Kubernetes]

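The official documentation describes it as follows.

The Cluster Agent acts as a proxy between the API server and the node-based Agents. This not only reduces the direct load on the API server, it also lets the node-based Agents focus on node-level data collection, while the Cluster Agent collects cluster-level data from the master nodes. The Cluster Agent sends cluster-level metadata back to the node-based Agents, so locally collected metrics can be enriched with consistent tags across the whole cluster. In addition, because the node-based Agents no longer need to query the API server for this data, their RBAC rules can be reduced to reading metrics and metadata from the kubelet.

Reference: https://www.datadoghq.com/ja/blog/datadog-cluster-agent

See the link above for details on the Datadog Cluster Agent.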

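Why did it fail?

Permission errors against the kubelet

On Fargate, the NodeAgent reaches the node by proxying through the API server, and then collects the metrics inside the node.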

	if kubeletProxyEnabled {
		// Explicitly disable HTTP to reach APIServer
		kubeletHTTPPort = 0
		httpsPort, err := strconv.ParseUint(os.Getenv("KUBERNETES_SERVICE_PORT"), 10, 16)
		if err != nil {
			return nil, fmt.Errorf("unable to get APIServer port: %w", err)
		}
		kubeletHTTPSPort = int(httpsPort)

		if config.Datadog.Get("kubernetes_kubelet_nodename") != "" {
			kubeletPathPrefix = fmt.Sprintf("/api/v1/nodes/%s/proxy", kubeletNodeName)
			apiServerHost := os.Getenv("KUBERNETES_SERVICE_HOST")

			potentialHosts = &connectionInfo{
				hostnames: []string{apiServerHost},
			}
			log.Infof("EKS on Fargate mode detected, will proxy calls to the Kubelet through the APIServer at %s:%d%s", apiServerHost, kubeletHTTPSPort, kubeletPathPrefix)
		} else {
			return nil, errors.New("kubelet proxy mode enabled but nodename is empty - unable to query")
		}
	} else {
		// Building a list of potential ips/hostnames to reach Kubelet
		potentialHosts = getPotentialKubeletHosts(kubeletHost)
	}

	// Checking HTTPS first if port available
	var httpsErr error
	if kubeletHTTPSPort > 0 {
		httpsErr = checkKubeletConnection(ctx, "https", kubeletHTTPSPort, kubeletPathPrefix, potentialHosts, &clientConfig)
		if httpsErr != nil {
			log.Debug("Impossible to reach Kubelet through HTTPS")
			if kubeletHTTPPort <= 0 {
				return nil, httpsErr
			}
		} else {
			return newForConfig(clientConfig, kubeletTimeout)
		}
	}

For that reason, I granted the ClusterRole the following permissions.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: datadog-agent
rules:
  - apiGroups:
    - ""
    resources:
    - nodes
    - namespaces
    verbs:
    - get
    - list
  - apiGroups:
      - ""
    resources:
      - nodes/metrics
      - nodes/spec
      - nodes/stats
      - nodes/proxy
      - nodes/pods
      - nodes/healthz
    verbs:
      - get

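Communication across namespaces

The ClusterAgent runs in the "datadog" namespace and the NodeAgent runs in the "golang" namespace, so the two have to communicate across namespaces. The NodeAgent therefore needs the ClusterAgent's service endpoint, which I set in DD_CLUSTER_AGENT_URL.

The URL is written out in full with the svc.cluster.local suffix because the bare service name cannot be resolved from a different namespace.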

- name: DD_CLUSTER_AGENT_URL
  value: "https://datadog-agent-cluster-agent.datadog.svc.cluster.local:5005"
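The value above follows the usual in-cluster DNS pattern of service, namespace, svc, and cluster domain. As a small sketch of how that name is put together (clusterAgentURL is a hypothetical helper, and cluster.local is assumed to be the cluster domain, which is the common default but can differ per cluster):

```go
package main

import "fmt"

// clusterAgentURL (hypothetical helper) builds the HTTPS URL of a Service from
// its fully qualified in-cluster DNS name:
// <service>.<namespace>.svc.<clusterDomain>
func clusterAgentURL(service, namespace, clusterDomain string, port int) string {
	return fmt.Sprintf("https://%s.%s.svc.%s:%d", service, namespace, clusterDomain, port)
}

func main() {
	// Service name, namespace, and port are the ones used in this article.
	fmt.Println(clusterAgentURL("datadog-agent-cluster-agent", "datadog", "cluster.local", 5005))
}
```

Because the namespace is part of the name, a pod in the "golang" namespace can resolve it even though the Service lives in "datadog".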

Token mismatch

The ClusterAgent and the NodeAgent were configured with different tokens, which resulted in a 403 error.

if len(tok) < 2 || tok[1] != GetAuthToken() {
    err = fmt.Errorf("invalid session token")
    http.Error(w, err.Error(), 403)
}
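To reproduce the behaviour in isolation, here is a simplified, self-contained sketch of a bearer-token check of this shape. It is my own reduction, not the Agent's code: checkToken is a hypothetical function, the expected token is passed in as a plain argument instead of coming from GetAuthToken(), and the 401 for a non-Bearer scheme is an assumption on my part.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// checkToken (hypothetical) returns the HTTP status a request would receive
// for the given Authorization header when the server expects `expected` as
// the bearer token.
func checkToken(authHeader, expected string) int {
	tok := strings.Split(authHeader, " ")
	if tok[0] != "Bearer" {
		// Assumption: a missing or non-Bearer scheme is treated as unauthorized.
		return http.StatusUnauthorized
	}
	if len(tok) < 2 || tok[1] != expected {
		// Same condition as the snippet above: wrong or missing token.
		return http.StatusForbidden
	}
	return http.StatusOK
}

func main() {
	fmt.Println(checkToken("Bearer node-token", "cluster-token")) // tokens differ
	fmt.Println(checkToken("Bearer shared-token", "shared-token")) // tokens match
}
```

The first call is the situation I was in: both sides send and expect a token, but not the same one, so the request is refused with 403.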

By default, the ClusterAgent automatically generates a random token as the secret it shares with the NodeAgent.

util.CreateAndSetAuthToken()

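However, when the two run in different namespaces, the NodeAgent cannot reference the generated Secret, because a pod can only reference Secrets in its own namespace. So I created a token manually and made it available as a Secret to both the ClusterAgent and the NodeAgent (on the ClusterAgent side, through the clusterAgent.tokenExistingSecret setting).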
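For creating the token by hand, any random string generator will do. A minimal sketch in Go, assuming a 32-character hex token is acceptable (newToken is a hypothetical helper, and the length is my assumption based on the minimum I remember from the chart's documentation, so check the current requirement before relying on it):

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newToken (hypothetical helper) returns a random hex string that is
// 2*nBytes characters long, read from the OS random source.
func newToken(nBytes int) (string, error) {
	b := make([]byte, nBytes)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

func main() {
	tok, err := newToken(16) // 16 random bytes encode to 32 hex characters
	if err != nil {
		panic(err)
	}
	fmt.Println(tok)
}
```

The printed value is what I would store once in the secret manager, so that both namespaces read the same token.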

With Helm and ArgoCD, the ClusterAgent is configured as follows.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: datadog-agent
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: https://helm.datadoghq.com
    targetRevision: 2.27.2
    helm:
      values: |
        targetSystem: linux
        datadog:
          apiKeyExistingSecret: datadog-secrets
          appKeyExistingSecret: datadog-secrets
          leaderElection: true
          collectEvents: true
          kubeStateMetricsEnabled: false
          kubeStateMetricsCore:
            enabled: true
          logLevel: INFO
          apm:
            enabled: true
          processAgent:
            enabled: true
            processCollection: true
        clusterAgent:
          tokenExistingSecret: datadog-secrets
          enabled: true
          metricsProvider:
            enabled: true
          rbac:
            create: true
        clusterChecks:
          enabled: true
        clusterChecksRunner:
          enabled: true
          replicas: 2
          rbac:
            create: true
        agents:
          rbac:
            create: true
      parameters:
        - name: "datadog.tags[0]"
          value: "env:prd"
        - name: "datadog.tags[1]"
          value: "system:golang"
        - name: "datadog.clusterName"
          value: "golang-cluster"
    chart: datadog
  destination:
    server: 'https://kubernetes.default.svc'
    namespace: datadog
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
---
apiVersion: external-secrets.io/v1alpha1
kind: SecretStore
metadata:
  name: datadog
  namespace: datadog
spec:
  provider:
    aws:
      service: SecretsManager
      region: ap-northeast-1
---
apiVersion: external-secrets.io/v1alpha1
kind: ExternalSecret
metadata:
  name: datadog-secrets
  namespace: datadog
spec:
  refreshInterval: 1m
  secretStoreRef:
    name: datadog
    kind: SecretStore
  target:
    name: datadog-secrets
    creationPolicy: Owner
  data:
  - secretKey: api-key
    remoteRef:
      key: datadog/apikey
  - secretKey: app-key
    remoteRef:
      key: datadog/appkey
  # secretKey must be token
  - secretKey: token
    remoteRef:
      key: datadog/cluster-agent-token

On the NodeAgent side, the configuration is as follows.

 - name: DD_CLUSTER_AGENT_AUTH_TOKEN
   valueFrom:
     secretKeyRef:
       name: golang-secrets
       key: token

apiVersion: external-secrets.io/v1alpha1
kind: SecretStore
metadata:
  name: golang
  namespace: golang
spec:
  provider:
    aws:
      service: SecretsManager
      region: ap-northeast-1
---
apiVersion: external-secrets.io/v1alpha1
kind: ExternalSecret
metadata:
  name: golang-secrets
  namespace: golang
spec:
  refreshInterval: 1m
  secretStoreRef:
    name: golang
    kind: SecretStore
  target:
    name: golang-secrets
    creationPolicy: Owner
  data:
  - secretKey: datadog_api_key
    remoteRef:
      key: datadog/apikey
  # token
  - secretKey: token
    remoteRef:
      key: datadog/cluster-agent-token

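Summary

With EKS on EC2, Datadog can be introduced easily with a DaemonSet, but with Fargate there is a lot to consider. On top of that, in a mixed EC2 and Fargate setup I think security group rules and the like also need to be taken into account, so this was harder than I expected.

Other

I used the following commands for the investigation: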

datadog-cluster-agent status
cat /var/log/datadog/datadog-cluster-agent.log
agent status
cat /var/log/datadog/agent.log

References

    Amazon EKS on AWS Fargate

    Cluster Agent documentation
