Problems I ran into with Datadog on EKS Fargate
Introduction
I would like to summarize the pain points I hit when introducing Datadog, all stemming from how hard it is to collect infrastructure data in this setup.
Datadog's architecture
This is the architecture for running Datadog on Kubernetes.
The official documentation describes it as follows.
The Cluster Agent acts as a proxy between the API server and the node-based Agents: it reduces the direct load on the API server and lets the node-based Agents focus on node-level data collection, while the Cluster Agent collects cluster-level data from the control plane. The Cluster Agent sends cluster-level metadata back to the node-based Agents, so locally collected metrics can be enriched with consistent tags across the whole cluster. And since the node-based Agents no longer need to query the API server for this data, their RBAC rules can be pared down so that they read metrics and metadata from the kubelet only.
Reference: https://www.datadoghq.com/ja/blog/datadog-cluster-agent
See the link above for details on the Datadog Cluster Agent.
Why did it fail?
Permission errors against the kubelet
On Fargate, the Node Agent cannot reach the kubelet directly; instead it proxies its calls to the node through the API server and then collects the in-node metrics that way.
if kubeletProxyEnabled {
	// Explicitly disable HTTP to reach APIServer
	kubeletHTTPPort = 0
	httpsPort, err := strconv.ParseUint(os.Getenv("KUBERNETES_SERVICE_PORT"), 10, 16)
	if err != nil {
		return nil, fmt.Errorf("unable to get APIServer port: %w", err)
	}
	kubeletHTTPSPort = int(httpsPort)
	if config.Datadog.Get("kubernetes_kubelet_nodename") != "" {
		kubeletPathPrefix = fmt.Sprintf("/api/v1/nodes/%s/proxy", kubeletNodeName)
		apiServerHost := os.Getenv("KUBERNETES_SERVICE_HOST")
		potentialHosts = &connectionInfo{
			hostnames: []string{apiServerHost},
		}
		log.Infof("EKS on Fargate mode detected, will proxy calls to the Kubelet through the APIServer at %s:%d%s", apiServerHost, kubeletHTTPSPort, kubeletPathPrefix)
	} else {
		return nil, errors.New("kubelet proxy mode enabled but nodename is empty - unable to query")
	}
} else {
	// Building a list of potential ips/hostnames to reach Kubelet
	potentialHosts = getPotentialKubeletHosts(kubeletHost)
}

// Checking HTTPS first if port available
var httpsErr error
if kubeletHTTPSPort > 0 {
	httpsErr = checkKubeletConnection(ctx, "https", kubeletHTTPSPort, kubeletPathPrefix, potentialHosts, &clientConfig)
	if httpsErr != nil {
		log.Debug("Impossible to reach Kubelet through HTTPS")
		if kubeletHTTPPort <= 0 {
			return nil, httpsErr
		}
	} else {
		return newForConfig(clientConfig, kubeletTimeout)
	}
}
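To make the proxy branch concrete, here is a minimal, self-contained sketch of how the resulting kubelet URL is assembled. `buildKubeletProxyURL` and the example node name are my own illustrations, not part of the agent's code; the env-var fallbacks exist only so the sketch runs outside a cluster.

```go
package main

import (
	"fmt"
	"os"
)

// buildKubeletProxyURL mirrors the Fargate branch above: instead of
// talking to the kubelet directly, calls go through the API server
// under /api/v1/nodes/<node>/proxy.
func buildKubeletProxyURL(host, port, nodeName, path string) string {
	return fmt.Sprintf("https://%s:%s/api/v1/nodes/%s/proxy%s", host, port, nodeName, path)
}

func main() {
	// In a real pod these env vars are injected by Kubernetes;
	// the fallbacks are only for running this sketch locally.
	host := os.Getenv("KUBERNETES_SERVICE_HOST")
	if host == "" {
		host = "10.100.0.1"
	}
	port := os.Getenv("KUBERNETES_SERVICE_PORT")
	if port == "" {
		port = "443"
	}
	// Hypothetical Fargate node name for illustration.
	nodeName := "fargate-ip-10-0-1-23.ap-northeast-1.compute.internal"
	fmt.Println(buildKubeletProxyURL(host, port, nodeName, "/stats/summary"))
}
```

This is also why the nodename check in the agent code matters: without `kubernetes_kubelet_nodename`, the proxy path cannot be built at all.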
To make this work, we grant the ClusterRole the following permissions.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: datadog-agent
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
      - namespaces
    verbs:
      - get
      - list
  - apiGroups:
      - ""
    resources:
      - nodes/metrics
      - nodes/spec
      - nodes/stats
      - nodes/proxy
      - nodes/pods
      - nodes/healthz
    verbs:
      - get
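A ClusterRole by itself grants nothing until it is bound to the agent's ServiceAccount. As a sketch only (the ServiceAccount name and its namespace are assumptions based on this setup, not taken from the original manifests), the binding could look like:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: datadog-agent
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: datadog-agent
subjects:
  - kind: ServiceAccount
    name: datadog-agent   # assumed ServiceAccount name
    namespace: golang     # namespace where the Node Agent runs
```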
Communication across namespaces
The Cluster Agent runs in the "datadog" namespace while the Node Agent runs in the "golang" namespace, so the two have to communicate across namespaces. That requires the Cluster Agent's Service endpoint, which is configured via DD_CLUSTER_AGENT_URL.
The FQDN including svc.cluster.local is specified because the short service name alone could not be resolved by the default resolver.
- name: DD_CLUSTER_AGENT_URL
  value: "https://datadog-agent-cluster-agent.datadog.svc.cluster.local:5005"
Token mismatch
The Cluster Agent and the Node Agent ended up with different tokens, which produced 403 errors.
if len(tok) < 2 || tok[1] != GetAuthToken() {
	err = fmt.Errorf("invalid session token")
	http.Error(w, err.Error(), 403)
}
By default, the Cluster Agent auto-generates a random token and stores it in a Secret shared with the Node Agent.
util.CreateAndSetAuthToken()
However, when the agents run in different namespaces, the Node Agent cannot read that generated Secret. So I created a token manually and configured both the Cluster Agent and the Node Agent to reference it as a Secret (via the clusterAgent.tokenExistingSecret setting).
The Cluster Agent is deployed with Helm via ArgoCD, configured as follows.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: datadog-agent
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: https://helm.datadoghq.com
    targetRevision: 2.27.2
    helm:
      values: |
        targetSystem: linux
        datadog:
          apiKeyExistingSecret: datadog-secrets
          appKeyExistingSecret: datadog-secrets
          leaderElection: true
          collectEvents: true
          kubeStateMetricsEnabled: false
          kubeStateMetricsCore:
            enabled: true
          logLevel: INFO
          apm:
            enabled: true
          processAgent:
            enabled: true
            processCollection: true
        clusterAgent:
          tokenExistingSecret: datadog-secrets
          enabled: true
          metricsProvider:
            enabled: true
          rbac:
            create: true
        clusterChecks:
          enabled: true
        clusterChecksRunner:
          enabled: true
          replicas: 2
          rbac:
            create: true
        agents:
          rbac:
            create: true
      parameters:
        - name: "datadog.tags[0]"
          value: "env:prd"
        - name: "datadog.tags[1]"
          value: "system:golang"
        - name: "datadog.clusterName"
          value: "golang-cluster"
    chart: datadog
  destination:
    server: 'https://kubernetes.default.svc'
    namespace: datadog
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
---
apiVersion: external-secrets.io/v1alpha1
kind: SecretStore
metadata:
  name: datadog
  namespace: datadog
spec:
  provider:
    aws:
      service: SecretsManager
      region: ap-northeast-1
---
apiVersion: external-secrets.io/v1alpha1
kind: ExternalSecret
metadata:
  name: datadog-secrets
  namespace: datadog
spec:
  refreshInterval: 1m
  secretStoreRef:
    name: datadog
    kind: SecretStore
  target:
    name: datadog-secrets
    creationPolicy: Owner
  data:
    - secretKey: api-key
      remoteRef:
        key: datadog/apikey
    - secretKey: app-key
      remoteRef:
        key: datadog/appkey
    # the secretKey must be "token"
    - secretKey: token
      remoteRef:
        key: datadog/cluster-agent-token
On the Node Agent side, the configuration looks like this.
- name: DD_CLUSTER_AGENT_AUTH_TOKEN
  valueFrom:
    secretKeyRef:
      name: golang-secrets
      key: token
apiVersion: external-secrets.io/v1alpha1
kind: SecretStore
metadata:
  name: golang
  namespace: golang
spec:
  provider:
    aws:
      service: SecretsManager
      region: ap-northeast-1
---
apiVersion: external-secrets.io/v1alpha1
kind: ExternalSecret
metadata:
  name: golang-secrets
  namespace: golang
spec:
  refreshInterval: 1m
  secretStoreRef:
    name: golang
    kind: SecretStore
  target:
    name: golang-secrets
    creationPolicy: Owner
  data:
    - secretKey: datadog_api_key
      remoteRef:
        key: datadog/apikey
    # token
    - secretKey: token
      remoteRef:
        key: datadog/cluster-agent-token
Summary
With EC2-backed EKS, Datadog can be installed easily as a DaemonSet, but a Fargate setup requires thinking through many more details. On top of that, in a mixed EC2/Fargate configuration I believe security group rules and similar concerns also come into play, so this turned out to be unexpectedly hard work.
Miscellaneous
The following commands were used during the investigation:
datadog-cluster-agent status
cat /var/log/datadog/datadog-cluster-agent.log
agent status
cat /var/log/datadog/agent.log
References
- Amazon EKS on AWS Fargate
- Cluster Agent documentation