【Question title】: How can I get details of GKE fluentbit related errors?
【Posted】: 2021-06-07 08:35:45
【Question】:

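We just discovered that some logs are missing from Stackdriver. We can list the log messages with kubectl logs, but for some reason some of them are not sent to Stackdriver Logging.
Example of a missing log entry: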

{"severity":"info","time":"2021-06-07T08:19:17.598Z","caller":"zap/options.go:212","msg":"finished unary call with code OK","grpc.start_time":"2021-06-07T08:19:17Z","system":"grpc","span.kind":"server","grpc.service":"manabie.tom.ChatService","grpc.method":"SendMessage","peer.address":"127.0.0.1:32806","userID":"xxxx","x-request-id":"xxxx","grpc.code":"OK","grpc.time_ms":48.04899978637695}

Checking the fluentbit daemon:

kubectl logs fluentbit-gke-xxxx -c fluentbit-gke -f --tail=1 

I see some warning logs such as:

W0607 08:16:55.066861       1 server.go:77] Received empty or invalid msgpack for tag kube_xxxxxxxx
W0607 08:16:59.072151       1 server.go:77] Received empty or invalid msgpack for tag kube_xxxxxxxx
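
To see whether these warnings are concentrated on particular tags, the exporter output can be summarized per tag. The following is a minimal sketch, assuming every warning follows the exact layout of the two lines above; the function name `count_invalid_msgpack` and the regular expression are illustrative and not part of any GKE tooling.

```python
import re
from collections import Counter

# Matches lines shaped like the two warnings shown above (assumed layout):
# "W" + MMDD, a timestamp, a process id, "file:line]", then the message and tag.
WARNING_RE = re.compile(
    r"^W\d{4} (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+\d+ \S+\] "
    r"Received empty or invalid msgpack for tag (?P<tag>\S+)$"
)


def count_invalid_msgpack(lines):
    """Return a Counter mapping each tag to its number of msgpack warnings."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.match(line.strip())
        if match:
            counts[match.group("tag")] += 1
    return counts


sample = [
    "W0607 08:16:55.066861       1 server.go:77] Received empty or invalid msgpack for tag kube_xxxxxxxx",
    "W0607 08:16:59.072151       1 server.go:77] Received empty or invalid msgpack for tag kube_xxxxxxxx",
]
print(count_invalid_msgpack(sample))
```

The same function can be fed the saved output of the `kubectl logs` command above, one line per list element, to compare warning counts between tags.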

Describing the daemonset:

kubectl describe daemonset fluentbit-gke
Name:           fluentbit-gke
Selector:       component=fluentbit-gke,k8s-app=fluentbit-gke
Node-Selector:  kubernetes.io/os=linux
Labels:         addonmanager.kubernetes.io/mode=Reconcile
                k8s-app=fluentbit-gke
                kubernetes.io/cluster-service=true
Annotations:    deprecated.daemonset.template.generation: 9
Desired Number of Nodes Scheduled: 4
Current Number of Nodes Scheduled: 4
Number of Nodes Scheduled with Up-to-date Pods: 4
Number of Nodes Scheduled with Available Pods: 4
Number of Nodes Misscheduled: 0
Pods Status:  4 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           component=fluentbit-gke
                    k8s-app=fluentbit-gke
                    kubernetes.io/cluster-service=true
  Annotations:      EnableNodeJournal: false
                    EnablePodSecurityPolicy: false
                    SystemOnlyLogging: false
                    components.gke.io/component-name: fluentbit
                    components.gke.io/component-version: 1.4.4
                    monitoring.gke.io/path: /api/v1/metrics/prometheus
  Service Account:  fluentbit-gke
  Containers:
   fluentbit:
    Image:      gke.gcr.io/fluent-bit:v1.5.7-gke.1
    Port:       2020/TCP
    Host Port:  2020/TCP
    Limits:
      memory:  250Mi
    Requests:
      cpu:        50m
      memory:     100Mi
    Liveness:     http-get http://:2020/ delay=120s timeout=1s period=60s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /fluent-bit/etc/ from config-volume (rw)
      /var/lib/docker/containers from varlibdockercontainers (ro)
      /var/lib/kubelet/pods from varlibkubeletpods (rw)
      /var/log from varlog (rw)
      /var/run/google-fluentbit/pos-files from varrun (rw)
   fluentbit-gke:
    Image:      gke.gcr.io/fluent-bit-gke-exporter:v0.16.2-gke.0
    Port:       2021/TCP
    Host Port:  2021/TCP
    Command:
      /fluent-bit-gke-exporter
      --kubernetes-separator=_
      --stackdriver-resource-model=k8s
      --enable-pod-label-discovery
      --pod-label-dot-replacement=_
      --split-stdout-stderr
      --logtostderr
    Limits:
      memory:  250Mi
    Requests:
      cpu:        50m
      memory:     100Mi
    Liveness:     http-get http://:2021/healthz delay=120s timeout=1s period=60s #success=1 #failure=3
    Environment:  <none>
    Mounts:       <none>
  Volumes:
   varrun:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/google-fluentbit/pos-files
    HostPathType:  
   varlog:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:  
   varlibkubeletpods:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/pods
    HostPathType:  
   varlibdockercontainers:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/docker/containers
    HostPathType:  
   config-volume:
    Type:               ConfigMap (a volume populated by a ConfigMap)
    Name:               fluentbit-gke-config-v1.0.6
    Optional:           false
  Priority Class Name:  system-node-critical
Events:                 <none>

【Comments】:

    Tags: kubernetes google-kubernetes-engine fluent-bit


    【Solution 1】:

    You may be running into the Cloud Logging API size limit for some log entries.

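    Fluentbit-gke stores its own logs in /var/log/fluentbit.log on each node, and these logs are not exported to Cloud Logging. That directory is a hostPath volume that mounts /var/log from the host node's filesystem into the Pod, so the log files can be read from the host itself. If you need these logs, fetch the fluentbit logs from the node and share a copy: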

    $ kubectl get nodes
    $ gcloud compute scp <node_name>:/var/log/fluentbit.log* ./
    

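    Unlike Fluentd, Fluentbit in GKE 1.17 currently has a maximum single log entry size of 32K. As a result, fluentbit drops user log entries larger than 32K, and they are never exported to Cloud Logging. On GKE 1.18 clusters, the maximum size of a single log entry was raised to 1MB. That is the size fluentbit will ingest; however, fluentbit trims the entry to 200KB before writing it to Cloud Logging, to leave room for the additional metadata that is added to the entry. This is because the Cloud Logging API limits the size of a log entry to 256 KB.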
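
To rule the size limits in or out for a specific missing entry, its byte size can be compared against the numbers quoted in this answer. The sketch below is only an approximation: the function name `classify_entry` is hypothetical, the limits are assumed to be binary units (32 * 1024 bytes and so on), and the size is measured on the raw line before any metadata is added.

```python
# Limits as quoted in the answer; binary units are an assumption.
INGEST_LIMIT_GKE_1_17 = 32 * 1024      # larger entries are dropped on GKE 1.17
INGEST_LIMIT_GKE_1_18 = 1024 * 1024    # larger entries are not ingested on GKE 1.18
TRUNCATE_AT = 200 * 1024               # ingested entries are trimmed to this size


def classify_entry(line: str, gke_minor: int = 18) -> str:
    """Return 'dropped', 'truncated' or 'ok' for one raw log line."""
    size = len(line.encode("utf-8"))
    ingest_limit = INGEST_LIMIT_GKE_1_18 if gke_minor >= 18 else INGEST_LIMIT_GKE_1_17
    if size > ingest_limit:
        return "dropped"
    if size > TRUNCATE_AT:
        return "truncated"
    return "ok"


# A short entry, similar in size to the example in the question.
short_entry = '{"severity":"info","msg":"finished unary call with code OK"}'
print(classify_entry(short_entry, gke_minor=17))
```

An entry that comes back as `ok` here is well under every limit mentioned above, which would point to a cause other than entry size.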

    【Discussion】:

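    • Unfortunately, I don't have access to inspect the logs on the GKE (managed cluster) nodes, but I don't think size is the cause here, because the missing messages are fairly short (more or less like the one in my example). I can also find much longer messages in the search results, for example error logs with stack traces.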