【Question Title】: k8s: Liveness and Readiness probes failing on a multi-container pod
【Posted】: 2021-11-12 13:45:45
【Question】:

I have a multi-container pod running on AWS EKS: a web-app container listening on port 80 and a Redis container listening on port 6379.

Once deployed, manually curling the pod's IP:port from inside the cluster always gets a good response.
The service's ingress is fine as well.

However, the kubelet's probes fail and trigger restarts, and I'm not sure how to reproduce the probe failure or fix it.
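For what it's worth, the kubelet's failure mode can be reproduced locally by pairing a deliberately slow server with curl's `--max-time`. A sketch (the Python one-liner server, port 8080, and the 2-second delay are invented for illustration; `--max-time 1` mirrors the probe's `timeoutSeconds: 1`):

```shell
# Start a throwaway HTTP server that takes 2s to answer (slower than the probe timeout).
python3 -c '
import http.server, time
class Slow(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(2)                       # respond slower than the 1s probe timeout
        self.send_response(200); self.end_headers(); self.wfile.write(b"ok")
    def log_message(self, *args): pass      # keep the output quiet
http.server.HTTPServer(("127.0.0.1", 8080), Slow).serve_forever()' &
srv=$!
sleep 1

# Probe it the way the kubelet does: HTTP GET with a 1-second client timeout.
curl -sS -o /dev/null --max-time 1 http://127.0.0.1:8080/
echo "curl exit code: $?"                   # 28 = operation timed out, the same
                                            # condition the probe events report
kill "$srv"
```

An endpoint that is healthy but occasionally slower than the timeout fails in exactly this way, which matches "request canceled (Client.Timeout exceeded while awaiting headers)".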

Thanks for reading!

Here are the events:

0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Normal    Killing                  pod/app-7cddfb865b-gsvbg                                   Container app failed liveness probe, will be restarted
0s          Normal    Pulling                  pod/app-7cddfb865b-gsvbg                                   Pulling image "registry/app:latest"
0s          Normal    Pulled                   pod/app-7cddfb865b-gsvbg                                   Successfully pulled image "registry/app:latest"
0s          Normal    Created                  pod/app-7cddfb865b-gsvbg                                   Created container app

To keep things generic, here is my deployment yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "16"
  creationTimestamp: "2021-05-26T22:01:19Z"
  generation: 19
  labels:
    app: app
    chart: app-1.0.0
    environment: production
    heritage: Helm
    owner: acme
    release: app
  name: app
  namespace: default
  resourceVersion: "234691173"
  selfLink: /apis/apps/v1/namespaces/default/deployments/app
  uid: 3149acc2-031e-4719-89e6-abafb0bcdc3c
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: app
      release: app
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 100%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2021-09-17T09:04:49-07:00"
      creationTimestamp: null
      labels:
        app: app
        environment: production
        owner: acme
        release: app
    spec:
      containers:
        - image: redis:5.0.6-alpine
          imagePullPolicy: IfNotPresent
          name: redis
          ports:
            - containerPort: 6379
              hostPort: 6379
              name: redis
              protocol: TCP
          resources:
            limits:
              cpu: 500m
              memory: 500Mi
            requests:
              cpu: 500m
              memory: 500Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
        - env:
            - name: SYSTEM_ENVIRONMENT
              value: production
          envFrom:
            - configMapRef:
                name: app-production
            - secretRef:
                name: app-production
          image: registry/app:latest
          imagePullPolicy: Always
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 80
              scheme: HTTP
            initialDelaySeconds: 90
            periodSeconds: 20
            successThreshold: 1
            timeoutSeconds: 1
          name: app
          ports:
            - containerPort: 80
              hostPort: 80
              name: app
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 80
              scheme: HTTP
            initialDelaySeconds: 90
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              cpu: "1"
              memory: 500Mi
            requests:
              cpu: "1"
              memory: 500Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      priorityClassName: critical-app
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 1
  conditions:
    - lastTransitionTime: "2021-08-10T17:34:18Z"
      lastUpdateTime: "2021-08-10T17:34:18Z"
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: "True"
      type: Available
    - lastTransitionTime: "2021-05-26T22:01:19Z"
      lastUpdateTime: "2021-09-17T16:48:54Z"
      message: ReplicaSet "app-7f7cb8fd4" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: "True"
      type: Progressing
  observedGeneration: 19
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

Here is my service yaml:

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2021-05-05T20:11:33Z"
  labels:
    app: app
    chart: app-1.0.0
    environment: production
    heritage: Helm
    owner: acme
    release: app
  name: app
  namespace: default
  resourceVersion: "163989104"
  selfLink: /api/v1/namespaces/default/services/app
  uid: 1f54cd2f-b978-485e-a1af-984ffeeb7db0
spec:
  clusterIP: 172.20.184.161
  externalTrafficPolicy: Cluster
  ports:
    - name: http
      nodePort: 32648
      port: 80
      protocol: TCP
      targetPort: 80
  selector:
    app: app
    release: app
  sessionAffinity: None
  type: NodePort
status:
  loadBalancer: {}

Update, October 20, 2021:

So I took the advice and patched the readiness probe with these generous settings:

readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /
    port: 80
    scheme: HTTP
  initialDelaySeconds: 300
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 10
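As an aside, a probe change like this can be applied without round-tripping the whole manifest, via a strategic-merge patch. A sketch (the file name `probe-patch.yaml` is made up; the container name `app` comes from the deployment above), applied with `kubectl patch deployment app --patch-file probe-patch.yaml`:

```yaml
# probe-patch.yaml (hypothetical file name). Strategic merge matches list
# entries in `containers` by name, so only the app container's
# readinessProbe is replaced; the redis container is untouched.
spec:
  template:
    spec:
      containers:
        - name: app
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 80
              scheme: HTTP
            initialDelaySeconds: 300
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 10
```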

These are the events:

5m21s       Normal    Scheduled                pod/app-686494b58b-6cjsq                                   Successfully assigned default/app-686494b58b-6cjsq to ip-10-10-14-127.compute.internal
5m20s       Normal    Created                  pod/app-686494b58b-6cjsq                                   Created container redis
5m20s       Normal    Started                  pod/app-686494b58b-6cjsq                                   Started container redis
5m20s       Normal    Pulling                  pod/app-686494b58b-6cjsq                                   Pulling image "registry/app:latest"
5m20s       Normal    Pulled                   pod/app-686494b58b-6cjsq                                   Successfully pulled image "registry/app:latest"
5m20s       Normal    Created                  pod/app-686494b58b-6cjsq                                   Created container app
5m20s       Normal    Pulled                   pod/app-686494b58b-6cjsq                                   Container image "redis:5.0.6-alpine" already present on machine
5m19s       Normal    Started                  pod/app-686494b58b-6cjsq                                   Started container app
0s          Warning   Unhealthy                pod/app-686494b58b-6cjsq                                   Readiness probe failed: Get http://10.10.14.117:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

What's odd is that when I manually request the health-check page (the root page), the readiness probe starts passing. But even then, the probes aren't failing because the containers are unhealthy (they are healthy); the problem lies elsewhere.
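One way to check whether "elsewhere" is intermittent latency is to probe repeatedly rather than once. A sketch, run from a pod inside the cluster (the IP is the one from the events above; 100 requests and the tail of 5 are arbitrary choices):

```shell
# Hit the health endpoint 100 times with the same 10s ceiling as the
# patched probe, then print the five slowest response times. A single
# fast manual curl can easily miss an occasional multi-second response
# that would still fail a probe with a 1s timeout.
for i in $(seq 1 100); do
  curl -s -o /dev/null -w '%{time_total}\n' --max-time 10 http://10.10.14.117:80/
done | sort -n | tail -n 5
```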

【Comments on the Question】:

    Tags: kubernetes amazon-eks


    【Solution 1】:

    Let's go over your probes so you can understand what is happening and perhaps find a fix:

    
    ### Readiness probe - "waiting" for the container to be ready
    ### to get to work.
    ###
    
    ### Liveness is executed once the pod is running which means that
    ### you have passed the readinessProbe so you might want to start
    ### with the readinessProbe first
    
    
    livenessProbe:

      ### Define how many failed attempts are allowed before restarting
      ### the pod. Try increasing this number, and once your pod stops
      ### restarting, reduce it back to a lower value.
      failureThreshold: 3
      httpGet:
        path: /
        port: 80
        scheme: HTTP

      ### Delay before executing the first test.
      ### As before - try increasing the delay and reduce it
      ### back once you have figured out the correct value.
      initialDelaySeconds: 90

      ### How often (in seconds) to perform the test.
      periodSeconds: 20
      successThreshold: 1

      ### Number of seconds after which the probe times out.
      ### Since the value is 1, I assume you did not change it.
      ### Same as before - increase the value, then work back
      ### down to what your app actually needs.
      timeoutSeconds: 1
    
    
    ### Same comments as above + `initialDelaySeconds`
    ### Readiness is "waiting" for the container to be ready to
    ### get to work.
    
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /
        port: 80
        scheme: HTTP
    
      ### Again, nothing new here, same comments to increase the value
      ### and then reduce it until you figure out what is desired value
      ### for this probe
      initialDelaySeconds: 90
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    


    Check the logs/events

    • If you're not sure the probes themselves are the root cause, review the logs and events to find what is actually causing these failures.
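    A few concrete ways to do that (the pod name is the one from the events above; `kubectl top` additionally requires metrics-server to be installed in the cluster):

```shell
# Probe configuration, container state, and recent events for the pod
kubectl describe pod app-7cddfb865b-gsvbg

# All events for this pod, oldest first
kubectl get events --field-selector involvedObject.name=app-7cddfb865b-gsvbg \
  --sort-by=.lastTimestamp

# Logs from the app container as it was before the liveness restart
kubectl logs app-7cddfb865b-gsvbg -c app --previous

# Current CPU/memory usage vs. the 1-CPU / 500Mi limits in the deployment
kubectl top pod app-7cddfb865b-gsvbg --containers
```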

    【Discussion】:

    • I've updated my post under the "Update, October 20, 2021" section. Thanks.
    • You led me down the right path. I raised timeoutSeconds to 30 seconds and the readiness and liveness probes no longer fail. It's an extreme setting, though: a manual curl from inside the cluster returns the health-check page in 22 ms.
    • Still a problem :(. The readiness probe failed again later, long after the container was already running and had started receiving traffic.
    • Did you ever find an answer? I'm here with the same problem. My service had been running since October 2020, but recently the readiness probe started failing some time after the service was up and running. `kubectl describe pod` shows "Readiness probe failed". I also have unrealistically large timeouts, etc. A new deployment comes up, the service is recognized as ready, and the ingress shifts traffic to the new deployment; a few minutes later the "Readiness probe failed" message appears and the service restarts.