【Question Title】: k8s: Liveness and Readiness probes failing on a multi-container pod
【Posted】: 2021-11-12 13:45:45
【Question】:

I have a multi-container pod running on AWS EKS: a web-app container listening on port 80 and a Redis container listening on port 6379.

Once deployed, manually curling the pod's IP:port from inside the cluster always gets a good response.
The service's ingress is fine as well.

However, the kubelet's probes fail and trigger restarts, and I'm not sure how to reproduce the probe failure or fix it.
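For what it's worth, the kubelet's failure mode can be reproduced locally by pairing a deliberately slow server with curl's `--max-time`. A sketch (the Python one-liner server, port 8080, and the 2-second delay are invented for illustration; `--max-time 1` mirrors the probe's `timeoutSeconds: 1`):

```shell
# Start a throwaway HTTP server that takes 2s to answer (slower than the probe timeout).
python3 -c '
import http.server, time
class Slow(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(2)                       # respond slower than the 1s probe timeout
        self.send_response(200); self.end_headers(); self.wfile.write(b"ok")
    def log_message(self, *args): pass      # keep the output quiet
http.server.HTTPServer(("127.0.0.1", 8080), Slow).serve_forever()' &
srv=$!
sleep 1

# Probe it the way the kubelet does: HTTP GET with a 1-second client timeout.
curl -sS -o /dev/null --max-time 1 http://127.0.0.1:8080/
echo "curl exit code: $?"                   # 28 = operation timed out, the same
                                            # condition the probe events report
kill "$srv"
```

An endpoint that is healthy but occasionally slower than the timeout fails in exactly this way, which matches "request canceled (Client.Timeout exceeded while awaiting headers)".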

Thanks for reading!

Here are the events:

0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Readiness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Warning   Unhealthy                pod/app-7cddfb865b-gsvbg                                   Liveness probe failed: Get http://10.10.14.199:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
0s          Normal    Killing                  pod/app-7cddfb865b-gsvbg                                   Container app failed liveness probe, will be restarted
0s          Normal    Pulling                  pod/app-7cddfb865b-gsvbg                                   Pulling image "registry/app:latest"
0s          Normal    Pulled                   pod/app-7cddfb865b-gsvbg                                   Successfully pulled image "registry/app:latest"
0s          Normal    Created                  pod/app-7cddfb865b-gsvbg                                   Created container app

To keep things generic, here is my deployment yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "16"
  creationTimestamp: "2021-05-26T22:01:19Z"
  generation: 19
  labels:
    app: app
    chart: app-1.0.0
    environment: production
    heritage: Helm
    owner: acme
    release: app
  name: app
  namespace: default
  resourceVersion: "234691173"
  selfLink: /apis/apps/v1/namespaces/default/deployments/app
  uid: 3149acc2-031e-4719-89e6-abafb0bcdc3c
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: app
      release: app
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 100%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/restartedAt: "2021-09-17T09:04:49-07:00"
      creationTimestamp: null
      labels:
        app: app
        environment: production
        owner: acme
        release: app
    spec:
      containers:
        - image: redis:5.0.6-alpine
          imagePullPolicy: IfNotPresent
          name: redis
          ports:
            - containerPort: 6379
              hostPort: 6379
              name: redis
              protocol: TCP
          resources:
            limits:
              cpu: 500m
              memory: 500Mi
            requests:
              cpu: 500m
              memory: 500Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
        - env:
            - name: SYSTEM_ENVIRONMENT
              value: production
          envFrom:
            - configMapRef:
                name: app-production
            - secretRef:
                name: app-production
          image: registry/app:latest
          imagePullPolicy: Always
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 80
              scheme: HTTP
            initialDelaySeconds: 90
            periodSeconds: 20
            successThreshold: 1
            timeoutSeconds: 1
          name: app
          ports:
            - containerPort: 80
              hostPort: 80
              name: app
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 80
              scheme: HTTP
            initialDelaySeconds: 90
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              cpu: "1"
              memory: 500Mi
            requests:
              cpu: "1"
              memory: 500Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      priorityClassName: critical-app
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 1
  conditions:
    - lastTransitionTime: "2021-08-10T17:34:18Z"
      lastUpdateTime: "2021-08-10T17:34:18Z"
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: "True"
      type: Available
    - lastTransitionTime: "2021-05-26T22:01:19Z"
      lastUpdateTime: "2021-09-17T16:48:54Z"
      message: ReplicaSet "app-7f7cb8fd4" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: "True"
      type: Progressing
  observedGeneration: 19
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

Here is my service yaml:

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: "2021-05-05T20:11:33Z"
  labels:
    app: app
    chart: app-1.0.0
    environment: production
    heritage: Helm
    owner: acme
    release: app
  name: app
  namespace: default
  resourceVersion: "163989104"
  selfLink: /api/v1/namespaces/default/services/app
  uid: 1f54cd2f-b978-485e-a1af-984ffeeb7db0
spec:
  clusterIP: 172.20.184.161
  externalTrafficPolicy: Cluster
  ports:
    - name: http
      nodePort: 32648
      port: 80
      protocol: TCP
      targetPort: 80
  selector:
    app: app
    release: app
  sessionAffinity: None
  type: NodePort
status:
  loadBalancer: {}

Update, October 20, 2021:

So I took the advice and patched the readiness probe with these generous settings:

readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /
    port: 80
    scheme: HTTP
  initialDelaySeconds: 300
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 10
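As an aside, a probe change like this can be applied without round-tripping the whole manifest, via a strategic-merge patch. A sketch (the file name `probe-patch.yaml` is made up; the container name `app` comes from the deployment above), applied with `kubectl patch deployment app --patch-file probe-patch.yaml`:

```yaml
# probe-patch.yaml (hypothetical file name). Strategic merge matches list
# entries in `containers` by name, so only the app container's
# readinessProbe is replaced; the redis container is untouched.
spec:
  template:
    spec:
      containers:
        - name: app
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /
              port: 80
              scheme: HTTP
            initialDelaySeconds: 300
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 10
```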

These are the events:

5m21s       Normal    Scheduled                pod/app-686494b58b-6cjsq                                   Successfully assigned default/app-686494b58b-6cjsq to ip-10-10-14-127.compute.internal
5m20s       Normal    Created                  pod/app-686494b58b-6cjsq                                   Created container redis
5m20s       Normal    Started                  pod/app-686494b58b-6cjsq                                   Started container redis
5m20s       Normal    Pulling                  pod/app-686494b58b-6cjsq                                   Pulling image "registry/app:latest"
5m20s       Normal    Pulled                   pod/app-686494b58b-6cjsq                                   Successfully pulled image "registry/app:latest"
5m20s       Normal    Created                  pod/app-686494b58b-6cjsq                                   Created container app
5m20s       Normal    Pulled                   pod/app-686494b58b-6cjsq                                   Container image "redis:5.0.6-alpine" already present on machine
5m19s       Normal    Started                  pod/app-686494b58b-6cjsq                                   Started container app
0s          Warning   Unhealthy                pod/app-686494b58b-6cjsq                                   Readiness probe failed: Get http://10.10.14.117:80/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

What's odd is that when I manually request the health-check page (the root page), the readiness probe starts passing. But even then, the probes aren't failing because the containers are unhealthy (they are healthy); the problem lies elsewhere.
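One way to check whether "elsewhere" is intermittent latency is to probe repeatedly rather than once. A sketch, run from a pod inside the cluster (the IP is the one from the events above; 100 requests and the tail of 5 are arbitrary choices):

```shell
# Hit the health endpoint 100 times with the same 10s ceiling as the
# patched probe, then print the five slowest response times. A single
# fast manual curl can easily miss an occasional multi-second response
# that would still fail a probe with a 1s timeout.
for i in $(seq 1 100); do
  curl -s -o /dev/null -w '%{time_total}\n' --max-time 10 http://10.10.14.117:80/
done | sort -n | tail -n 5
```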

【Comments on the Question】:

    Tags: kubernetes amazon-eks


    【Solution 1】:

    Let's go over your probes so you can understand what is happening and perhaps find a fix:

    
    ### Readiness probe - "waiting" for the container to be ready
    ### to get to work.
    ###
    
    ### Liveness is executed once the pod is running which means that
    ### you have passed the readinessProbe so you might want to start
    ### with the readinessProbe first
    
    
    livenessProbe:

      ### Define how many failed attempts are allowed before restarting
      ### the pod. Try increasing this number, and once your pod stops
      ### restarting, reduce it back to a lower value.
      failureThreshold: 3
      httpGet:
        path: /
        port: 80
        scheme: HTTP

      ### Delay before executing the first test.
      ### As before - try increasing the delay and reduce it
      ### back once you have figured out the correct value.
      initialDelaySeconds: 90

      ### How often (in seconds) to perform the test.
      periodSeconds: 20
      successThreshold: 1

      ### Number of seconds after which the probe times out.
      ### Since the value is 1, I assume you did not change it.
      ### Same as before - increase the value, then work back
      ### down to what your app actually needs.
      timeoutSeconds: 1
    
    
    ### Same comments as above + `initialDelaySeconds`
    ### Readiness is "waiting" for the container to be ready to
    ### get to work.
    
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /
        port: 80
        scheme: HTTP
    
      ### Again, nothing new here, same comments to increase the value
      ### and then reduce it until you figure out what is desired value
      ### for this probe
      initialDelaySeconds: 90
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    


    Check the logs/events

    • If you're not sure the probes themselves are the root cause, review the logs and events to find what is actually causing these failures.
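    A few concrete ways to do that (the pod name is the one from the events above; `kubectl top` additionally requires metrics-server to be installed in the cluster):

```shell
# Probe configuration, container state, and recent events for the pod
kubectl describe pod app-7cddfb865b-gsvbg

# All events for this pod, oldest first
kubectl get events --field-selector involvedObject.name=app-7cddfb865b-gsvbg \
  --sort-by=.lastTimestamp

# Logs from the app container as it was before the liveness restart
kubectl logs app-7cddfb865b-gsvbg -c app --previous

# Current CPU/memory usage vs. the 1-CPU / 500Mi limits in the deployment
kubectl top pod app-7cddfb865b-gsvbg --containers
```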

    【Discussion】:

    • I've updated my post under the "Update, October 20, 2021" section. Thanks.
    • You led me down the right path. I raised timeoutSeconds to 30 seconds and the readiness and liveness probes no longer fail. It's an extreme setting, though: a manual curl from inside the cluster returns the health-check page in 22 ms.
    • Still a problem :(. The readiness probe failed again later, long after the container was already running and had started receiving traffic.
    • Did you ever find an answer? I'm here with the same problem. My service had been running since October 2020, but recently the readiness probe started failing some time after the service was up and running. `kubectl describe pod` shows "Readiness probe failed". I also have unrealistically large timeouts, etc. A new deployment comes up, the service is recognized as ready, and the ingress shifts traffic to the new deployment; a few minutes later the "Readiness probe failed" message appears and the service restarts.