Kubernetes HTTP health check not working as expected - 500 response is ignored

【Question title】: Kubernetes http health check not working as expected - 500 response is ignored
【Posted】: 2021-05-09 20:50:29
【Question description】:

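I've implemented an http health check and a separate http liveness check for my pods. In both cases, Kubernetes behaves as expected if my pod delays before responding. However, when the endpoints respond immediately with status 500, Kubernetes treats that as a successful response. This happens after the pod has come up and is running OK, and only then does the check start returning status 500.

In fact, I see that returning status 500 actually resets the failure count, so it causes my pod to be considered healthy again.

Am I doing something wrong? How can I get Kubernetes to do its job when my pod is unhealthy?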

$ k version
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-18T12:00:47Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

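To investigate, I added test endpoints to my pod so that I can change its behaviour at run time: pass (200), fail (500), delayed fail (wait 15 sec, then return 500). The health and liveness endpoints are controlled separately.

From describe pod: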

Liveness:   exec [curl http://localhost:30030/livez] delay=10s timeout=1s period=10s #success=1 #failure=5
Readiness:  exec [curl http://localhost:30030/healthz] delay=10s timeout=1s period=10s #success=1 #failure=3

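I tested the endpoints by exec-ing into the pod and curling them from there (details below).
I then cycled the liveness check and the health check through the 3 modes and monitored how Kubernetes responded.
Health check: expected the pod to be restarted after 5 consecutive failed health checks.
Liveness check: described the service and expected the pod's IP address to be removed from the endpoints list.

Success case: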

bash-4.4$ curl http://localhost:30030/unfailhealth
unfailhealth: REMOVE force all health checks to fail, was failHealth=false, delayFailHealth=false

bash-4.4$ curl http://localhost:30030/healthz -v
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 30030 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:30030
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 200 OK
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 3
< ETag: W/"3-CftlTBfMBbEe9TvTWqcB9tVQ6OE"
< Date: Fri, 05 Feb 2021 13:30:59 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
<
OK
* Connection #0 to host localhost left intact

Failure case:

bash-4.4$ curl http://localhost:30030/failhealth
failhealth: force all health checks to fail, was failHealth=true, delayFailHealth=false

bash-4.4$ curl http://localhost:30030/healthz -v
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 30030 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:30030
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 26
< ETag: W/"1a-yI5D4Rtao1KH34GZVYKKvxZoEVo"
< Date: Fri, 05 Feb 2021 13:29:14 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
<
FAKE HEALTH CHECK FAILURE
* Connection #0 to host localhost left intact

Delayed failure case:

bash-4.4$ curl http://localhost:30030/delayfailhealth
delayfailhealth: force all health checks to sleep 15sec, then fail, was failHealth=false, delayFailHealth=true

bash-4.4$ date; curl http://localhost:30030/healthz -v
Fri Feb  5 13:33:08 UTC 2021
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 30030 (#0)
> GET /healthz HTTP/1.1
> Host: localhost:30030
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< X-Powered-By: Express
< Content-Type: text/html; charset=utf-8
< Content-Length: 47
< ETag: W/"2f-n+Ix8oU/09OT9+cpPVm1/EejE9Y"
< Date: Fri, 05 Feb 2021 13:33:23 GMT
< Connection: keep-alive
< Keep-Alive: timeout=5
<
FAKE HEALTH CHECK FAILURE - AFTER 15 SEC DELAY
* Connection #0 to host localhost left intact

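Test results

With the health and liveness endpoints defaulting to SUCCESS and returning status 200 -> the pod comes up and works OK.

Set the liveness check to FAIL, returning status 500 -> no change; the pod IP stays in the service and requests are still dispatched to the pod.
Set the liveness check to DELAY before responding (then 500) -> the pod is removed from the Kubernetes service (yippee).
Set the liveness check back to (fast) FAIL -> the pod is restored to the service (treated as success).

Set the health check to FAIL (returning status 500) -> no effect; the pod keeps running with no restart.
Set the health check to DELAY before responding (then 500) -> the pod is restarted after 5 failed probes.

Thanks for any help on this. I guess I could change my code to delay before responding in the failure case, but that seems like a workaround.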

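【Comments】:

(a) Pod liveness and readiness probes have httpGet: available, which avoids the need to spawn curl for that action and so avoids simple errors such as (b) running curl without -f, which causes it to exit 0 no matter what the server response code was. (c) This is not a programming question and thus belongs on ServerFault.com.
(a), (b) understood - testing now. (c) I followed the directions in the Kubernetes docs at kubernetes.io/docs/tasks/debug-application-cluster/... > The Kubernetes team will also monitor posts tagged Kubernetes. If there aren't any existing questions that help, please ask a new one!

Tags: kubernetes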

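【Solution 1】:

Thanks to the comment from @mdaniel, the problem is solved. I'm expanding on it here because it took me a while to fully understand the comment.

The problem was in the readiness and liveness check configuration in the pod spec.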

        readinessProbe:
          exec:
            command:
            - curl
            - http://localhost:30030/healthz
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1

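An exec probe passes or fails on the exit code of its command, here curl.
curl exits with code 0 whenever it receives an HTTP response, even a 500, so the probe counted as a success. The delayed response failed only because the probe ran past its 1-second timeout. If you want to use curl, use curl -f; it then exits with a non-zero code when the server returns an HTTP error.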
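
For anyone who has to keep the exec probe, a minimal sketch of the corrected readiness probe might look like the following. It reuses the values from the spec above; the only change is the added curl flags, and -s is included on the assumption that you also want curl's progress output suppressed.

```yaml
readinessProbe:
  exec:
    command:
    - curl
    # -f: exit non-zero when the server answers with an HTTP error (status >= 400)
    # -s: silent mode, no progress meter in the probe output (assumed preference)
    - "-sf"
    - http://localhost:30030/healthz
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
```

With -f, a 500 from /healthz makes curl exit with code 22, which the probe records as a failure.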

But it is better to use httpGet in the pod spec, like this:

        readinessProbe:
          httpGet:
            path: /healthz
            port: 30030
            scheme: HTTP
          failureThreshold: 3
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1

I tested both and both work. I'll go with httpGet as suggested - the right tool for the job.
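
The liveness probe needs the same change. As a sketch, assuming the path, port, and thresholds shown in the describe pod output in the question (/livez, 5 failures), it would look roughly like this:

```yaml
livenessProbe:
  httpGet:
    path: /livez        # liveness endpoint from the describe pod output
    port: 30030
    scheme: HTTP
  failureThreshold: 5   # matches "#failure=5" in the describe pod output
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
```

With httpGet, the kubelet treats any status from 200 up to but not including 400 as success and anything else as failure, so an immediate 500 counts against failureThreshold.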

Note that the reason for using exec/curl instead of httpGet in the first place was that the pod uses TLS, which blocks plain http from Kubernetes. Ref. https://medium.com/cloud-native-the-gathering/kubernetes-liveness-probe-for-scratch-image-with-istio-mtls-enabled-90543e4bae34
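
If the container itself serves TLS on the probed port, httpGet may still be an option. The sketch below is an assumption rather than something tested in this thread: it supposes that port 30030 accepts HTTPS and does not demand a client certificate.

```yaml
readinessProbe:
  httpGet:
    path: /healthz
    port: 30030         # assumed here to serve HTTPS; not confirmed in this thread
    scheme: HTTPS       # the kubelet skips certificate verification for HTTPS probes
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
```

This does not help when the server requires a client certificate, as with strict mutual TLS; in that situation the exec probe with curl -sf remains the fallback.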

Thanks!

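【Discussion】:

I'm glad it was that simple; note that you can accept your own answer to show that it solved your problem :-) Those who do stick with curl -f will also benefit from using curl -sf, which keeps curl's chatty reply out of the output.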