Prometheus/Grafana alerting on a pod stuck in Pending state

【Question title】: Prometheus/Grafana alerting on a pod stuck in Pending state
【Posted】: 2020-12-19 17:16:34
【Question】:

I am new to running Prometheus and Grafana. I want to create an alert that fires when a Kubernetes pod has been in the Pending state for more than 15 minutes. The PromQL query I am using is:

kube_pod_status_phase{exported_namespace="mynamespace", phase="Pending"} > 0

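What I can't figure out is how to build the alert around how long the pod has been in that state. I tried several permutations of the alert condition in Grafana, such as:

WHEN avg() OF query(A, 15m, now) IS ABOVE 1

All of them trigger the alert based on the number of pods in the state rather than on the duration.

How do I build an alert based on time spent in the state?

Thanks in advance.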

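【Comments】:

There is an alertmanager rule resource that includes a rule for pending PVCs, which looks close to what you want, but I don't have a Grafana instance handy to translate it into your syntax.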

Tags: kubernetes prometheus grafana promql grafana-alerts

【Solution 1】:

- alert: KubernetesPodNotHealthy
  expr: min_over_time(sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[15m:1m]) > 0
  for: 0m
  labels:
    severity: critical
  annotations:
    summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
    description: "Pod has been in a non-ready state for longer than 15 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
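
In the rule above, the subquery `[15m:1m]` re-evaluates the inner `sum by (namespace, pod)` once per minute across the last 15 minutes, and `min_over_time` keeps the smallest of those results. The expression is therefore above 0 only if the pod was in one of the listed phases at every step, which is what makes the alert depend on duration rather than on pod count.

A possible alternative, sketched here under the assumption that the rule is evaluated by Prometheus itself rather than by Grafana, is to keep the query from the question and let the `for` clause track the duration. The group name, alert name, and severity below are hypothetical, and the `exported_namespace="mynamespace"` selector is copied from the question; label names depend on the scrape configuration.

```yaml
groups:
  - name: pod-pending-example            # hypothetical group name
    rules:
      - alert: KubernetesPodPendingTooLong   # hypothetical alert name
        # kube_pod_status_phase is 1 for the phase the pod is currently in, 0 otherwise
        expr: kube_pod_status_phase{exported_namespace="mynamespace", phase="Pending"} > 0
        # fire only after the expression has matched on every evaluation for 15 minutes
        for: 15m
        labels:
          severity: warning              # assumed severity, adjust as needed
        annotations:
          summary: Pod stuck in Pending state
          description: "Pod has been Pending for longer than 15 minutes.\n  LABELS = {{ $labels }}"
```

With `for`, the pending timer resets whenever the expression stops matching for a single evaluation, whereas the subquery form looks back over recorded samples and fires as soon as 15 minutes of history satisfy the condition.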

【Discussion】: