How to silence Prometheus Alertmanager using config files? (Answers)

【Question title】: How to silence Prometheus Alertmanager using config files?
【Posted】: 2019-07-15 07:45:21
【Question】:

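I am using the official stable/prometheus-operator chart to deploy Prometheus with helm.

It has worked well so far, except for the annoying CPUThrottlingHigh alert, which fires for many pods (including Prometheus's own config-reloader containers). This alert is currently under discussion, and I want to silence its notifications for now.

Alertmanager has a silence feature, but it is web-based:

Silences are a straightforward way to simply mute alerts for a given time. Silences are configured in the web interface of the Alertmanager.

Is there a way to mute notifications from CPUThrottlingHigh using a config file?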

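【Comments】:

stackoverflow.com/questions/53277194/…
@c4f4t0r Thanks, I have read about what the cfs and throttle metrics mean, but the alert itself and its threshold are still under discussion and opinions differ... For now, I just want to silence it without relying on the Alertmanager web interface.
Remove the rule from the Prometheus config.
@c4f4t0r The prometheus-operator chart imports the k8s rules/alerts from kubernetes-mixin. There is no proper way to disable only the CPUThrottlingHigh rule; it is all or nothing (via the defaultRules.rules.k8s helm config parameter).

Tags: kubernetes prometheus prometheus-alertmanager prometheus-operator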

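【Solution 1】:

I doubt there is a way to silence alerts through configuration (other than routing said alerts to a /dev/null receiver, i.e. one with no email or any other notification mechanism configured, but the alert would still show up in the Alertmanager UI).

You can apparently use amtool, the command line tool that comes with Alertmanager, to add a silence (although I can't see a way to set an expiration time for the silence).

Or you can use the API directly (even though it is not documented and could, in theory, change). According to this prometheus-users thread, this should work: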

curl https://alertmanager/api/v1/silences -d '{
  "matchers": [
    {
      "name": "alername1",
      "value": ".*",
      "isRegex": true
    }
  ],
  "startsAt": "2018-10-25T22:12:33.533330795Z",
  "endsAt": "2018-10-25T23:11:44.603Z",
  "createdBy": "api",
  "comment": "Silence",
  "status": {
    "state": "active"
  }
}'

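【Comments】:

Thanks for the tip. Routing to /dev/null doesn't work for me because I group all firing alerts together to receive them in a single Slack message (e.g. this). I created a hackish inhibit_rule to manage it through the config file. Please read my answer and, if you can, give me your thoughts :)

【Solution 2】:

Well, I managed it by configuring a hackish inhibit_rule: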

inhibit_rules:
- target_match:
    alertname: 'CPUThrottlingHigh'
  source_match:
    alertname: 'DeadMansSwitch'
  equal: ['prometheus']

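DeadMansSwitch is, by design, an "always firing" alert shipped with prometheus-operator, and the prometheus label is common to all alerts, so CPUThrottlingHigh ends up inhibited forever. It stinks, but it works.

Pros:

This can be done through config files (using the alertmanager.config helm parameter).
The CPUThrottlingHigh alert is still present in Prometheus for analysis.
The CPUThrottlingHigh alert only shows up in the Alertmanager UI if the "Inhibited" box is checked.
No annoying notifications on my receivers.

Cons:

Any change to DeadMansSwitch or to the design of the prometheus label will break this (which only means the alert will fire again).

UPDATE: My con became a reality...

The DeadMansSwitch alertname just changed in stable/prometheus-operator 4.0.0. If you use this version (or later), the new alert name is Watchdog.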
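
For chart versions 4.0.0 and later, the same inhibit rule would presumably only need the new source alert name. Here is a minimal sketch of how it might look in a Helm values file. It assumes the rule sits under the alertmanager.config parameter mentioned in the pros, and that the rest of the Alertmanager configuration (route, receivers) is defined alongside it:

```yaml
# values.yaml fragment for stable/prometheus-operator >= 4.0.0 (sketch only).
alertmanager:
  config:
    inhibit_rules:
      # Mute CPUThrottlingHigh while the always-firing Watchdog alert is
      # active with the same value for the `prometheus` label.
      - source_match:
          alertname: 'Watchdog'
        target_match:
          alertname: 'CPUThrottlingHigh'
        equal: ['prometheus']
```

On chart versions before 4.0.0, the source alertname would still be DeadMansSwitch.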

【Comments】:

To get around the changing alert name (Watchdog), you can add a recording rule with the expression vector(1) and use it in the inhibit config.
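
The suggestion above could be sketched roughly as follows. Because inhibit rules match alerts rather than recorded series, this sketch uses an alerting rule (not a recording rule) with the expression vector(1). The alert name AlwaysFiring and the group name are hypothetical, and how the rule file gets loaded depends on the deployment:

```yaml
# --- File 1: Prometheus rule file (sketch) ---
groups:
  - name: custom-inhibitors      # hypothetical group name
    rules:
      - alert: AlwaysFiring      # hypothetical name, under your own control
        expr: vector(1)          # always returns a sample, so the alert always fires
        labels:
          severity: none
---
# --- File 2: Alertmanager configuration fragment (sketch) ---
inhibit_rules:
  - source_match:
      alertname: 'AlwaysFiring'
    target_match:
      alertname: 'CPUThrottlingHigh'
    equal: ['prometheus']
```

Note that the always-firing alert would itself be routed like any other alert, so it would likely need its own route to a receiver that sends no notifications.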

【Solution 3】:

One option is to route the alerts you want to silence to a "null" receiver. In alertmanager.yaml:

route:
  # Other settings...
  group_wait: 0s
  group_interval: 1m
  repeat_interval: 1h

  # Default receiver.
  receiver: "null"

  routes:
  # continue defaults to false, so the first match will end routing.
  - match:
      # This was previously named DeadMansSwitch
      alertname: Watchdog
    receiver: "null"
  - match:
      alertname: CPUThrottlingHigh
    receiver: "null"
  - receiver: "regular_alert_receiver"

receivers:
  - name: "null"
  - name: regular_alert_receiver
    <snip>

【Comments】:

【Solution 4】:

You can silence alerts by sending them through Robusta. (Disclaimer: I wrote Robusta.)

Here is an example:

- triggers:
  - on_prometheus_alert: {}
  actions:
  - name_silencer:
      names: ["Watchdog", "CPUThrottlingHigh"]

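However, this might not be what you want to do!

Some CPUThrottlingHigh alerts are spammy and can't be fixed, like the one for metrics-server on GKE.

In general, however, the alert is meaningful and can indicate a real problem. Typically the best practice is to change or remove the pod's CPU limit.

I spent more time than I'd like to admit looking at CPUThrottlingHigh when I wrote an automated playbook for Robusta that analyzes each CPUThrottlingHigh alert and recommends the best practice.

【Comments】: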