Prometheus query to calculate avg_over_time up-time, but want to ignore down-time less than 1 minute

【Question title】: Prometheus query to calculate avg_over_time up-time, but want to ignore down-time less than 1 minute
【Posted】: 2020-04-02 02:44:11
【Question】:

I am new to Prometheus and wrote the query below to show a website's average uptime as a percentage for SLA monitoring (Google, for example).

(avg_over_time(probe_success{instance="https://www.google.com/"}[$__range])) * 100

However, is it possible to make the calculation ignore any downtime shorter than 1 minute?


Tags: prometheus grafana prometheus-node-exporter prometheus-blackbox-exporter

【Solution 1】:

The best way to define an SLA for a probe is to use a quantile function, for example:

quantile_over_time(0.99, probe_success{instance="https://www.google.com/"}[$__range])

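This is not the exact query to use, but quantiles are the right starting point.

That said, to answer the question directly and ignore downtime of under 1 minute, this should help: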

avg_over_time(((avg_over_time(probe_success{instance="https://www.google.com"}[75s]) * 75) > bool(60))[$__range:]) * 100

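Now let's dissect this query:

avg_over_time(probe_success{instance="https://www.google.com"}[75s]) gives the average of the probe results over the last 75 seconds (a value between 0 and 1), which is the window used to try to ignore downtime of under 1m. Call it UP_TIME_PERCENTAGE.

UP_TIME_PERCENTAGE * 75 gives the uptime within the last 75 seconds, in seconds. Call it UP_TIME_75S.

UP_TIME_75S > bool(60) gives a time series of 1 or 0 values indicating whether that uptime exceeds 60 seconds. Call it IS_UP_MORE_THAN_1M.

avg_over_time(IS_UP_MORE_THAN_1M[$__range:]) * 100 gives the percentage of the given $__range during which the probe counted as up for more than 1m. Note: the ..._over_time function has to be applied to a subquery here. The colon in [$__range:] is what turns the range into a subquery, which is required because the inner expression is not a plain series selector.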
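
To make the arithmetic in the steps above concrete, here is a rough Python sketch that replays them on a synthetic list of probe results. It is not how Prometheus evaluates the query. The 5-second probe interval, the function names and the sample data are all assumptions made for illustration, and the window handling is simplified to "the last N samples".

```python
from typing import List

# Assumptions for this sketch (not stated in the thread):
PROBE_INTERVAL_S = 5   # hypothetical probe/scrape interval
WINDOW_S = 75          # inner range from the query: [75s]
THRESHOLD_S = 60       # threshold from the query: > bool(60)


def raw_uptime_percent(samples: List[int]) -> float:
    """Plain avg_over_time(probe_success[...]) * 100 over all samples."""
    return sum(samples) / len(samples) * 100


def smoothed_uptime_percent(samples: List[int]) -> float:
    """Replay the nested query on a list of 0/1 probe results."""
    per_window = WINDOW_S // PROBE_INTERVAL_S  # samples per 75s window
    flags = []
    for end in range(len(samples)):
        # Simplified window: the last `per_window` samples up to `end`.
        start = max(0, end - per_window + 1)
        window = samples[start:end + 1]
        # UP_TIME_75S = UP_TIME_PERCENTAGE * 75. Multiplying before
        # dividing keeps the result exact for these integer inputs.
        up_seconds = sum(window) * WINDOW_S / len(window)
        # IS_UP_MORE_THAN_1M = UP_TIME_75S > bool(60)
        flags.append(1 if up_seconds > THRESHOLD_S else 0)
    # Outer avg_over_time(...[$__range:]) * 100
    return sum(flags) / len(flags) * 100


def with_outage(total: int, first: int, length: int) -> List[int]:
    """Build `total` successful probes with `length` failures from `first`."""
    return [0 if first <= i < first + length else 1 for i in range(total)]


if __name__ == "__main__":
    short = with_outage(120, 60, 2)    # 10 seconds of failed probes
    long = with_outage(120, 60, 12)    # 60 seconds of failed probes
    print(raw_uptime_percent(short), smoothed_uptime_percent(short))
    print(raw_uptime_percent(long), smoothed_uptime_percent(long))
```

With these assumed inputs, the 10-second outage lowers the plain average but leaves the smoothed figure at 100%, while the 60-second outage lowers both. In this sketch a 75-second window is flagged as down once it contains 15 seconds or more of failed probes, so the window and threshold values are the knobs to adjust when tuning how much downtime gets ignored.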
