GKE: How to alert on memory request/allocatable ratio? (Answers)

【Question title】: GKE: How to alert on memory request/allocatable ratio?
【Posted】: 2020-07-01 01:16:39
【Question description】:

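I have a GKE cluster and I want to track the ratio between the total memory requested and the total memory allocatable. I was able to create a chart in Google Cloud Monitoring using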

metric.type="kubernetes.io/container/memory/request_bytes" resource.type="k8s_container"

and

metric.type="kubernetes.io/node/memory/allocatable_bytes" resource.type="k8s_node"

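Both have crossSeriesReducer set to REDUCE_SUM in order to get the total for the whole cluster.

Then, when I tried to set up an alerting policy (using the Cloud Monitoring API) on the ratio of the two (following this), I got this error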

ERROR: (gcloud.alpha.monitoring.policies.create) INVALID_ARGUMENT: The numerator and denominator must have the same resource type.

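It doesn't like that the first metric is a k8s_container and the second metric is a k8s_node. Is there a different metric or some sort of workaround I can use to alert on the memory request/allocatable ratio in Google Cloud Monitoring?

EDIT:

Here is the full request and response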

$ gcloud alpha monitoring policies create --policy-from-file=policy.json
ERROR: (gcloud.alpha.monitoring.policies.create) INVALID_ARGUMENT: The numerator and denominator must have the same resource type.

$ cat policy.json
{
    "displayName": "Cluster Memory",
    "enabled": true,
    "combiner": "OR",
    "conditions": [
        {
            "displayName": "Ratio: Memory Requests / Memory Allocatable",
            "conditionThreshold": {
                "filter": "metric.type=\"kubernetes.io/container/memory/request_bytes\" resource.type=\"k8s_container\"",
                "aggregations": [
                    {
                        "alignmentPeriod": "60s",
                        "crossSeriesReducer": "REDUCE_SUM",
                        "groupByFields": [],
                        "perSeriesAligner": "ALIGN_MEAN"
                    }
                ],
                "denominatorFilter": "metric.type=\"kubernetes.io/node/memory/allocatable_bytes\" resource.type=\"k8s_node\"",
                "denominatorAggregations": [
                    {
                        "alignmentPeriod": "60s",
                        "crossSeriesReducer": "REDUCE_SUM",
                        "groupByFields": [],
                        "perSeriesAligner": "ALIGN_MEAN"
                    }
                ],
                "comparison": "COMPARISON_GT",
                "thresholdValue": 0.8,
                "duration": "60s",
                "trigger": {
                    "count": 1
                }
            }
        }
    ]
}

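【Question discussion】:

If possible, please edit your post and show the request along with the response code that was returned.
@DawidKruk Edited. I used the gcloud CLI, so I don't see the API response code, but I'd guess it is a 4xx code.

Tags: google-cloud-platform google-kubernetes-engine stackdriver google-cloud-stackdriver google-cloud-monitoring

【Solution 1】: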

ERROR: (gcloud.alpha.monitoring.policies.create) INVALID_ARGUMENT: The numerator and denominator must have the same resource type.

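As per the official documentation:

groupByFields[] - parameter

The set of fields to preserve when crossSeriesReducer is specified. The groupByFields determine how the time series are partitioned into subsets prior to applying the aggregation operation. Each subset contains time series that have the same value for each of the grouping fields. Each individual time series is a member of exactly one subset. The crossSeriesReducer is applied to each subset of time series. It is not possible to reduce across different resource types, so this field implicitly contains resource.type. Fields not specified in groupByFields are aggregated away. If groupByFields is not specified and all the time series have the same resource type, then the time series are aggregated into a single output time series. If crossSeriesReducer is not defined, this field is ignored.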

-- Cloud.google.com: Monitoring: projects.alertPolicies

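Please take a look at this specific part:

It is not possible to reduce across different resource types, so this field implicitly contains resource.type.

The error above appears when you try to create a policy whose numerator and denominator use different resource types.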
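To illustrate the quoted behaviour, here is a toy model of a sum reducer in plain Python. It is not the Cloud Monitoring API: the `reduce_sum` function, the dictionary layout, and the sample values are assumptions made up for this sketch. The only point is that the resource type is always part of the grouping key, so container and node series never collapse into one series, even with an empty groupByFields.

```python
from collections import defaultdict


def reduce_sum(series, group_by_fields=()):
    """Toy REDUCE_SUM: sum values per subset of time series.

    The resource type is always the first element of the grouping key,
    mimicking "this field implicitly contains resource.type".
    """
    totals = defaultdict(float)
    for s in series:
        key = (s["resource_type"],) + tuple(
            s["labels"].get(field) for field in group_by_fields
        )
        totals[key] += s["value"]
    return dict(totals)


# Hypothetical sample points: two container series and one node series.
series = [
    {"resource_type": "k8s_container", "labels": {"cluster_name": "c1"}, "value": 2.0},
    {"resource_type": "k8s_container", "labels": {"cluster_name": "c1"}, "value": 3.0},
    {"resource_type": "k8s_node", "labels": {"cluster_name": "c1"}, "value": 16.0},
]

# No explicit group-by fields, yet two output series remain, one per resource type.
result = reduce_sum(series)
```

Because the two sums live in separate per-resource-type series, a ratio condition that takes the numerator from one and the denominator from the other is rejected.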

The metrics in question have the following resource types:

kubernetes.io/container/memory/request_bytes - k8s_container
kubernetes.io/node/memory/allocatable_bytes - k8s_node
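A mismatch like this can be caught locally before calling gcloud. The sketch below is a hypothetical pre-flight helper, not part of any Google library: it assumes the filters spell the resource type as `resource.type="..."`, as the filters in the question do, and simply compares the two values with a regular expression.

```python
import re

# Matches resource.type="k8s_node" style clauses inside a monitoring filter.
RESOURCE_TYPE_RE = re.compile(r'resource\.type\s*=\s*"([^"]+)"')


def resource_type_of(filter_string):
    """Return the resource type named in a filter string, or None if absent."""
    match = RESOURCE_TYPE_RE.search(filter_string)
    return match.group(1) if match else None


def ratio_resource_types(condition_threshold):
    """Return (numerator_type, denominator_type) for a conditionThreshold dict."""
    return (
        resource_type_of(condition_threshold.get("filter", "")),
        resource_type_of(condition_threshold.get("denominatorFilter", "")),
    )


# The two filters from the question's policy.json.
condition = {
    "filter": 'metric.type="kubernetes.io/container/memory/request_bytes" '
              'resource.type="k8s_container"',
    "denominatorFilter": 'metric.type="kubernetes.io/node/memory/allocatable_bytes" '
                         'resource.type="k8s_node"',
}

numerator_type, denominator_type = ratio_resource_types(condition)
types_match = numerator_type == denominator_type
```

For the question's policy, `types_match` is false, which corresponds to the INVALID_ARGUMENT error returned by the API.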

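You can check the resource type by looking at the metric in GCP Monitoring:

As a workaround, you can try creating an alerting policy that alerts you when the allocatable memory utilization exceeds 85%. It will indirectly tell you that the requested memory is high enough to trigger the alert.

An example in YAML: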

combiner: OR
conditions:
- conditionThreshold:
    aggregations:
    - alignmentPeriod: 60s
      crossSeriesReducer: REDUCE_SUM
      groupByFields:
      - resource.label.cluster_name
      perSeriesAligner: ALIGN_MEAN
    comparison: COMPARISON_GT
    duration: 60s
    filter: metric.type="kubernetes.io/node/memory/allocatable_utilization" resource.type="k8s_node"
      resource.label."cluster_name"="GKE-CLUSTER-NAME"
    thresholdValue: 0.85
    trigger:
      count: 1
  displayName: Memory allocatable utilization for GKE-CLUSTER-NAME by label.cluster_name
    [SUM]
  name: projects/XX-YY-ZZ/alertPolicies/AAA/conditions/BBB
creationRecord:
  mutateTime: '2020-03-31T08:29:21.443831070Z'
  mutatedBy: XXX@YYY.com
displayName: alerting-policy-when-allocatable-memory-is-above-85
enabled: true
mutationRecord:
  mutateTime: '2020-03-31T08:29:21.443831070Z'
  mutatedBy: XXX@YYY.com
name: projects/XX-YY-ZZ/alertPolicies/

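An example from the GCP Monitoring web interface:

Please let me know if you have any questions about this.

EDIT:

To create an alerting policy that shows relevant data, you need to take many factors into account, such as:

Workload type
Number of nodes and node pools
Node affinity (for example: scheduling certain types of workload on GPU nodes)
etc.

For a more advanced alerting policy that considers allocatable memory per node pool, you can do something like this: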

combiner: OR
conditions:
- conditionThreshold:
    aggregations:
    - alignmentPeriod: 60s
      crossSeriesReducer: REDUCE_SUM
      groupByFields:
      - metadata.user_labels."cloud.google.com/gke-nodepool"
      perSeriesAligner: ALIGN_MEAN
    comparison: COMPARISON_GT
    duration: 60s
    filter: metric.type="kubernetes.io/node/memory/allocatable_utilization" resource.type="k8s_node"
      resource.label."cluster_name"="CLUSTER_NAME"
    thresholdValue: 0.85
    trigger:
      count: 1
  displayName: Memory allocatable utilization (filtered) (grouped) [SUM]
creationRecord:
  mutateTime: '2020-03-31T18:03:20.325259198Z'
  mutatedBy: XXX@YYY.ZZZ
displayName: allocatable-memory-per-node-pool-above-85
enabled: true
mutationRecord:
  mutateTime: '2020-03-31T18:18:57.169590414Z'
  mutatedBy: XXX@YYY.ZZZ

Please note that there is a bug (see Groups.google.com: Google Stackdriver discussion): the only way to create the above alerting policy is from the command line.
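The YAML above looks like an export of an existing policy, since it carries name, creationRecord and mutationRecord. As far as I understand the alertPolicies reference, the name fields should be left out on create and the record fields are read-only, so it may be cleaner to drop them before passing the file to gcloud. The sketch below is one way to do that; the `to_create_request` helper and the `exported` sample dict are assumptions for illustration, with placeholder values copied from the answer.

```python
import copy
import json

# Top-level fields that, to my understanding, are assigned by the server.
SERVER_SET_POLICY_FIELDS = ("name", "creationRecord", "mutationRecord")


def to_create_request(exported_policy):
    """Return a copy of an exported policy without server-assigned fields."""
    policy = copy.deepcopy(exported_policy)
    for field in SERVER_SET_POLICY_FIELDS:
        policy.pop(field, None)
    for condition in policy.get("conditions", []):
        condition.pop("name", None)  # condition names are also server-assigned
    return policy


# Trimmed-down, hypothetical version of the per-node-pool policy above.
exported = {
    "displayName": "allocatable-memory-per-node-pool-above-85",
    "enabled": True,
    "combiner": "OR",
    "name": "projects/XX-YY-ZZ/alertPolicies/AAA",
    "creationRecord": {"mutatedBy": "XXX@YYY.ZZZ"},
    "mutationRecord": {"mutatedBy": "XXX@YYY.ZZZ"},
    "conditions": [
        {
            "name": "projects/XX-YY-ZZ/alertPolicies/AAA/conditions/BBB",
            "displayName": "Memory allocatable utilization (filtered) (grouped) [SUM]",
            "conditionThreshold": {
                "comparison": "COMPARISON_GT",
                "thresholdValue": 0.85,
                "duration": "60s",
            },
        }
    ],
}

request_body = to_create_request(exported)
policy_json = json.dumps(request_body, indent=2)
```

Saving `policy_json` to a file such as policy.json would then allow the same command used in the question: gcloud alpha monitoring policies create --policy-from-file=policy.json.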

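【Discussion】:

Thanks for the great answer! The only problem with the allocatable-utilization workaround is that it may be hard to aggregate that metric across many nodes that all have different amounts of allocatable memory. Can you think of any other possible workarounds?
@JesseShieh Please take a look at my edited answer. I added a more advanced alerting policy that works per node pool. The allocatable memory metric can also be used per node. Regarding the allocatable utilization metric, the documentation says: The fraction of the allocatable memory that is currently in use on the instance. This value cannot exceed 1 as usage cannot exceed allocatable memory bytes. It always takes the amount of memory into account.
Thanks again for the great answer. It looks like this will essentially alert whenever any node pool is highly utilized (above 85%), but is there a way to alert when the utilization of the whole cluster is above 85%? For example, say there are 32 4GB machines all at 100% utilization, and one 128GB machine at 0% utilization. The whole cluster is at 50% utilization, so the alert should not fire.
It should not fire, because workloads should also be scheduled onto the 1x128GB machine. If they are not, there may be a reason for it (for example, node affinity). The alerting policy should be designed with that in mind. Also, you can always check the utilization of each node.
Got it. I'll mark your answer as accepted, but I still hope that one day I'll be able to see the ratio of total requests to total allocatable as a single number for the whole cluster.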
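
To make the arithmetic in the 32 x 4GB plus 1 x 128GB example concrete, here is a small sketch in plain Python. It does not query Cloud Monitoring; the function names and the (used, allocatable) tuples are assumptions for illustration. It contrasts the cluster-wide ratio (total used over total allocatable) with an unweighted mean of per-node utilization fractions, which is roughly what aggregating a per-node fraction gives.

```python
def cluster_utilization(nodes):
    """Total used memory divided by total allocatable memory."""
    total_used = sum(used for used, _ in nodes)
    total_allocatable = sum(allocatable for _, allocatable in nodes)
    return total_used / total_allocatable


def mean_node_utilization(nodes):
    """Unweighted mean of each node's used/allocatable fraction."""
    return sum(used / allocatable for used, allocatable in nodes) / len(nodes)


# (used_gb, allocatable_gb): 32 full 4GB nodes and one idle 128GB node.
nodes = [(4.0, 4.0)] * 32 + [(0.0, 128.0)]

weighted = cluster_utilization(nodes)      # 128 / 256
unweighted = mean_node_utilization(nodes)  # 32 / 33
```

The weighted figure is 0.5, below an 0.85 threshold, while the unweighted mean is about 0.97 and would cross it. That gap is why a single cluster-wide number needs the two byte totals rather than an aggregate of per-node fractions.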