【Question title】:CWAgent metric alarms created using Terraform don't get data points collected (stay in Insufficient data)
【Posted】:2019-11-22 00:30:20
【Question】:

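I created a CloudWatch memory utilization alarm with Terraform, but the alarm never moves to the OK state (it stays in INSUFFICIENT_DATA). When I create the same alarm with the same configuration manually from the AWS Management Console, it moves to OK and I can see data points.

The CloudWatch agent is installed on the EC2 instance the alarm targets, and I can see its metrics in the CloudWatch Metrics section.

My Terraform code: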

resource "aws_cloudwatch_metric_alarm" "memory" {
  alarm_name = "memory-utilization-alarm-${var.env}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name = "mem_used_percent"
  namespace = "CWAgent"
  period = "300"
  statistic = "Average"
  threshold = "${var.alarms_memory_threshold}"
  alarm_description = "This metric monitors ec2 memory utilization"
  alarm_actions = [ "${aws_sns_topic.sns_topic.arn}" ]

  dimensions = {
    InstanceId = "${var.instance_id}"
    ImageId = "${var.ami_id}"
  }

  tags = {
    Environment = "${var.env}"
    Project = "${var.project}"
    Provisioner="cloudwatch"
    Name = "${local.name}.memory"
  }
}

AWS CLI output describing the alarm created with Terraform:

aws cloudwatch describe-alarms --alarm-names memory-utilization-alarm-dev
{
    "MetricAlarms": [
        {
            "EvaluationPeriods": 1, 
            "TreatMissingData": "missing", 
            "AlarmArn": "arn:aws:cloudwatch:us-west-2:289914521333:alarm:memory-utilization-alarm-dev", 
            "StateUpdatedTimestamp": "2019-07-12T08:45:07.020Z", 
            "AlarmConfigurationUpdatedTimestamp": "2019-07-12T08:45:07.020Z", 
            "ComparisonOperator": "GreaterThanOrEqualToThreshold", 
            "AlarmActions": [
                "arn:aws:sns:us-west-2:289914521333:sns-topic"
            ], 
            "AlarmDescription": "This metric monitors ec2 memory utilization", 
            "Namespace": "CWAgent", 
            "Period": 300, 
            "StateValue": "INSUFFICIENT_DATA", 
            "Threshold": 80.0, 
            "AlarmName": "memory-utilization-alarm-dev", 
            "Dimensions": [
                {
                    "Name": "InstanceId", 
                    "Value": "i-03417f2d90d3dc6ca"
                }, 
                {
                    "Name": "ImageId", 
                    "Value": "ami-09d1383e2a5ae8a93"
                }
            ], 
            "Statistic": "Average", 
            "StateReason": "Unchecked: Initial alarm creation", 
            "InsufficientDataActions": [], 
            "OKActions": [], 
            "ActionsEnabled": true, 
            "MetricName": "mem_used_percent"
        }
    ]
}

AWS CLI output describing the alarm created with the AWS console:

aws cloudwatch describe-alarms --alarm-names memory-utilization-alarm
{
    "MetricAlarms": [
        {
            "Dimensions": [
                {
                    "Name": "InstanceId", 
                    "Value": "i-03417f2d90d3dc6ca"
                }, 
                {
                    "Name": "ImageId", 
                    "Value": "ami-09d1383e2a5ae8a93"
                }, 
                {
                    "Name": "InstanceType", 
                    "Value": "t3.large"
                }
            ], 
            "Namespace": "CWAgent", 
            "DatapointsToAlarm": 1, 
            "ActionsEnabled": true, 
            "MetricName": "mem_used_percent", 
            "EvaluationPeriods": 1, 
            "StateValue": "OK", 
            "StateUpdatedTimestamp": "2019-07-12T09:49:28.749Z", 
            "AlarmConfigurationUpdatedTimestamp": "2019-07-12T09:47:55.914Z", 
            "AlarmActions": [
                "arn:aws:sns:us-west-2:289914521333:sns-topic"
            ], 
            "InsufficientDataActions": [], 
            "AlarmArn": "arn:aws:cloudwatch:us-west-2:289914521333:alarm:memory-utilization-alarm", 
            "StateReasonData": "{\"version\":\"1.0\",\"queryDate\":\"2019-07-12T09:49:28.746+0000\",\"startDate\":\"2019-07-12T09:44:00.000+0000\",\"statistic\":\"Average\",\"period\":300,\"recentDatapoints\":[61.253520518958474],\"threshold\":80.0}", 
            "Threshold": 80.0, 
            "StateReason": "Threshold Crossed: 1 out of the last 1 datapoints [61.253520518958474 (12/07/19 09:44:00)] was not greater than or equal to the threshold (80.0) (minimum 1 datapoint for ALARM -> OK transition).", 
            "OKActions": [], 
            "AlarmDescription": "memory-utilization-alarm", 
            "Period": 300, 
            "ComparisonOperator": "GreaterThanOrEqualToThreshold", 
            "AlarmName": "memory-utilization-alarm", 
            "Statistic": "Average", 
            "TreatMissingData": "missing"
        }
    ]
}

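【Comments】:

  • Can you share the Terraform code that creates the alarm? Also, what is the output of describing the Terraform-created alarm and the console-created alarm? You can get it with aws cloudwatch describe-alarms --alarm-names [ALARM NAME]
  • I've added the code here; I had formatting problems earlier :)
  • mem_used_percent has an additional dimension, InstanceType
  • I've added what you asked for: the describe-alarms output for both alarms. Is anything missing besides InstanceType?
  • Adding InstanceType worked! @ydaetskcoR thank you very much for your help

Tags: amazon-web-services terraform terraform-provider-aws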


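【Solution 1】:

The mem_used_percent metric published by the CloudWatch agent has three dimensions: InstanceId, ImageId, and InstanceType. The AWS user guide does not currently list the dimensions for each metric, but you can find them with the following AWS CLI command: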

$ aws cloudwatch list-metrics --namespace CWAgent --metric-name mem_used_percent --query 'Metrics[0].Dimensions[].Name'
[
    "InstanceId", 
    "ImageId", 
    "InstanceType"
]

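CloudWatch treats each unique combination of dimensions as a separate metric, so an alarm that specifies only InstanceId and ImageId matches no published metric and never receives data. To fix the alarm, change the alarm definition to include the InstanceType dimension: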

resource "aws_cloudwatch_metric_alarm" "memory" {
  alarm_name = "memory-utilization-alarm-${var.env}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name = "mem_used_percent"
  namespace = "CWAgent"
  period = "300"
  statistic = "Average"
  threshold = "${var.alarms_memory_threshold}"
  alarm_description = "This metric monitors ec2 memory utilization"
  alarm_actions = [ "${aws_sns_topic.sns_topic.arn}" ]

  dimensions = {
    InstanceId = "${var.instance_id}"
    ImageId = "${var.ami_id}"
    InstanceType = "${var.instance_type}"
  }

  tags = {
    Environment = "${var.env}"
    Project = "${var.project}"
    Provisioner="cloudwatch"
    Name = "${local.name}.memory"
  }
}
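
To keep the three dimensions from drifting out of sync, one option is to look up the AMI and instance type from the instance itself instead of passing them in as separate variables. The following is a minimal sketch, not part of the original answer. It assumes Terraform 0.12 or later syntax, the existing var.instance_id, var.env, var.alarms_memory_threshold, and aws_sns_topic.sns_topic from the question, and a hypothetical data source label "target". Tags are omitted for brevity.

```hcl
# Look up the instance so its AMI and type come from AWS itself.
# "target" is a hypothetical label chosen for this sketch.
data "aws_instance" "target" {
  instance_id = var.instance_id
}

resource "aws_cloudwatch_metric_alarm" "memory" {
  alarm_name          = "memory-utilization-alarm-${var.env}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = 1
  metric_name         = "mem_used_percent"
  namespace           = "CWAgent"
  period              = 300
  statistic           = "Average"
  threshold           = var.alarms_memory_threshold
  alarm_description   = "This metric monitors ec2 memory utilization"
  alarm_actions       = [aws_sns_topic.sns_topic.arn]

  # All three dimensions must match what the agent publishes,
  # including the exact key spelling (InstanceId, not InstanceID).
  dimensions = {
    InstanceId   = data.aws_instance.target.id
    ImageId      = data.aws_instance.target.ami
    InstanceType = data.aws_instance.target.instance_type
  }
}
```

This only helps if the agent publishes exactly these three dimensions for the instance. If the agent configuration appends different dimensions, the list-metrics command above remains the way to confirm what the alarm has to match.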

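【Comments】:

  • Thanks, this helped a lot. My problem was a typo: I had written InstanceID instead of InstanceId. Comparing the JSON output of the working, manually created alarm with that of the non-working Terraform-created alarm also helped.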