Creating Alarm in AWS for EC2 Instances: Answers

【Question title】: Creating Alarm in AWS for EC2 Instances
【Posted】: 2016-10-03 09:57:58
【Question】:

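How can I create alarms that fire when 1) an EC2 instance has been running for too long (say, 1 hour), and 2) the number of EC2 instances reaches a threshold (say, 5 instances at a time)?

A further requirement is that the alarms apply only to specific EC2 instances: say, those whose instance name starts with "test".

When I try to create an alarm, I don't see anything under Metrics that covers this logic. The standard metrics include CPU utilization, network in, network out, and so on.

Is there a way to create these alarms by defining a custom metric, or through some other option?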

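【Comments】:

Have you gone through the CloudWatch documentation? There are APIs that let you publish metrics.
@Shibashis I looked at publishing metrics, but I'm not sure where the logic behind a metric is defined. For example, in docs.aws.amazon.com/cli/latest/reference/cloudwatch/… I only see the metric name, the statistic values, and the unit being defined. What I want is, say, a metric named EC2InstanceHealthDuration (how long an EC2 instance has been running since it was launched). There should be some Unix script containing the logic that the metric name stands for. Please let me know where I can find that.
CloudWatch only collects statistics and lets you create alarms on them. You need to write a script that checks how long an instance has been up and then pushes that metric to CloudWatch. If you don't want to develop that logic yourself, you may want to consider other software such as Datadog, New Relic, Nagios, etc.
Thanks @Shibashis. As for the requirements above, 1) an EC2 instance running for too long (say, 1 hour) and 2) alerting when the number of EC2 instances reaches a threshold (say, 5 instances at a time), neither can be done with the standard metrics. If we write a custom metric, the script's logic has to live somewhere, but in put-metric-data (docs.aws.amazon.com/cli/latest/reference/cloudwatch/…) I see no option for supplying a script. It would be great if you could shed some light on this.
You have many options for pushing the metric. I'd suggest one of these three: 1> put the script on the instance itself; 2> create another EC2 instance on which the script runs; 3> use an AWS Lambda scheduled event to push the metric.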
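
To make the comments above concrete: CloudWatch does not run any script for you, so the "logic" lives in whatever process calls put-metric-data. A minimal sketch of such a process in Python with boto3 might look like the following. The metric name `InstanceUptime`, the namespace `Custom/EC2`, and both function names are assumptions made for illustration; none of them appear in the thread.

```python
import datetime


def build_uptime_datum(instance_id, launch_time, now=None):
    """Build one CloudWatch MetricData entry holding an instance's uptime in seconds."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    uptime_seconds = (now - launch_time).total_seconds()
    return {
        "MetricName": "InstanceUptime",  # hypothetical metric name
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Timestamp": now,
        "Value": uptime_seconds,
        "Unit": "Seconds",
    }


def publish_uptime(region_name):
    """Push the uptime of every running instance in one region to CloudWatch."""
    import boto3  # imported lazily so build_uptime_datum works without boto3

    ec2 = boto3.resource("ec2", region_name=region_name)
    cloudwatch = boto3.client("cloudwatch", region_name=region_name)
    running = ec2.instances.filter(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    # boto3 returns launch_time as a timezone-aware UTC datetime.
    data = [build_uptime_datum(i.instance_id, i.launch_time) for i in running]
    # Send in small batches to keep each request modest in size.
    for start in range(0, len(data), 20):
        cloudwatch.put_metric_data(
            Namespace="Custom/EC2",  # hypothetical namespace
            MetricData=data[start:start + 20],
        )
```

`publish_uptime` could be run from cron on an instance or from a scheduled Lambda function, matching the three options listed in the last comment.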

Tags: amazon-web-services amazon-ec2 alarm

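【Solution 1】:

For automatically deployed instances you cannot set up a CloudWatch alarm in advance, because you don't know the instance IDs. The only way to set up an alert is to create an AWS Lambda function that iterates over all running instances and compares their launch times with a specified timeout.

The Lambda function is triggered periodically by a CloudWatch Events rule.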
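
The answer does not show how the rule is created. One possible way, sketched here with boto3, is to create a scheduled rule, grant it permission to invoke the function, and register the function as its target. The rule naming scheme, the statement ID, and the 15-minute default are assumptions made for illustration only.

```python
def rate_expression(value, unit="minute"):
    """Return a CloudWatch Events schedule expression such as 'rate(15 minutes)'."""
    if unit not in ("minute", "hour", "day"):
        raise ValueError("unit must be 'minute', 'hour' or 'day'")
    if value < 1:
        raise ValueError("value must be a positive integer")
    # The unit is singular for a value of 1 and plural otherwise.
    return "rate({0} {1}{2})".format(value, unit, "" if value == 1 else "s")


def schedule_lambda(function_name, function_arn, minutes=15):
    """Create a scheduled rule that invokes an existing Lambda function."""
    import boto3  # imported lazily so rate_expression works without boto3

    events = boto3.client("events")
    rule_name = function_name + "-schedule"  # hypothetical naming scheme
    rule_arn = events.put_rule(
        Name=rule_name,
        ScheduleExpression=rate_expression(minutes),
        State="ENABLED",
    )["RuleArn"]
    # Allow the rule to invoke the function, then register it as the target.
    boto3.client("lambda").add_permission(
        FunctionName=function_name,
        StatementId=rule_name + "-invoke",  # hypothetical statement ID
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )
    events.put_targets(Rule=rule_name, Targets=[{"Id": "1", "Arn": function_arn}])
```

The interval bounds how late an alert can arrive: with a 15-minute rate, an instance may exceed its timeout by up to 15 minutes before anyone is notified.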

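Use tags to specify different permitted run durations for different machines. For example, your launch tool should tag the instance with the key/value "Test".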
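
The question also asks to restrict the check to instances whose name starts with "test". The instance name is stored in the `Name` tag, so this can be handled with the same tag mechanism. The sketch below shows two hedged variants: a local check on the tag list, and a server-side EC2 filter using a `*` wildcard. Both function names are hypothetical.

```python
def name_starts_with(tags, prefix):
    """Return True if the instance's Name tag starts with the given prefix."""
    for tag in tags or []:  # boto3 reports untagged instances as tags=None
        if tag["Key"] == "Name":
            return tag["Value"].startswith(prefix)
    return False


def running_instances_named(region_name, prefix="test"):
    """List running instances whose Name tag starts with prefix (filtered by EC2)."""
    import boto3  # imported lazily so name_starts_with works without boto3

    ec2 = boto3.resource("ec2", region_name=region_name)
    return list(ec2.instances.filter(Filters=[
        {"Name": "instance-state-name", "Values": ["running"]},
        # EC2 filter values accept '*' as a wildcard and are case-sensitive.
        {"Name": "tag:Name", "Values": [prefix + "*"]},
    ]))
```

The length of the list returned by `running_instances_named` also gives the instance count needed for the second requirement.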

Note that this code comes with NO warranty at all! It is meant more as an example.

import boto3
import datetime
import json
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

ec2_client = boto3.client('ec2')

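# Hours a tagged instance may run before an alert is sent
INSTANCE_TIMEOUT = 24
# Hours an untagged instance may run. This constant was used but never defined
# in the original code; the value here is an assumption.
UNKNOWN_INSTANCE_TIMEOUT = 24
# Maximum number of running instances allowed per region
MAX_PERMITTED_INSTANCES = 5
MAILING_LIST = "jon.doh@simpsons.duf, drunk@beer.com"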

def parse_tag(tags, keyValuePair):
    for tag in tags:
        if tag['Key'] == keyValuePair[0] and tag['Value'] == keyValuePair[1]:
            return True
    return False

def runtimeExceeded(instance, timeOutHours):
    # Working in UTC to avoid time-travel during daylight-saving changeover
    timeNow = datetime.datetime.utcnow()
    # launch_time is timezone-aware (UTC); drop tzinfo to compare with naive timeNow
    instanceRuntime = timeNow - instance.launch_time.replace(tzinfo=None)
    print(instanceRuntime)
    return instanceRuntime > datetime.timedelta(hours=timeOutHours)

def sendAlert(instance, message):
    # instance may be None for alerts that do not concern a single instance
    msg = MIMEMultipart()
    msg['From'] = 'AWS_Notification@sourcevertex.net'
    msg['To'] = MAILING_LIST
    msg['Subject'] = "AWS Alert: " + message
    bodyText = '\n\nThis message was sent by the AWS Monitor ' + \
        'Lambda. For details see AwsConsole-Lambdas. \n\nIf you want to ' + \
        'exclude an instance from this monitor, tag it ' + \
        'with Key=RuntimeMonitor Value=False'

    details = ''
    if instance is not None:
        details = '\nInstance ID: ' + str(instance.instance_id) + \
            '\nIn Availability zone: ' + \
            str(instance.placement['AvailabilityZone'])
    msg.attach(MIMEText(message + details + bodyText))

    ses = boto3.client('ses')
    ses.send_raw_email(RawMessage={'Data' : msg.as_string()})

def lambda_handler(event, context):
    aws_regions = ec2_client.describe_regions()['Regions']
    for region in aws_regions:
        aws_region = region['RegionName']
        runningInstancesCount = 0
        try:
            ec2_resource = boto3.resource('ec2', region_name=aws_region)

            for i in ec2_resource.instances.all():
                if i.state['Name'] != 'running':
                    continue
                runningInstancesCount += 1
                if i.tags is None:
                    print("Untagged instance: " + i.instance_id)
                    timeout = UNKNOWN_INSTANCE_TIMEOUT
                elif parse_tag(i.tags, ('RuntimeMonitor', 'False')):
                    # Ignore these instances
                    continue
                else:
                    timeout = INSTANCE_TIMEOUT
                if runtimeExceeded(i, timeout):
                    sendAlert(i, "An EC2 instance has been running " +
                              "for over {0} hours".format(timeout))

        except Exception as e:
            # Skip regions that fail (for example, no access) and carry on
            print(e)
            continue

        # The count is per region; this alert is not tied to a single instance
        if runningInstancesCount > MAX_PERMITTED_INSTANCES:
            sendAlert(None, "Number of running instances exceeded threshold in " +
                      "{0}: {1} running instances".format(aws_region,
                                                          runningInstancesCount))

    return True

【Discussion】:

【Solution 2】:

You can publish the event to CloudWatch as a custom metric, and then use that metric to set up an alarm.
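
As a rough illustration of this approach for the instance-count requirement, the sketch below publishes a count as a custom metric and creates an alarm on it with boto3. The namespace `Custom/EC2`, the metric name `RunningTestInstances`, the alarm name, the 5-minute period, and the SNS topic used for notification are all assumptions and not part of the answer.

```python
def build_count_alarm(threshold, sns_topic_arn):
    """Build put_metric_alarm parameters for an instance-count alarm."""
    return {
        "AlarmName": "too-many-test-instances",  # hypothetical alarm name
        "Namespace": "Custom/EC2",               # hypothetical namespace
        "MetricName": "RunningTestInstances",    # hypothetical metric name
        "Statistic": "Maximum",
        "Period": 300,                           # evaluate over 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        # Fire as soon as the count reaches the threshold.
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def publish_count_and_alarm(running_count, threshold, sns_topic_arn):
    """Publish the current instance count and make sure the alarm exists."""
    import boto3  # imported lazily so build_count_alarm works without boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace="Custom/EC2",
        MetricData=[{
            "MetricName": "RunningTestInstances",
            "Value": running_count,
            "Unit": "Count",
        }],
    )
    cloudwatch.put_metric_alarm(**build_count_alarm(threshold, sns_topic_arn))
```

Something still has to compute `running_count` and call `publish_count_and_alarm` on a schedule, since CloudWatch only stores and evaluates the values it is given.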

【Discussion】: