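【Question title】: AWS Fargate failing healthchecks
【Posted】: 2021-07-11 05:48:37
【Question】:

I have a task definition that I want to launch on AWS with Fargate. There is no load balancing or anything like that for now; I just want to run the task. The definition is as follows: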

{
  "ipcMode": null,
  "executionRoleArn": "arn:aws:iam::941606308749:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "dnsSearchDomains": null,
      "environmentFiles": null,
      "logConfiguration": {
        "logDriver": "awslogs",
        "secretOptions": null,
        "options": {
          "awslogs-group": "/ecs/web",
          "awslogs-region": "eu-central-1",
          "awslogs-stream-prefix": "ecs"
        }
      },
      "entryPoint": null,
      "portMappings": [
        {
          "hostPort": 8000,
          "protocol": "tcp",
          "containerPort": 8000
        }
      ],
      "command": null,
      "linuxParameters": null,
      "cpu": 512,
      "environment": [
        {
          "name": "AWS_STORAGE_BUCKET_NAME",
          "value": "blacksheep-dev2"
        },
        {
          "name": "CELERY_BROKER_HOST",
          "value": "https://sqs.eu-central-1.amazonaws.com/941606308749/BlackSheepLearnsBroker"
        },
        {
          "name": "POSTGRES_DB",
          "value": "postgres"
        },
        {
          "name": "POSTGRES_HOST",
          "value": "blacksheeplearnsdb.c9a9ehc0s9ms.eu-central-1.rds.amazonaws.com"
        },
        {
          "name": "POSTGRES_USER",
          "value": "postgres"
        },
        {
          "name": "ROLLBAR_ENABLED",
          "value": "True"
        }
      ],
      "resourceRequirements": null,
      "ulimits": null,
      "dnsServers": null,
      "mountPoints": [],
      "workingDirectory": null,
      "secrets": null,
      "dockerSecurityOptions": null,
      "memory": null,
      "memoryReservation": 1024,
      "volumesFrom": [],
      "stopTimeout": null,
      "image": "941606308749.dkr.ecr.eu-central-1.amazonaws.com/blacksheeplearns:latest",
      "startTimeout": null,
      "firelensConfiguration": null,
      "dependsOn": null,
      "disableNetworking": null,
      "interactive": null,
      "healthCheck": {
        "retries": 3,
        "command": [
          "CMD-SHELL",
          "curl -f http://localhost:8000/health/ || exit 1"
        ],
        "timeout": 5,
        "interval": 30,
        "startPeriod": 30
      },
      "essential": true,
      "links": null,
      "hostname": null,
      "extraHosts": null,
      "pseudoTerminal": null,
      "user": null,
      "readonlyRootFilesystem": null,
      "dockerLabels": {
        "project": "BlackSheepLearns"
      },
      "systemControls": null,
      "privileged": null,
      "name": "web"
    }
  ],
  "placementConstraints": [],
  "memory": "1024",
  "taskRoleArn": "arn:aws:iam::941606308749:role/ecsTaskRole",
  "compatibilities": [
    "EC2",
    "FARGATE"
  ],
  "taskDefinitionArn": "arn:aws:ecs:eu-central-1:941606308749:task-definition/web:14",
  "family": "web",
  "requiresAttributes": [
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.logging-driver.awslogs"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.execution-role-awslogs"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.ecr-auth"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.21"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.task-iam-role"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.container-health-check"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.execution-role-ecr-pull"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.18"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "ecs.capability.task-eni"
    },
    {
      "targetId": null,
      "targetType": null,
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.29"
    }
  ],
  "pidMode": null,
  "requiresCompatibilities": [
    "FARGATE"
  ],
  "networkMode": "awsvpc",
  "cpu": "512",
  "revision": 14,
  "status": "ACTIVE",
  "inferenceAccelerators": null,
  "proxyConfiguration": null,
  "volumes": []
}

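However, when I launch it, it runs for about 1.5 minutes and is then killed. I suspect this is related to the health check.

At some point it simply receives a termination signal and stops. No target group or load balancer is configured here: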

2021-07-10 10:48:40
[2021-07-10 08:48:40 +0000] [1] [INFO] Shutting down: Master
ccb94e999e294bdbaadc3f941b786603
2021-07-10 10:48:39
[2021-07-10 08:48:39 +0000] [13] [INFO] Worker exiting (pid: 13)
ccb94e999e294bdbaadc3f941b786603
2021-07-10 10:48:39
[2021-07-10 08:48:39 +0000] [14] [INFO] Worker exiting (pid: 14)
ccb94e999e294bdbaadc3f941b786603
2021-07-10 10:48:39
[2021-07-10 08:48:39 +0000] [1] [INFO] Handling signal: term
ccb94e999e294bdbaadc3f941b786603
2021-07-10 10:46:33
WARNING:rollbar:Rollbar already initialized. Ignoring re-init.
ccb94e999e294bdbaadc3f941b786603
2021-07-10 10:46:33
WARNING:rollbar:Rollbar already initialized. Ignoring re-init.
ccb94e999e294bdbaadc3f941b786603
2021-07-10 10:46:32
INFO:root:Retrieving secret: ROLLBAR_KEY
ccb94e999e294bdbaadc3f941b786603
2021-07-10 10:46:32
INFO:root:Retrieving secret: ROLLBAR_KEY
ccb94e999e294bdbaadc3f941b786603
2021-07-10 10:46:32
INFO:root:Retrieving secret: POSTGRES_PASSWORD
ccb94e999e294bdbaadc3f941b786603
2021-07-10 10:46:32
INFO:root:Retrieving secret: POSTGRES_PASSWORD
ccb94e999e294bdbaadc3f941b786603
2021-07-10 10:46:32
INFO:root:Retrieving secret: SECRET_KEY
ccb94e999e294bdbaadc3f941b786603
2021-07-10 10:46:32
INFO:root:Retrieving secret: SECRET_KEY
ccb94e999e294bdbaadc3f941b786603
2021-07-10 10:46:29
[2021-07-10 08:46:29 +0000] [14] [INFO] Booting worker with pid: 14
ccb94e999e294bdbaadc3f941b786603
2021-07-10 10:46:29
[2021-07-10 08:46:29 +0000] [13] [INFO] Booting worker with pid: 13
ccb94e999e294bdbaadc3f941b786603
2021-07-10 10:46:29
[2021-07-10 08:46:29 +0000] [1] [INFO] Listening at: http://0.0.0.0:8000 (1)
ccb94e999e294bdbaadc3f941b786603
2021-07-10 10:46:29
[2021-07-10 08:46:29 +0000] [1] [INFO] Using worker: sync
ccb94e999e294bdbaadc3f941b786603
2021-07-10 10:46:29
[2021-07-10 08:46:29 +0000] [1] [INFO] Starting gunicorn 20.0.4
ccb94e999e294bdbaadc3f941b786603

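So from what I can see, the service starts, and gunicorn (the server for the Python web application) starts and listens on port 8000, which I have mapped, and so on. I have also exposed a /health/ endpoint in my application for a simple, lightweight health check (it just returns a 200). Yet in the service console I keep getting:

3a52c067-63bd-4d58-a092-a69d29380962 2021-07-10 11:46:08 +0200 service web task 7982151a4a904a82b077fc48410dd672 failed container health checks.

What am I doing wrong?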
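
For what it's worth, the timestamps in the log can be lined up against the `healthCheck` settings. The sketch below is only a rough estimate: it assumes that failures during `startPeriod` are not counted and that afterwards `retries` consecutive failures, one per `interval`, mark the container unhealthy. The helper name `seconds_until_unhealthy` is made up for this illustration, and the exact scheduling ECS uses may differ by a few seconds.

```python
from datetime import datetime

# Values copied from the "healthCheck" block of the task definition above.
health_check = {"retries": 3, "timeout": 5, "interval": 30, "startPeriod": 30}

# Timestamps copied from the gunicorn log lines above (UTC).
started = datetime.fromisoformat("2021-07-10 08:46:29")     # "Starting gunicorn"
terminated = datetime.fromisoformat("2021-07-10 08:48:39")  # "Handling signal: term"


def seconds_until_unhealthy(hc):
    """Rough (low, high) bounds in seconds for a check that never passes.

    Assumption, not ECS's documented algorithm: the grace period elapses
    first, then `retries` failed attempts follow, one per `interval`.
    Each attempt may take up to `timeout` seconds before it counts as failed.
    """
    low = hc["startPeriod"] + hc["retries"] * hc["interval"]
    high = low + hc["retries"] * hc["timeout"]
    return low, high


observed = int((terminated - started).total_seconds())
low, high = seconds_until_unhealthy(health_check)
print(f"observed lifetime: {observed}s, estimated window: {low}-{high}s")
```

With the numbers above this reports an observed lifetime of 130s against an estimated window of 120-135s, which would be consistent with the check failing on every attempt rather than only occasionally.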

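【Comments】:

  • If there is no ALB, what is checking your health status?
  • That seems to be something Fargate takes care of.
  • Are you running the task as part of an ECS service?
  • There is a health check in the task definition which, if I understand correctly, only takes effect when the task is part of a service
  • Yes. With Fargate.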
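
As far as I understand it, a container `healthCheck` like the one in the task definition is executed inside the container itself, so no ALB or target group takes part; it does assume that a shell and `curl` exist in the image. To try the same request by hand, here is a minimal standard-library Python sketch of what `curl -f http://localhost:8000/health/ || exit 1` tests. The names `probe` and `exit_code` are hypothetical, and one difference is that `urlopen` follows redirects while `curl` without `-L` does not.

```python
import http.client
import urllib.error
import urllib.request


def probe(url, timeout=5.0):
    """Return True if `url` answers with a non-error status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeouts, malformed replies, and HTTP 4xx/5xx
        # (urlopen raises HTTPError, a URLError subclass, for error statuses).
        return False


def exit_code(url, timeout=5.0):
    """Map the probe result to a shell-style exit code: 0 healthy, 1 unhealthy."""
    return 0 if probe(url, timeout) else 1
```

Calling `exit_code("http://localhost:8000/health/")` from a shell inside the running container shows whether the endpoint is reachable from where the check runs; the default timeout of 5 seconds mirrors the `timeout` in the task definition.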

Tags: amazon-web-services amazon-ecs aws-fargate


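【Solution 1】:

Could you run two health-check-related experiments and write up what you find?

  • Does the health check pass if you increase the timeout?
  • What happens when you increase the CPU or RAM in the task definition you provided?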
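
To make the first experiment concrete, here is a small sketch that produces a relaxed copy of the `healthCheck` block for a new task definition revision. The helper `relaxed` and the override values are hypothetical, chosen only for the experiment; the ranges in `LIMITS` are what I understand the ECS API to accept, so double-check them against the API reference.

```python
import copy

# healthCheck block copied from the task definition in the question.
current = {
    "retries": 3,
    "command": ["CMD-SHELL", "curl -f http://localhost:8000/health/ || exit 1"],
    "timeout": 5,
    "interval": 30,
    "startPeriod": 30,
}

# Accepted ranges per field (seconds, except retries which is a count).
# Assumption: taken from the ECS HealthCheck API reference as I recall it.
LIMITS = {
    "timeout": (2, 60),
    "interval": (5, 300),
    "retries": (1, 10),
    "startPeriod": (0, 300),
}


def relaxed(health_check, **overrides):
    """Return a copy of `health_check` with overrides applied, rejecting out-of-range values."""
    updated = copy.deepcopy(health_check)
    for field, value in overrides.items():
        low, high = LIMITS[field]
        if not low <= value <= high:
            raise ValueError(f"{field}={value} is outside {low}..{high}")
        updated[field] = value
    return updated


# Hypothetical experiment values, not recommendations.
experiment = relaxed(current, timeout=30, startPeriod=120)
print(experiment)
```

The second experiment is a separate change to the task-level `"cpu": "512"` and `"memory": "1024"` fields; Fargate only accepts certain CPU and memory combinations, so both may need to change together.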

【Comments】:
