【Question title】: Marathon application deployment gets stuck in Waiting status
【Posted】: 2017-09-04 09:36:20
【Question】:

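I have a 3-node setup running Marathon, mesos-master, mesos-slave, and ZooKeeper in an HA configuration. I tested deploying a simple hello application with mesos-execute, and it worked as expected.

Everything looked fine at that point, so I connected to Marathon and deployed a simple application to test it: (echo "hello" >> /tmp/output.txt). The application got stuck in the "Waiting" state.

What could be preventing Marathon from using the Mesos resources for the deployment?

Logs from mesos-master: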

I0904 11:23:27.064332 19769 master.cpp:2813] Received SUBSCRIBE call for framework 'marathon' at scheduler-0340362b-0bb6-4fb8-8501-118d976e2cbd@192.168.40.156:36324
I0904 11:23:27.064623 19769 master.cpp:2890] Subscribing framework marathon with checkpointing enabled and capabilities [ PARTITION_AWARE ]
I0904 11:23:27.064669 19769 master.cpp:6272] Updating info for framework cb16118a-2257-4020-a907-63aa6294e11b-0000
I0904 11:23:27.064697 19769 master.cpp:2994] Framework cb16118a-2257-4020-a907-63aa6294e11b-0000 (marathon) at scheduler-0340362b-0bb6-4fb8-8501-118d976e2cbd@192.168.40.156:36324 failed over
I0904 11:23:27.065032 19770 hierarchical.cpp:342] Activated framework cb16118a-2257-4020-a907-63aa6294e11b-0000
I0904 11:23:27.065465 19770 master.cpp:7305] Sending 3 offers to framework cb16118a-2257-4020-a907-63aa6294e11b-0000 (marathon) at scheduler-0340362b-0bb6-4fb8-8501-118d976e2cbd@192.168.40.156:36324
I0904 11:23:27.907865 19769 http.cpp:1115] HTTP GET for /files/read?_=1504517007920&jsonp=jQuery17109098185077823333_1504516979864&length=50000&offset=352538&path=%2Fmaster%2Flog from 192.168.40.1:53525 with User-Agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
I0904 11:23:28.916651 19768 http.cpp:1115] HTTP GET for /files/read?_=1504517008930&jsonp=jQuery17109098185077823333_1504516979865&length=50000&offset=353797&path=%2Fmaster%2Flog from 192.168.40.1:53525 with User-Agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
E0904 11:23:30.071293 19775 process.cpp:2450] Failed to shutdown socket with fd 39, address 192.168.40.159:58072: Transport endpoint is not connected
I0904 11:23:30.073277 19768 master.cpp:1430] Framework cb16118a-2257-4020-a907-63aa6294e11b-0000 (marathon) at scheduler-0340362b-0bb6-4fb8-8501-118d976e2cbd@192.168.40.156:36324 disconnected
I0904 11:23:30.073307 19768 master.cpp:3160] Deactivating framework cb16118a-2257-4020-a907-63aa6294e11b-0000 (marathon) at scheduler-0340362b-0bb6-4fb8-8501-118d976e2cbd@192.168.40.156:36324
I0904 11:23:30.073485 19768 master.cpp:3137] Disconnecting framework cb16118a-2257-4020-a907-63aa6294e11b-0000 (marathon) at scheduler-0340362b-0bb6-4fb8-8501-118d976e2cbd@192.168.40.156:36324
I0904 11:23:30.073496 19768 master.cpp:1445] Giving framework cb16118a-2257-4020-a907-63aa6294e11b-0000 (marathon) at scheduler-0340362b-0bb6-4fb8-8501-118d976e2cbd@192.168.40.156:36324 1weeks to failover
I0904 11:23:30.073519 19768 hierarchical.cpp:374] Deactivated framework cb16118a-2257-4020-a907-63aa6294e11b-0000

curl -XGET 'http://mesosphere2:8098/v2/queue?pretty' | jq

{
  "queue": [
    {
      "count": 1,
      "delay": {
        "timeLeftSeconds": 0,
        "overdue": true
      },
      "since": "2017-09-04T13:12:42.024Z",
      "processedOffersSummary": {
        "processedOffersCount": 12,
        "unusedOffersCount": 12,
        "lastUnusedOfferAt": "2017-09-04T13:14:52.554Z",
        "rejectSummaryLastOffers": [
          {
            "reason": "UnfulfilledRole",
            "declined": 3,
            "processed": 3
          },
          {
            "reason": "UnfulfilledConstraint",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "NoCorrespondingReservationFound",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientCpus",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientMemory",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientDisk",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientGpus",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientPorts",
            "declined": 0,
            "processed": 0
          }
        ],
        "rejectSummaryLaunchAttempt": [
          {
            "reason": "UnfulfilledRole",
            "declined": 12,
            "processed": 12
          },
          {
            "reason": "UnfulfilledConstraint",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "NoCorrespondingReservationFound",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientCpus",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientMemory",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientDisk",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientGpus",
            "declined": 0,
            "processed": 0
          },
          {
            "reason": "InsufficientPorts",
            "declined": 0,
            "processed": 0
          }
        ]
      },
      "app": {
        "id": "/test03",
        "acceptedResourceRoles": [
          "slave_public"
        ],
        "backoffFactor": 1.15,
        "backoffSeconds": 1,
        "container": {
          "type": "DOCKER",
          "docker": {
            "forcePullImage": false,
            "image": "laghao/hello-marathon",
            "network": "BRIDGE",
            "parameters": [],
            "portMappings": [
              {
                "containerPort": 80,
                "hostPort": 80,
                "labels": {},
                "protocol": "tcp",
                "servicePort": 10003
              }
            ],
            "privileged": false
          },
          "volumes": []
        },
        "cpus": 0.1,
        "disk": 0,
        "executor": "",
        "instances": 1,
        "labels": {},
        "maxLaunchDelaySeconds": 3600,
        "mem": 64,
        "gpus": 0,
        "portDefinitions": [
          {
            "port": 10003,
            "name": "default",
            "protocol": "tcp"
          }
        ],
        "requirePorts": false,
        "upgradeStrategy": {
          "maximumOverCapacity": 1,
          "minimumHealthCapacity": 1
        },
        "version": "2017-09-04T13:12:41.993Z",
        "versionInfo": {
          "lastScalingAt": "2017-09-04T13:12:41.993Z",
          "lastConfigChangeAt": "2017-09-04T13:12:41.993Z"
        },
        "killSelection": "YOUNGEST_FIRST",
        "unreachableStrategy": {
          "inactiveAfterSeconds": 300,
          "expungeAfterSeconds": 600
        }
      }
    }
  ]
}

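【Comments】:

  • Can you show the Marathon logs? Waiting means that no offered resources satisfy the application's requirements. In recent Marathon versions (1.4+), you can use the /v2/queue endpoint to debug which resources are missing for a given deployment.

Tags: apache-zookeeper mesos marathon mesosphere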


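【Answer 1】:

From the documentation:

An app stays in "Waiting" forever: this means that Marathon does not receive "resource offers" from Mesos that allow it to start tasks of this application. The simplest failure is that there are not sufficient resources available in the cluster, or that another framework hoards all of them. You can check the Mesos UI for available resources. Note that the required resources (such as CPU, memory, disk) must all be available on a single host.

If you do not find the solution yourself and you create a GitHub issue, please attach the output of the Mesos /state endpoint to the bug report so that we can check the available cluster resources.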
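
The single-host requirement quoted above can be illustrated with a small sketch. It is only an approximation of the idea, not Marathon's actual offer matcher: the function name `hosts_that_fit` and the per-agent figures are hypothetical, while the required cpus/mem/disk values are taken from the /test03 app definition in the question.

```python
from typing import Dict, List

Resources = Dict[str, float]


def hosts_that_fit(required: Resources, agents: Dict[str, Resources]) -> List[str]:
    """Return the agents that can satisfy every required resource on their own."""
    return [
        name
        for name, free in agents.items()
        if all(free.get(kind, 0.0) >= amount for kind, amount in required.items())
    ]


# cpus/mem/disk as requested by the /test03 app in the question.
REQUIRED = {"cpus": 0.1, "mem": 64, "disk": 0}

# Made-up free resources per agent, for illustration only.
AGENTS = {
    "agent-a": {"cpus": 4.0, "mem": 32, "disk": 1000},     # too little memory
    "agent-b": {"cpus": 0.05, "mem": 4096, "disk": 1000},  # too little CPU
}

if __name__ == "__main__":
    # The cluster as a whole has enough CPU and memory, but no single host does.
    print(hosts_that_fit(REQUIRED, AGENTS))
```

With these made-up figures the result is an empty list, which is the situation the documentation warns about: summed over the cluster the resources exist, but no single host offers all of them.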

In your case, there is a mismatch between the role the application requires and the roles of the agents. You can deduce this from UnfulfilledRole.
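
A rough sketch of that role check, under stated assumptions: the helper `usable_roles` is hypothetical and only mimics the idea of intersecting the app's acceptedResourceRoles with the roles present in an offer. It also assumes that an app without acceptedResourceRoles falls back to unreserved ("*") resources, which depends on how Marathon is configured, and that the agents here offer only unreserved resources.

```python
from typing import Iterable, Optional, Set

# Assumption: with no custom role configured for Marathon, an app that omits
# acceptedResourceRoles only uses unreserved ("*") resources.
DEFAULT_ROLES = ("*",)


def usable_roles(accepted: Optional[Iterable[str]],
                 offered: Iterable[str],
                 default: Iterable[str] = DEFAULT_ROLES) -> Set[str]:
    """Return the offered roles that the app is willing to consume."""
    wanted = set(accepted) if accepted else set(default)
    return wanted & set(offered)


if __name__ == "__main__":
    offered = ["*"]  # hypothetical: agents offering only unreserved resources
    # App from the question: it only accepts "slave_public", so nothing matches.
    print(usable_roles(["slave_public"], offered))
    # Same app with acceptedResourceRoles removed: "*" is usable.
    print(usable_roles(None, offered))
```

An empty result corresponds to every offer being declined with UnfulfilledRole, as in the queue output from the question.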

Marathon 1.4 introduced information about stuck deployments. You can query /v2/queue and get statistics on why offers were declined.
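
As a minimal sketch of reading those statistics, the snippet below condenses a /v2/queue response into the reasons that actually declined offers. The function name `declined_reasons` is hypothetical, and the field names are assumed to match the response shown in the question (`processedOffersSummary`, `rejectSummaryLaunchAttempt`, `reason`, `declined`); `SAMPLE` is a trimmed copy of that response.

```python
from typing import Any, Dict


def declined_reasons(queue_response: Dict[str, Any]) -> Dict[str, Dict[str, int]]:
    """Map each queued app id to {reason: declined count}, skipping zero counts."""
    result: Dict[str, Dict[str, int]] = {}
    for item in queue_response.get("queue", []):
        summary = item.get("processedOffersSummary", {})
        reasons = {
            entry["reason"]: entry["declined"]
            for entry in summary.get("rejectSummaryLaunchAttempt", [])
            if entry["declined"] > 0
        }
        app_id = item.get("app", {}).get("id", "<unknown>")
        result[app_id] = reasons
    return result


# Trimmed copy of the /v2/queue output posted in the question.
SAMPLE = {
    "queue": [
        {
            "processedOffersSummary": {
                "processedOffersCount": 12,
                "unusedOffersCount": 12,
                "rejectSummaryLaunchAttempt": [
                    {"reason": "UnfulfilledRole", "declined": 12, "processed": 12},
                    {"reason": "InsufficientCpus", "declined": 0, "processed": 0},
                    {"reason": "InsufficientPorts", "declined": 0, "processed": 0},
                ],
            },
            "app": {"id": "/test03"},
        }
    ]
}

if __name__ == "__main__":
    print(declined_reasons(SAMPLE))
```

For the sample, only UnfulfilledRole remains for /test03, with all 12 processed offers declined for that reason.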

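【Comments】:

  • Well, I had read the thread about the "Waiting" state, but resources are available, since I can deploy directly through Mesos, so the problem lies in the communication between Mesos and Marathon. I also opened a thread in the Marathon group and posted the /v2/queue output there: groups.google.com/forum/#!topic/marathon-framework/r1aKkRXIXAE
  • It looks like the problem is with roles. Can you show your app JSON and agent configuration?
  • You're right. I changed the deployment script again; you can see it in the group. Could you try deploying it and give me feedback?
  • What is the problem? Can you rephrase?
  • I fixed the role issue, which was "acceptedResourceRoles": ["slave_public"]. I removed that line, but the application is still in the "Waiting" state.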