【发布时间】:2016-11-04 10:57:35
【问题描述】:
我正在尝试使用最新的 EMR 在 AWS 上加载一个包含 1TB 数据的数据库。而且运行时间很长,甚至 6 个小时都没有完成,但是在运行 6h30m 之后,我收到一些错误消息,宣布 Container 在 lost 节点上释放,然后作业失败。日志是这样的:
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144178.0 in stage 0.0 (TID 144178, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144181.0 in stage 0.0 (TID 144181, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144175.0 in stage 0.0 (TID 144175, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144213.0 in stage 0.0 (TID 144213, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, ip-10-0-2-176.ec2.internal, 43922)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 5 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 6 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 5 has been removed (new total is 41)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144138.0 in stage 0.0 (TID 144138, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144185.0 in stage 0.0 (TID 144185, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144184.0 in stage 0.0 (TID 144184, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144186.0 in stage 0.0 (TID 144186, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 0)
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(6, ip-10-0-2-173.ec2.internal, 43593)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 30 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144162.0 in stage 0.0 (TID 144162, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 6 has been removed (new total is 40)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144156.0 in stage 0.0 (TID 144156, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144170.0 in stage 0.0 (TID 144170, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144169.0 in stage 0.0 (TID 144169, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 30 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000024 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
我很确定我的网络设置可以正常工作,因为我曾尝试在同一环境中的一个小得多的表上运行此脚本。
另外,我知道有人在 6 个月前发布了一个问题,询问相同的问题:spark-job-error-yarnallocator-exit-status-100-diagnostics-container-released,但我仍然需要问,因为没有人回答这个问题。
【问题讨论】:
-
我遇到了同样的问题。没有答案:(
-
@clay 只是我的猜测。当价格高于您的价格时,Spot 实例将被收回,然后该节点将丢失。因此,如果您正在从事长期工作,请不要使用现场实例。我找到了一种方法将我的数据集拆分为许多小任务,每个任务只运行 5 分钟,并在 s3 上保存一个 reduce 结果,毕竟,从 s3 读取结果并进行另一个 reduce,这样我就可以避免 long运行作业。
-
我也遇到了这个问题:/
-
这里有类似的问题(但是有很大的自加入)。现在已经打了一段时间了。资源管理器上的日志只是说容器丢失了。没有迹象表明为什么。内存可能是个问题。
-
可以分享节点的日志吗?
标签: apache-spark hadoop-yarn emr