【Question title】: Job failed on Cloud ML after successful completion of 1000 steps
【Posted】: 2017-10-25 11:33:39
【Question description】:

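I went through this Cloud ML tutorial on census data: cloud.google.com/ml-engine/docs/how-tos/getting-started-training-prediction, and the job there succeeded. However, when I go through this tutorial on flower image data: https://cloud.google.com/blog/big-data/2016/12/how-to-classify-images-with-tensorflow-using-google-cloud-machine-learning-and-cloud-dataflow, my training job appears to succeed, since the logs show that it completes 1000 steps. Once those steps finish, though, the StackDriver logs (snapshot) report that the job failed. I tried replacing the command-line arguments with ones structured like those in the census data walkthrough, deleting and re-creating the JOB_ID and the --output_path user argument, and using the STANDARD_1 scale tier, all to no avail. Any help from the community would be greatly appreciated. Thank you!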

Below is the error that shows up at the tail of the log snapshot:

{
 textPayload: "The replica master 0 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 542, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 305, in main
    run(model, argv)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 436, in run
    dispatch(args, model, cluster, task)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 477, in dispatch
    Trainer(args, model, cluster, task).run_training()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 241, in run_training
    self.eval(session)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 283, in eval
    self.model.format_metric_values(self.evaluator.evaluate()))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 95, in evaluate
    return metric_values
  File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 960, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 788, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 386, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/queue_runner_impl.py", line 234, in _run
    sess.run(enqueue_op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
NotFoundError: Error executing an HTTP request (HTTP response code 404, error code 0, error message '')
     when reading gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval
     [[Node: ReaderReadUpToV2 = ReaderReadUpToV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input_producer, ReaderReadUpToV2/num_records)]]
Caused by op u'ReaderReadUpToV2', defined at:
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 542, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 305, in main
    run(model, argv)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 436, in run
    dispatch(args, model, cluster, task)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 477, in dispatch
    Trainer(args, model, cluster, task).run_training()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 241, in run_training
    self.eval(session)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 283, in eval
    self.model.format_metric_values(self.evaluator.evaluate()))
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 57, in evaluate
    self.eval_batch_size)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 310, in build_eval_graph
    return self.build_graph(data_paths, batch_size, GraphMod.EVALUATE)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 231, in build_graph
    num_epochs=None if is_training else 2)
  File "/root/.local/lib/python2.7/site-packages/trainer/util.py", line 52, in read_examples
    filename_queue, batch_size)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/io_ops.py", line 226, in read_up_to
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 380, in _reader_read_up_to_v2
    num_records=num_records, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2327, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1226, in __init__
    self._traceback = _extract_stack()
NotFoundError (see above for traceback): Error executing an HTTP request (HTTP response code 404, error code 0, error message '')
     when reading gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval
     [[Node: ReaderReadUpToV2 = ReaderReadUpToV2[_device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReaderV2, input_producer, ReaderReadUpToV2/num_records)]]
To find out more about why your job exited please check the logs: console.cloud.google.com/logs/viewer?project=123456234&resource=ml_job%2Fjob_id%2Fflowers_User_20170524_145125&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22flowers_User_20170524_145125%22"
}

【Question discussion】:

    Tags: tensorflow google-app-engine google-cloud-platform tensorflow-serving google-cloud-ml


    【Solution 1】:

    The error indicates that a 404 (Not Found) was returned when trying to read

    gs://project-166422-ml/User/flowers_User_20170522_121407/preproc/eval
    

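    Does that file exist?

    Judging by the name, I guess this is the evaluation data. So my guess is that you run evaluation only once every 1000 steps, which is why the 1000 steps completed successfully. The job then tried to run evaluation and failed because the data does not exist.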
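
One way to catch this before spending 1000 training steps is a pre-flight check that every input pattern resolves to at least one readable file. The following is only a minimal sketch: the helper name `find_unreadable_inputs`, the file name `train-00000-of-00001.tfrecord.gz`, and the local temporary directory standing in for the GCS bucket are all hypothetical and are not part of the tutorial's trainer code.

```python
import glob
import os
import tempfile


def find_unreadable_inputs(patterns, glob_fn=glob.glob, is_dir_fn=os.path.isdir):
    """Return the patterns that match no regular file (hypothetical helper)."""
    unreadable = []
    for pattern in patterns:
        # A match that is only a directory cannot be read as a record file.
        files = [m for m in glob_fn(pattern) if not is_dir_fn(m)]
        if not files:
            unreadable.append(pattern)
    return unreadable


# Local stand-in for the bucket: training data exists as a file, while
# "eval" exists only as a directory with no matching record files.
workdir = tempfile.mkdtemp()
open(os.path.join(workdir, "train-00000-of-00001.tfrecord.gz"), "w").close()
os.mkdir(os.path.join(workdir, "eval"))

train_pattern = os.path.join(workdir, "train*")
eval_pattern = os.path.join(workdir, "eval*")
unreadable = find_unreadable_inputs([train_pattern, eval_pattern])
print(unreadable)  # only the eval pattern is reported
```

With TensorFlow 1.x, as used in the traceback above, it should be possible to pass `tf.gfile.Glob` and `tf.gfile.IsDirectory` as `glob_fn` and `is_dir_fn` so that `gs://` patterns are resolved. That pairing is an assumption and has not been verified against this job.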

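    【Discussion】:

    • Thanks for the insight. I looked at my folder structure and found that I had appended a "/" to this user argument: --output_path "${GCS_PATH}/preproc/eval", which caused the error. It is solved now. For anyone who hits this kind of error, do not do this: '--output_path "${GCS_PATH}/preproc/eval/"' like I did; it broke things.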
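
The trailing-slash mistake described in the comment above can be guarded against by normalizing the value before it is passed as `--output_path`. This is a rough sketch under the assumption that the value is used as a file-name prefix rather than a directory; the helper `as_file_prefix` and the bucket path `gs://my-bucket/user/flowers_job` are hypothetical.

```python
def as_file_prefix(path):
    """Strip trailing slashes so the value stays a file-name prefix (hypothetical helper)."""
    cleaned = path.rstrip("/")
    if not cleaned:
        raise ValueError("path is empty after removing trailing slashes")
    return cleaned


gcs_path = "gs://my-bucket/user/flowers_job"  # hypothetical bucket path

good = as_file_prefix(gcs_path + "/preproc/eval")    # already fine, unchanged
fixed = as_file_prefix(gcs_path + "/preproc/eval/")  # trailing "/" removed
print(good)
print(fixed)
```

Both calls yield the same prefix, so the preprocessing output and the evaluation input pattern would refer to the same location.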