【Question title】: Training Google Object Detection API grpc error
【Posted】: 2017-08-14 21:56:21
【Question】:

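I am following Google's Object Detection API guide to retrain on my own dataset, but I have run into a series of problems.

One of them is the following: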

"Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 198, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 194, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 290, in train
    saver=saver)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 776, in train
    master, start_standard_services=False, config=session_config) as sess:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 960, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 788, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 386, in join
    six.reraise(*self._exc_info_to_raise)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 949, in managed_session
    start_standard_services=start_standard_services)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 713, in prepare_or_wait_for_session
    max_wait_secs=max_wait_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 387, in wait_for_session
    is_ready, not_ready_msg = self._model_ready(sess)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 435, in _model_ready
    return _ready(self._ready_op, sess, "Model not ready")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 492, in _ready
    ready_value = sess.run(op)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: {"created":"@1502405189.800982817","description":"EOF","file":"external/grpc/src/core/lib/iomgr/tcp_posix.c","file_line":235,"grpc_status":14}
"    
  pathname:  "/var/sitecustomize/sitecustomize.py"    
 }

I'm not really sure what grpc is, so I'm completely stuck on this error. Any help would be great. Thanks!

【Comments】:

Tags: tensorflow object-detection google-cloud-ml-engine


【Solution 1】:

This is likely an out-of-memory error (see this question).
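
For context on the error itself, the payload at the end of the traceback carries "grpc_status":14, which is the standard gRPC code UNAVAILABLE, and the "EOF" description suggests the connection to the remote process was closed. That would be consistent with a process being killed for running out of memory, though it does not prove it. A minimal sketch for decoding such a line, assuming a hand-written lookup table that covers only a few of the standard codes (the helper name `describe_grpc_error` is made up for illustration):

```python
import json

# A small subset of the standard gRPC status codes (numeric value -> name).
GRPC_STATUS_NAMES = {
    0: "OK",
    4: "DEADLINE_EXCEEDED",
    8: "RESOURCE_EXHAUSTED",
    13: "INTERNAL",
    14: "UNAVAILABLE",
}


def describe_grpc_error(message):
    """Parse the JSON payload in an error line and name its gRPC status.

    Hypothetical helper: it assumes the payload starts at the first "{"
    and runs to the end of the line, as in the traceback above.
    """
    payload = json.loads(message[message.index("{"):])
    code = payload.get("grpc_status")
    name = GRPC_STATUS_NAMES.get(code, "NOT_IN_THIS_TABLE")
    return code, name, payload.get("description")


# The last line of the traceback from the question.
error_line = (
    'UnavailableError: {"created":"@1502405189.800982817",'
    '"description":"EOF",'
    '"file":"external/grpc/src/core/lib/iomgr/tcp_posix.c",'
    '"file_line":235,"grpc_status":14}'
)

print(describe_grpc_error(error_line))
```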

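You can try a larger machine type, especially for the master, e.g. large_model, complex_model_l, or complex_model_l_gpu. To do so, pass a file to gcloud via the --config argument, with contents similar to the following: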

trainingInput:
  runtimeVersion: "1.0"
  scaleTier: CUSTOM
  masterType: complex_model_l_gpu
  workerCount: 9
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
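
As a rough sketch of how the file above could be put to use, the following writes the same configuration to disk and assembles a `gcloud ml-engine jobs submit training` command that points at it. The job name, bucket path, and file location are placeholders, not values from the question, and a real submission would need further flags (packages, region, trainer arguments) that are omitted here. The script only prints the command and does not run it.

```python
import os
import shlex
import tempfile

# Same settings as the config shown above.
CONFIG_YAML = """\
trainingInput:
  runtimeVersion: "1.0"
  scaleTier: CUSTOM
  masterType: complex_model_l_gpu
  workerCount: 9
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
"""


def write_config(path):
    """Write the cluster configuration to `path` and return the path."""
    with open(path, "w") as f:
        f.write(CONFIG_YAML)
    return path


def build_submit_command(job_name, job_dir, config_path):
    """Return the gcloud invocation as an argument list (not executed)."""
    return [
        "gcloud", "ml-engine", "jobs", "submit", "training", job_name,
        "--job-dir", job_dir,
        "--module-name", "object_detection.train",
        "--config", config_path,
    ]


# Placeholder values for illustration only.
config_path = write_config(os.path.join(tempfile.gettempdir(), "config.yaml"))
cmd = build_submit_command(
    "object_detection_job", "gs://YOUR_BUCKET/train", config_path
)

# Print a copy-pasteable command line.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the command as a list keeps each argument separate, so it could later be passed to `subprocess.run` without shell quoting issues.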

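【Discussion】:

  • I am using the following --config file: trainingInput: runtimeVersion: "1.0" scaleTier: CUSTOM masterType: standard_gpu workerCount: 5 workerType: standard_gpu parameterServerCount: 3 parameterServerType: standard. Also, my TensorFlow is not the GPU build of TensorFlow. Could that be a problem as well?
  • The problem is that your masterType: standard has too little RAM. When you submit a job with GPU workers, it runs on machines with GPU-enabled TensorFlow, which should speed up training with the Object Detection API. You are welcome to try without GPUs, but I don't think that is the issue.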