【Question title】:I am getting the error "Restoring from checkpoint failed." while training with the TensorFlow Estimator API on AI Platform (ml-engine)
【Posted】:2019-11-17 06:26:33
【Question】:

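I am trying to run hyperparameter tuning on AI Platform (ml-engine) for a DNN regressor built with the TensorFlow Estimator API. After I submit the job, it is reported as failed, and the job details show the error below.

Can anyone help?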

Hyperparameter Tuning Trial #1 Failed before any other successful trials were completed. The failed trial had parameters: learning_rate=0.0019937718716419557, num-layers=2, first-layer-size=148, scale-factor=0.7910729020312588, .  The trial's error message was: The replica master 0 exited with a non-zero status of 1. 
  Traceback (most recent call last):
    [...]
    File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 507, in _build_internal
      restore_sequentially, reshape)
    File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 385, in _AddShardedRestoreOps
      name="restore_shard"))
    File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 332, in _AddRestoreOps
      restore_sequentially)
    File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 580, in bulk_restore
      return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
    File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1572, in restore_v2
      name=name)
    File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
      op_def=op_def)
    File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
      return func(*args, **kwargs)
    File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
      op_def=op_def)
    File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
      self._traceback = tf_stack.extract_stack()

  InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

  tensor_name = dnn/hiddenlayer_0/bias; shape in shape_and_slice spec [148] does not match the shape stored in checkpoint: [117]
     [[node save/RestoreV2_1 (defined at /usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1403) ]]

【Comments】:

    Tags: python-3.x google-cloud-ml


    【Solution 1】:

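    It looks like you are using the same output directory for all trials, so trial #1 is trying to read a checkpoint written by trial #2 (probably because it is the latest checkpoint in that directory) and fails because the architectures differ. The error message is consistent with this: trial #1 has first-layer-size=148, while the checkpoint stores a first-layer bias of shape [117].

    Make sure each hyperparameter tuning trial writes to its own output directory. There are two ways to do this:

    1. Use --job-dir as the output directory.
    2. Append the hyperparameter trial number to the output directory you are using now: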

      output_dir = os.path.join(output_dir, json.loads(os.environ.get('TF_CONFIG', '{}')).get('task', {}).get('trial', ''))
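
    The one-liner in option 2 can be unpacked into a small helper, which also makes the imports explicit. This is only a sketch: the function name `trial_output_dir` and its `environ` parameter are made up for illustration, and it assumes the trial number is found under `task.trial` in the `TF_CONFIG` environment variable, as the snippet above does. It uses `posixpath.join` rather than `os.path.join` so that `gs://` paths always get forward slashes.

```python
import json
import os
import posixpath


def trial_output_dir(base_dir, environ=None):
    """Return base_dir with the hyperparameter trial ID appended, if one is set.

    Hypothetical helper: assumes TF_CONFIG is a JSON string whose
    task.trial field holds the trial number during hyperparameter tuning.
    """
    environ = os.environ if environ is None else environ
    tf_config = json.loads(environ.get("TF_CONFIG") or "{}")
    trial = tf_config.get("task", {}).get("trial", "")
    # Outside a tuning job there is no trial ID, so keep base_dir unchanged.
    if not trial:
        return base_dir
    return posixpath.join(base_dir, str(trial))
```

    The result would then be passed as the model directory of the estimator (for example via `model_dir`), so that each trial saves and restores only its own checkpoints.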

    【Comments】:
