【Question title】: How to troubleshoot TensorFlow error "Restoring from checkpoint failed."
【Posted】: 2020-06-26 03:08:27
【Question description】:

I am new to TensorFlow and have been working with a trained model from a Git repository. The pretrained model is saved as "../model/snapshot-38". That directory contains snapshot-38.index, snapshot-38.meta, snapshot-38.data-00000-of-00001 and the checkpoint file. My Python scripts and data are in "../src", and my code does not save the model to any location other than these.

def save(self):
    "save model to file"
    self.snapID += 1
    self.saver.save(self.sess, '../model/snapshot', global_step=self.snapID)

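I am using Python 3.6 and TensorFlow 1.12.2.

I backed up these files and tried to retrain with a different dataset to produce new model output, but aborted partway through.

I then restored my pretrained model files from the backup as before, but since then, whenever I try to retrain or restore the model, I get the error "Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:".

Are there temporary files I need to delete? I wonder whether TensorFlow is doing something I am not aware of, and none of the solutions in similar threads answered my question. The detailed stack trace is below.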

 as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\framework\dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\framework\dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\framework\dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\framework\dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\framework\dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
Validation character error rate of saved model: 10.624916%
Python: 3.6.10 |Anaconda, Inc.| (default, May  7 2020, 19:46:08) [MSC v.1916 64 bit (AMD64)]
Tensorflow: 1.12.0
2020-06-26 00:53:20.161185: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
model DIR ---- ../model/
model latestSnapshot ---- ../model/snapshot-38
Init with stored values from ../model/snapshot-38
Traceback (most recent call last):
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\client\session.py", line 1334, in _do_call
    return fn(*args)
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\client\session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\client\session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [1,1,512,71] rhs shape= [1,1,512,80]
         [[{{node save/Assign_15}} = Assign[T=DT_FLOAT, _class=["loc:@Variable_5"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Variable_5, save/RestoreV2:15)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\training\saver.py", line 1546, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\client\session.py", line 929, in run
    run_metadata_ptr)
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\client\session.py", line 1328, in _do_run
    run_metadata)
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\client\session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape= [1,1,512,71] rhs shape= [1,1,512,80]
         [[node save/Assign_15 (defined at P:\Desktop\COSC428_ComputerVision\SimpleHTR-master\SimpleHTR-master\src\Model.py:141)  = Assign[T=DT_FLOAT, _class=["loc:@Variable_5"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Variable_5, save/RestoreV2:15)]]

Caused by op 'save/Assign_15', defined at:
  File "main.py", line 145, in <module>
    main()
  File "main.py", line 140, in main
    model = Model(open(FilePaths.fnCharList).read(), decoderType, mustRestore=True, dump=args.dump)
  File "P:\Desktop\COSC428_ComputerVision\SimpleHTR-master\SimpleHTR-master\src\Model.py", line 53, in __init__
    (self.sess, self.saver) = self.setupTF()
  File "P:\Desktop\COSC428_ComputerVision\SimpleHTR-master\SimpleHTR-master\src\Model.py", line 141, in setupTF
    saver = tf.train.Saver(max_to_keep=1) # saver saves model to file
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\training\saver.py", line 1102, in __init__
    self.build()
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\training\saver.py", line 1114, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\training\saver.py", line 1151, in _build
    build_save=build_save, build_restore=build_restore)
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\training\saver.py", line 795, in _build_internal
    restore_sequentially, reshape)
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\training\saver.py", line 428, in _AddRestoreOps
    assign_ops.append(saveable.restore(saveable_tensors, shapes))
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\training\saver.py", line 119, in restore
    self.op.get_shape().is_fully_defined())
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\ops\state_ops.py", line 221, in assign
    validate_shape=validate_shape)
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\ops\gen_state_ops.py", line 60, in assign
    use_locking=use_locking, name=name)
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\util\deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\framework\ops.py", line 3274, in create_op
    op_def=op_def)
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\framework\ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [1,1,512,71] rhs shape= [1,1,512,80]
         [[node save/Assign_15 (defined at P:\Desktop\COSC428_ComputerVision\SimpleHTR-master\SimpleHTR-master\src\Model.py:141)  = Assign[T=DT_FLOAT, _class=["loc:@Variable_5"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Variable_5, save/RestoreV2:15)]]


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 145, in <module>
    main()
  File "main.py", line 140, in main
    model = Model(open(FilePaths.fnCharList).read(), decoderType, mustRestore=True, dump=args.dump)
  File "P:\Desktop\COSC428_ComputerVision\SimpleHTR-master\SimpleHTR-master\src\Model.py", line 53, in __init__
    (self.sess, self.saver) = self.setupTF()
  File "P:\Desktop\COSC428_ComputerVision\SimpleHTR-master\SimpleHTR-master\src\Model.py", line 153, in setupTF
    saver.restore(sess, latestSnapshot)
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\training\saver.py", line 1582, in restore
    err, "a mismatch between the current graph and the graph")
tensorflow.python.framework.errors_impl.InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Assign requires shapes of both tensors to match. lhs shape= [1,1,512,71] rhs shape= [1,1,512,80]
         [[node save/Assign_15 (defined at P:\Desktop\COSC428_ComputerVision\SimpleHTR-master\SimpleHTR-master\src\Model.py:141)  = Assign[T=DT_FLOAT, _class=["loc:@Variable_5"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Variable_5, save/RestoreV2:15)]]

Caused by op 'save/Assign_15', defined at:
  File "main.py", line 145, in <module>
    main()
  File "main.py", line 140, in main
    model = Model(open(FilePaths.fnCharList).read(), decoderType, mustRestore=True, dump=args.dump)
  File "P:\Desktop\COSC428_ComputerVision\SimpleHTR-master\SimpleHTR-master\src\Model.py", line 53, in __init__
    (self.sess, self.saver) = self.setupTF()
  File "P:\Desktop\COSC428_ComputerVision\SimpleHTR-master\SimpleHTR-master\src\Model.py", line 141, in setupTF
    saver = tf.train.Saver(max_to_keep=1) # saver saves model to file
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\training\saver.py", line 1102, in __init__
    self.build()
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\training\saver.py", line 1114, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\training\saver.py", line 1151, in _build
    build_save=build_save, build_restore=build_restore)
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\training\saver.py", line 795, in _build_internal
    restore_sequentially, reshape)
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\training\saver.py", line 428, in _AddRestoreOps
    assign_ops.append(saveable.restore(saveable_tensors, shapes))
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\training\saver.py", line 119, in restore
    self.op.get_shape().is_fully_defined())
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\ops\state_ops.py", line 221, in assign
    validate_shape=validate_shape)
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\ops\gen_state_ops.py", line 60, in assign
    use_locking=use_locking, name=name)
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\util\deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\framework\ops.py", line 3274, in create_op
    op_def=op_def)
  File "C:\Users\rcs70\.conda\envs\tensorflow_opencv\lib\site-packages\tensorflow\python\framework\ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Assign requires shapes of both tensors to match. lhs shape= [1,1,512,71] rhs shape= [1,1,512,80]
         [[node save/Assign_15 (defined at P:\Desktop\COSC428_ComputerVision\SimpleHTR-master\SimpleHTR-master\src\Model.py:141)  = Assign[T=DT_FLOAT, _class=["loc:@Variable_5"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Variable_5, save/RestoreV2:15)]]

【Discussion】:

    Tags: python deep-learning tensorflow


    【Solution 1】:

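    The error says: Assign requires shapes of both tensors to match. lhs shape= [1,1,512,71] rhs shape= [1,1,512,80]

    This means that one tensor in the snapshot has a different shape from the corresponding tensor in the model: [1,1,512,80] in the snapshot versus [1,1,512,71] in the model.

    So something differs. A snapshot can only be loaded into a model that exactly matches the model it was saved from.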
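
To find out which variables disagree, the shapes stored in the snapshot can be compared with the shapes of the variables in the graph being built. The following is a minimal sketch, not code from the repository: `find_shape_mismatches` and `load_shapes_with_tensorflow` are hypothetical helper names, the `Variable_4` entry in the demo data is made up, and only the `Variable_5` shapes come from the error message. `tf.train.list_variables` and `tf.global_variables` are TensorFlow 1.x calls.

```python
def find_shape_mismatches(checkpoint_shapes, graph_shapes):
    """Return {name: (checkpoint_shape, graph_shape)} for variables whose shapes differ."""
    mismatches = {}
    for name, graph_shape in graph_shapes.items():
        ckpt_shape = checkpoint_shapes.get(name)
        # Variables missing from the checkpoint are skipped; only shape conflicts are reported.
        if ckpt_shape is not None and list(ckpt_shape) != list(graph_shape):
            mismatches[name] = (list(ckpt_shape), list(graph_shape))
    return mismatches


def load_shapes_with_tensorflow(checkpoint_prefix):
    """Collect both shape tables in a TF 1.x program.

    Call this only after the model graph has been built
    (hypothetical helper, not part of the original code).
    """
    import tensorflow as tf  # imported lazily so the comparison helper runs without TF

    # list_variables yields (name, shape) pairs read from the checkpoint files.
    checkpoint_shapes = dict(tf.train.list_variables(checkpoint_prefix))
    graph_shapes = {v.op.name: v.shape.as_list() for v in tf.global_variables()}
    return checkpoint_shapes, graph_shapes


# Demo data: Variable_5 uses the shapes from the error message, Variable_4 is invented.
ckpt = {"Variable_5": [1, 1, 512, 80], "Variable_4": [512]}
graph = {"Variable_5": [1, 1, 512, 71], "Variable_4": [512]}
print(find_shape_mismatches(ckpt, graph))
```

With a real model, the two tables would come from something like `load_shapes_with_tensorflow("../model/snapshot-38")` called just before `saver.restore`.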

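    If I had to guess, I would say this is a multi-class classification model: the model that produced the snapshot was trained on 80 classes, whereas the model being built now classifies 71 classes.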
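
This guess can be checked before restoring by counting the classes that the current configuration implies. The traceback shows the model being built from `open(FilePaths.fnCharList).read()`, so the sketch below counts the characters in that text. Everything else is an assumption: the helper names are hypothetical, and the `extra_labels=1` default assumes the model adds a single extra label (such as a CTC blank) on top of the character list, which the question does not confirm.

```python
def expected_output_size(char_list_text, extra_labels=1):
    """Number of output classes implied by a character list.

    extra_labels covers labels the model adds itself (assumed: one blank label).
    Set it to 0 if the model adds none.
    """
    return len(char_list_text) + extra_labels


def check_against_checkpoint(char_list_text, checkpoint_last_dim, extra_labels=1):
    """Return None when the sizes agree, otherwise a short explanation."""
    expected = expected_output_size(char_list_text, extra_labels)
    if expected == checkpoint_last_dim:
        return None
    return ("character list implies {} classes, checkpoint has {}"
            .format(expected, checkpoint_last_dim))


# Placeholder strings standing in for the contents of the real character list file.
current_chars = "x" * 70    # would build a 71-wide output layer under the assumption above
original_chars = "x" * 79   # would build an 80-wide output layer

print(check_against_checkpoint(current_chars, 80))
print(check_against_checkpoint(original_chars, 80))
```

If the check reports a difference, the character list (or whichever configuration file defines the classes) no longer matches the one used when the snapshot was trained.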

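    【Comments】:

    • Thanks for the reply. I am fairly sure I did not modify the backed-up model snapshot, yet it still complains when I reload it, and I have not made any other changes to the code structure. Do you know whether TensorFlow caches the model snapshot somewhere else?
    • I am not saying you modified the snapshot, but rather the model you are loading the snapshot into.
    • You said: "I backed up these files and tried to retrain with a different dataset to produce new model output, but aborted partway through." Is it possible that the number of labels is different in the new data?
    • I did abort the retraining partway through because I realized the dataset was incompatible. But as I understand it, that should not affect the original snapshot I backed up? I deleted the files from that run (snapshot-38.index, snapshot-38.meta, snapshot-38.data-00000-of-00001 and the checkpoint file) and restored the original snapshot from the backup. Sorry, I am new to TensorFlow and deep learning.
    • Found the cause. As you mentioned, there was a problem with the configuration. One of the 80 classes was missing from the configuration file, so there were only 79 classes, while the model had been trained with 80. Thanks for your time.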