保存和加载自定义 Tensorflow 模型（自回归 seq2seq 多元时间序列 GRU/RNN）答案

【问题标题】：Save and Load Custom Tensorflow Model (Autoregressive seq2seq multivariate time series GRU/RNN)保存和加载自定义 Tensorflow 模型（自回归 seq2seq 多元时间序列 GRU/RNN）
【发布时间】：2021-05-12 22:55:13
【问题描述】：

我正在尝试实现一个自回归 seq-2-seq RNN 来预测时间序列数据，as shown in this TensorFlow tutorial。该模型由一个自定义模型类组成，继承自tf.keras.Model，其代码可以在下面找到。我已将此模型用于时间序列预测，输入数据为 (15, 108) 数据集（维度：（序列长度，输入单位）），输出数据为 (10, 108) 数据集。

虽然训练成功，但我未能成功保存并重新加载模型以在测试集上评估先前训练的模型。我曾尝试在互联网上寻找解决方案，但没有一个到目前为止似乎有效。这可能是因为它是使用 Eager Execution 训练的自定义模型，因为多个线程无法解决在这些条件下保存模型的问题。

谁能给我一些关于如何解决这个问题的提示。非常感谢任何帮助，谢谢！

到目前为止，我已经使用tf.keras.models.load_model(filepath) 加载了模型，并尝试了以下保存选项。这两个选项的代码可以在下面找到：

使用keras.callbacks.ModelCheckpoint 函数保存。但是，只返回了一个 .ckpt.data-00000-of-00001 和一个 .ckpt.index 文件（因此没有 .meta 或 .pb 文件），我无法打开它们
使用tf.saved_model.save 函数保存并加载导致以下错误的模型：


    WARNING:tensorflow:Looks like there is an object (perhaps variable or layer) that is shared between different layers/models. This may cause issues when restoring the variable values. Object: <tensorflow.python.keras.layers.recurrent_v2.GRUCell object at 0x7fac1c052eb8>
    WARNING:tensorflow:Inconsistent references when loading the checkpoint into this object graph. Either the Trackable object references in the Python program have changed in an incompatible way, or the checkpoint was generated in an incompatible program.
    
    Two checkpoint references resolved to different objects (<tensorflow.python.keras.layers.recurrent_v2.GRUCell object at 0x7fac20648048> and <tensorflow.python.keras.layers.recurrent_v2.GRUCell object at 0x7fac1c052eb8>).
    ---------------------------------------------------------------------------
    AssertionError                            Traceback (most recent call last)
    <ipython-input-7-ac3fac428428> in <module>()
          1 model = '/content/drive/My Drive/Colab Notebooks/Master thesis/NN_data/saved_model/s-20210208-194847'
    ----> 2 new_model = tf.keras.models.load_model(model)
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/saving/save.py in load_model(filepath, custom_objects, compile, options)
        210       if isinstance(filepath, six.string_types):
        211         loader_impl.parse_saved_model(filepath)
    --> 212         return saved_model_load.load(filepath, compile, options)
        213 
        214   raise IOError(
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/saving/saved_model/load.py in load(path, compile, options)
        142   for node_id, loaded_node in keras_loader.loaded_nodes.items():
        143     nodes_to_load[keras_loader.get_path(node_id)] = loaded_node
    --> 144   loaded = tf_load.load_partial(path, nodes_to_load, options=options)
        145 
        146   # Finalize the loaded layers and remove the extra tracked dependencies.
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/load.py in load_partial(export_dir, filters, tags, options)
        763     A dictionary mapping node paths from the filter to loaded objects.
        764   """
    --> 765   return load_internal(export_dir, tags, options, filters=filters)
        766 
        767 
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/load.py in load_internal(export_dir, tags, options, loader_cls, filters)
        888       try:
        889         loader = loader_cls(object_graph_proto, saved_model_proto, export_dir,
    --> 890                             ckpt_options, filters)
        891       except errors.NotFoundError as err:
        892         raise FileNotFoundError(
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/load.py in __init__(self, object_graph_proto, saved_model_proto, export_dir, ckpt_options, filters)
        159 
        160     self._load_all()
    --> 161     self._restore_checkpoint()
        162 
        163     for node in self._nodes:
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/load.py in _restore_checkpoint(self)
        486     else:
        487       load_status = saver.restore(variables_path, self._checkpoint_options)
    --> 488     load_status.assert_existing_objects_matched()
        489     checkpoint = load_status._checkpoint
        490 
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/util.py in assert_existing_objects_matched(self)
        806           ("Some Python objects were not bound to checkpointed values, likely "
        807            "due to changes in the Python program: %s") %
    --> 808           (list(unused_python_objects),))
        809     return self
        810 
    
    AssertionError: Some Python objects were not bound to checkpointed values, likely due to changes in the Python program: [<tf.Variable 'gru_cell_2/bias:0' shape=(2, 648) dtype=float32, numpy=
    array([[0., 0., 0., ..., 0., 0., 0.],
           [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)>, <tf.Variable 'gru_cell_2/kernel:0' shape=(108, 648) dtype=float32, numpy=
    array([[ 0.01252341, -0.08176371, -0.00800528, ...,  0.00473534,
            -0.05456369,  0.00294461],
           [-0.02453795,  0.018851  ,  0.07198527, ...,  0.05603079,
            -0.01973856,  0.06883802],
           [-0.06897871, -0.05892187,  0.08031332, ...,  0.07844239,
            -0.06783205, -0.04394536],
           ...,
           [ 0.02367028,  0.07758808, -0.04011653, ..., -0.04074041,
            -0.00352754, -0.03324065],
           [ 0.08708382, -0.0113907 , -0.08592559, ..., -0.07780273,
            -0.07923603,  0.0435034 ],
           [-0.04890796,  0.03626117,  0.01753877, ..., -0.06336015,
            -0.07234246, -0.05076948]], dtype=float32)>, <tf.Variable 'gru_cell_2/recurrent_kernel:0' shape=(216, 648) dtype=float32, numpy=
    array([[ 0.03453588,  0.01778516, -0.0326081 , ..., -0.02686813,
             0.05017178,  0.01470701],
           [ 0.05364531, -0.02074206, -0.06292176, ..., -0.04883411,
            -0.03006711,  0.03091787],
           [ 0.03928262,  0.01209829,  0.01992464, ..., -0.01726807,
            -0.04125096,  0.00977487],
           ...,
           [ 0.03076804,  0.00477963, -0.03565286, ..., -0.00938745,
            -0.06442262, -0.0124091 ],
           [ 0.03680094, -0.04894238,  0.01765203, ..., -0.11990541,
            -0.01906408,  0.10198548],
           [ 0.00818893, -0.03801145,  0.10376499, ..., -0.01700275,
            -0.02600842, -0.0169891 ]], dtype=float32)>]

用于（成功）训练和保存模型的缩短代码：


    model = FeedBack(units=neurons, out_steps=output_len, num_features=108, act_dense=output_activation)
      
    model.compile(loss=loss,optimizer=tf.optimizers.Adam(lr=lr), metrics=['mean_absolute_error', 'mean_absolute_percentage_error', keras.metrics.RootMeanSquaredError()])
    
    cp_callback = keras.callbacks.ModelCheckpoint(filepath=checkpoint_path, save_best_only=True, verbose=0)
    earlyStopping = keras.callbacks.EarlyStopping(monitor='val_loss', patience=6, verbose=0,  min_delta=1e-9, mode='auto')
    
    # OPTION 1: USE ModelCheckpoint
    r = model.fit(x=train_x, y=train_y, batch_size=32, shuffle=False, epochs=1,validation_data = (test_x, test_y), callbacks=[earlyStopping, cp_callback], verbose=0)
        
    # OPTION 2: USE tf.saved_model.save()
    !mkdir -p saved_model
    model.save('/content/drive/My Drive/Colab Notebooks/Master thesis/NN_data/saved_model/s-%s' % timestring)
    tf.saved_model.save(model, '/content/drive/My Drive/Colab Notebooks/Master thesis/NN_data/saved_model/s-%s' % timestring)

这是构建模型时使用的代码：


    class FeedBack(tf.keras.Model):
        def __init__(self, units, out_steps, num_features, act_dense):
            super().__init__()
            self.out_steps = out_steps
            self.units = units
            self.num_features = num_features
            self.act_dense = act_dense
            self.gru_cell = tf.keras.layers.GRUCell(units)
            # Also wrap the LSTMCell in an RNN to simplify the `warmup` method.
            self.gru_rnn = tf.keras.layers.RNN(self.gru_cell, return_state=True)
            self.dense = tf.keras.layers.Dense(num_features, activation=act_dense) #self.num_features?
    
        def warmup(self, inputs):
            # inputs.shape => (batch, time, features)
            # x.shape => (batch, lstm_units)
            x, state = self.gru_rnn(inputs)
            
            # predictions.shape => (batch, features)
            prediction = self.dense(x)
            return prediction, state
    
        def call(self, inputs, training=None):
            # Use a TensorArray to capture dynamically unrolled outputs.
            predictions = []
            # Initialize the lstm state
            prediction, state = self.warmup(inputs)
    
            # Insert the first prediction
            predictions.append(prediction)
    
            # Run the rest of the prediction steps
            for _ in range(1, self.out_steps):
                # Use the last prediction as input.
                x = prediction
                # Execute one gru step.
                x, state = self.gru_cell(x, states=state,
                                                                    training=training)
                # Convert the gru output to a prediction.
                prediction = self.dense(x)
                # Add the prediction to the output
                predictions.append(prediction)
    
            # predictions.shape => (time, batch, features)
            predictions = tf.stack(predictions)
            # predictions.shape => (batch, time, features)
            predictions = tf.transpose(predictions, [1, 0, 2])
            return predictions

【问题讨论】：

标签： python-3.x tensorflow keras deep-learning recurrent-neural-network

【解决方案1】：

我会说问题出在您提供给 ModelCheckpoint 回调的文件路径上，它应该是一个 hdf5 文件。

以我为例：


ckpt_name = '/work/.../weights/{}.hdf5'.format(log_name)

...
callbacks = [
            TensorBoardImage(...),
            tf.keras.callbacks.ModelCheckpoint(filepath=ckpt_name)
        ]
...
model.fit(train_generator, validation_data=validation_generator, validation_freq=1, epochs=FLAGS['epochs'],
                    callbacks=callbacks)

【讨论】：

【解决方案2】：

认为问题的根源在于，在__init__ 中，您将gru_cell 包装在layers.RNN 中。这会导致相同的gru_cell 被使用两次：一次在warmup() 中，然后再次在call() 中。对于训练来说，这不是问题，但正如您所注意到的，保存模型时它会失败。

用layers.GRU替换您的自定义RNN层

改变这个：

def __init__(self, units, out_steps, num_features, act_dense):
    ...
    self.gru_cell = tf.keras.layers.GRUCell(units)
    # Also wrap the LSTMCell in an RNN to simplify the `warmup` method.
    self.gru_rnn = tf.keras.layers.RNN(self.gru_cell, return_state=True)
    ...

到这里：

def __init__(self, units, out_steps, num_features, act_dense):
    ...
    self.gru_cell = tf.keras.layers.GRUCell(units)
    self.gru_rnn = tf.keras.layers.GRU(units, return_state=True)
    ...

（编辑）
注意：gru_cell 和 gru_rnn 层不会像在原始代码中那样共享它们的权重。从这个意义上说，原始版本更可取，因为相同的GRUCell 对整个序列进行操作。

在我的版本中，layers.GRU 对输入序列进行操作，之后状态将传递给layers.GRUCell。这样做的缺点是layers.GRUCell 的权重必须单独优化（学习），并且使用与layers.GRU 相同的权重不会使表单受益，反之亦然。

【讨论】：

您好 Supercluster，感谢您的建议，它确实解决了我的问题。然而，要训练的参数数量也翻了一番，因为 GRU 单元和 GRU 层都是单独定义的。 gru_cell 和 gru_rnn 是否共享相同的权重，或者它们是独立训练的？性能方面，没有任何变化，但我很想知道引擎盖下发生了什么。谢谢！
你是对的。这是我没有想到的。 GRU 单元和 GRU 层不共享它们的权重。 GRU 单元和 GRU 层的权重将单独优化，这可能会导致不太好的预测（更大的错误），因为 GRU 单元不会受益于 GRU 层“学习”的权重，而是 GRU 单元必须“重新学习”自己的权重，反之亦然。