Why does the loss spike up after compiling for a second time? (Answers)

【Question title】: Why does the loss spike up after compiling for a second time?
【Posted】: 2021-09-09 00:44:12
【Question description】:

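I am currently working on a project that requires me to change half of the model architecture during training with TensorFlow. New weights are added and others are removed. The model needs to be recompiled so that the optimizer recognizes the new weights and computes gradients for them.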

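However, I noticed that after compiling the network, the loss spikes before going down again (see here). In the first step after compiling, the loss is still as low as before, but it then increases very quickly. This question is similar to mine, but it only says that you should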

"initialize the second training validation accuracy with a list (manual or obtained from a Callback) from the previous training."

But I could not find any resources on how to do this. My attempts included:

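Using SGD instead of Adam, because it should not depend on previous state
Adding the history of the previous model.fit() call
Setting model._train_counter to the number of epochs it performed in the previous call
All combinations of the above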
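To make the "previous state" in the first attempt concrete, here is a minimal scalar sketch of the standard Adam update rule in plain Python. It compares a freshly created optimizer (zeroed moments, as after a new compile with a new optimizer object) with one that has already accumulated moments. The function name `adam_step` and the "warm" moment values are hypothetical and chosen only for illustration; they are not measured from the model in this question, and the sketch does not by itself explain why the spike also appears with SGD.

```python
import math


def adam_step(grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a single scalar parameter.

    Returns (delta, m, v), where delta is the change applied to the parameter.
    """
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    delta = -lr * m_hat / (math.sqrt(v_hat) + eps)
    return delta, m, v


small_grad = 1e-4  # assumed small gradient near a minimum

# Fresh optimizer: moments start at zero and the step counter starts at 1.
fresh_delta, _, _ = adam_step(small_grad, m=0.0, v=0.0, t=1)

# "Warm" optimizer: hypothetical accumulated moments after 1000 steps.
warm_delta, _, _ = adam_step(small_grad, m=0.0, v=1e-4, t=1000)

print(f"fresh step: {fresh_delta:.3e}, warm step: {warm_delta:.3e}")
```

With zeroed moments the first step has a magnitude close to the learning rate regardless of how small the gradient is, whereas the warm optimizer in this made-up scenario takes a much smaller step.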

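I reproduced the problem with a modified version of the example from https://www.tensorflow.org/datasets/keras_example, and increased the network complexity because the height of the spike seems to grow with the size of the network: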

import tensorflow as tf
import tensorflow_datasets as tfds
import matplotlib.pyplot as plt
(ds_train, ds_test), ds_info = tfds.load(
    'cifar10',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)

def normalize_img(image, label):
  """Normalizes images: `uint8` -> `float32`."""
  return tf.cast(image, tf.float32) / 255., label

ds_train = ds_train.map(
    normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(256)
ds_train = ds_train.prefetch(tf.data.experimental.AUTOTUNE).repeat()

ds_test = ds_test.map(
    normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_test = ds_test.batch(256)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.experimental.AUTOTUNE)

#%% Define Model    
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(512,activation='relu'),
  tf.keras.layers.Dense(256,activation='relu'),
  tf.keras.layers.Dense(128,activation='relu'),
  tf.keras.layers.Dense(128,activation='relu'),
  tf.keras.layers.Dense(10)
])


#%% First compilation
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

history1 = model.fit(
    ds_train,
    epochs=8,
    steps_per_epoch=300,
    validation_data=ds_test,
)

#%% Compile again
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()],
)

history2 = model.fit(
    ds_train,
    epochs=10,
    steps_per_epoch=1,
    validation_data=ds_test,
)
#%% plot results
plt.plot(history1.history['loss']+history2.history['loss'])
plt.show()

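This is the resulting plot. In this example I did not change the network but only compiled it with a different optimizer; in my tests the loss spiked no matter which combination of optimizers I chose. (If you compile with model.optimizer without changing the model, the loss does not increase, which makes me think I have to change the optimizer. But SGD did not work either, which confuses me.)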

It is the same problem you get when you resume training a restored model with another model.fit() call.

I am using TensorFlow version 2.5.0.

Any ideas on how to solve or work around this problem?

Tags: python tensorflow keras resuming-training

【Solution 1】:

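Update: I did not solve the problem itself, but I worked around it with a learning rate schedule that starts low after the compile step and only slowly increases again. This prevents the model from leaving the local minimum it is already in.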
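A minimal sketch of such a warmup, assuming a linear ramp up to the target learning rate. The names `warmup_learning_rate` and `make_epoch_schedule`, and the values used for `target_lr` and `warmup_steps`, are illustrative assumptions and not the schedule actually used in this answer.

```python
def warmup_learning_rate(step, target_lr=1e-3, warmup_steps=100):
    """Linearly ramp the learning rate up to target_lr, then hold it.

    step is zero-based; the first step already gets a non-zero rate.
    """
    if warmup_steps <= 0 or step + 1 >= warmup_steps:
        return target_lr
    return target_lr * (step + 1) / warmup_steps


def make_epoch_schedule(target_lr, warmup_epochs):
    """Wrap the ramp in the schedule(epoch, lr) signature that
    tf.keras.callbacks.LearningRateScheduler expects."""
    def schedule(epoch, lr):
        # The current lr passed in by Keras is ignored; the ramp depends
        # only on the epoch index.
        return warmup_learning_rate(epoch, target_lr, warmup_epochs)
    return schedule


# Inspect the first few values of a 5-epoch warmup towards 1e-3.
schedule = make_epoch_schedule(target_lr=1e-3, warmup_epochs=5)
rates = [schedule(epoch, None) for epoch in range(7)]
print(rates)
```

One way to use it would be to pass `tf.keras.callbacks.LearningRateScheduler(make_epoch_schedule(1e-3, 5))` in the `callbacks` argument of the `model.fit()` call that follows the second compile. That callback updates the rate once per epoch, so a per-step warmup would need a different mechanism unless, as in the reproduction above, each epoch consists of a single step.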

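If you have a similar problem, you can try compiling the model with model.compile(..., run_eagerly=True), so that TensorFlow does not build a computation graph for training. This means you would not have to recompile the model after changing the architecture. It did not work for me, but I have a very specific architecture.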
