Posted: 2019-11-01 13:23:14
Problem description:
When training with the distribution strategy MirroredStrategy, we get poor validation accuracy. Running the same training on a single GPU without a distribution strategy, both training and validation accuracy exceed 95%.
The problem occurs with the tf.keras ResNet50 model; with small, self-built CNNs the distribution strategy works fine.
It seems the optimizer has a problem with tf.keras models.
Does anyone know what the cause might be and how to fix it? We have run out of ideas.
General setup:
- CUDA 10.0
- tf-nightly-gpu 2.1.0.dev20191029
- 2x RTX 2080 Ti
- Custom grayscale images (270, 270), with a well-tested input pipeline based on tf.data.Dataset. A small CNN reaches over 95% accuracy on them.
We have already tried the following, with similar results:
- Self-built TF 2.0 with CUDA 10.1
- The pip package tensorflow-gpu (v2.0.0) with CUDA 10.0
- Different optimizers
Setup A:
ResNet50 on a single GPU reaches over 95% validation accuracy.
Setup B:
ResNet50 inside a MirroredStrategy scope on 2 GPUs stays below 70% validation accuracy.
Since we use two GPUs, the batch size is a multiple of 2.
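One detail worth keeping in mind: under MirroredStrategy, the batch size produced by the tf.data pipeline is the global batch size, which is split evenly across the replicas. A minimal sketch of that arithmetic (the numbers mirror our setup):

```python
# Under MirroredStrategy, the pipeline's batch size is the *global*
# batch size; each replica receives an equal slice of it per step.
GLOBAL_BATCH_SIZE = 32  # the batch size used in our input pipeline
NUM_REPLICAS = 2        # 2x RTX 2080 Ti

per_replica_batch = GLOBAL_BATCH_SIZE // NUM_REPLICAS
print(per_replica_batch)  # -> 16 samples per GPU per step
```

So each GPU effectively computes gradients on batches of 16, half of what the single-GPU run sees per step.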
Code:
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import ResNet50

df_trainset[target_label] = df_trainset[target_label].astype('int')
df_validset[target_label] = df_validset[target_label].astype('int')
df_testset[target_label] = df_testset[target_label].astype('int')

list_labels_train = df_trainset[target_label]
list_paths_train = df_trainset['sample_path']
list_labels_valid = df_validset[target_label]
list_paths_valid = df_validset['sample_path']
list_labels_test = df_testset[target_label]
list_paths_test = df_testset['sample_path']

def parse_img(label, path):
    # Load a single-channel PNG and scale it to [0, 1] floats.
    img = tf.io.read_file(path)
    img = tf.image.decode_png(img, channels=1, dtype=tf.uint8)
    img = tf.image.convert_image_dtype(img, tf.float32)
    return img, label

BATCH_SIZE = 32

# train
ds_train = tf.data.Dataset.from_tensor_slices((list_labels_train,
                                               list_paths_train))
ds_train = ds_train.map(parse_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(buffer_size=len(list_paths_train), seed=42,
                            reshuffle_each_iteration=True)
ds_train = ds_train.batch(batch_size=BATCH_SIZE, drop_remainder=True).repeat()
ds_train = ds_train.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
train_steps = np.ceil(len(list_paths_train) / BATCH_SIZE)

# valid
ds_valid = tf.data.Dataset.from_tensor_slices((list_labels_valid,
                                               list_paths_valid))
ds_valid = ds_valid.map(parse_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_valid = ds_valid.cache()
ds_valid = ds_valid.shuffle(buffer_size=len(list_paths_valid), seed=42,
                            reshuffle_each_iteration=True)
ds_valid = ds_valid.batch(batch_size=BATCH_SIZE, drop_remainder=True).repeat()
ds_valid = ds_valid.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
valid_steps = np.ceil(len(list_paths_valid) / BATCH_SIZE)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = ResNet50(include_top=True,
                     weights=None,
                     input_tensor=None,
                     input_shape=(270, 270, 1),
                     pooling=None,
                     classes=3)
    model.compile(optimizer=tf.optimizers.Adam(),
                  loss='sparse_categorical_crossentropy',
                  metrics=["accuracy"])

model.summary()
history = model.fit(
    x=ds_train,
    epochs=10,
    verbose=1,
    validation_data=ds_valid,
    steps_per_epoch=train_steps,
    validation_steps=valid_steps,
    use_multiprocessing=False)
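For reference, the 563 steps per epoch visible in the logs below follow from the step arithmetic above; it can be checked without TensorFlow (the training-set size here is an assumption chosen only to be consistent with the 563 logged steps):

```python
import math

BATCH_SIZE = 32
n_train = 18000  # hypothetical size consistent with 563 logged steps

train_steps = math.ceil(n_train / BATCH_SIZE)
print(train_steps)  # -> 563

# With drop_remainder=True only full batches are emitted, i.e.
# floor(n_train / BATCH_SIZE) = 562 of them per pass over the data;
# the trailing .repeat() supplies the extra batch needed for step 563.
full_batches = n_train // BATCH_SIZE
print(full_batches)  # -> 562
```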
Results
Training on a single device without a distribution strategy, everything looks fine:
563/563 [==============================] - 141s 250ms/step - loss: 0.1185 - accuracy: 0.9616 - val_loss: 0.5751 - val_accuracy: 0.8078
Epoch 2/10
563/563 [==============================] - 130s 231ms/step - loss: 0.0400 - accuracy: 0.9865 - val_loss: 0.8953 - val_accuracy: 0.7119
Epoch 3/10
563/563 [==============================] - 130s 231ms/step - loss: 0.0478 - accuracy: 0.9870 - val_loss: 25.3537 - val_accuracy: 0.3367
Epoch 4/10
563/563 [==============================] - 130s 230ms/step - loss: 0.0309 - accuracy: 0.9906 - val_loss: 0.0576 - val_accuracy: 0.9946
Epoch 5/10
563/563 [==============================] - 129s 230ms/step - loss: 0.0210 - accuracy: 0.9940 - val_loss: 0.0780 - val_accuracy: 0.9916
Epoch 6/10
563/563 [==============================] - 130s 230ms/step - loss: 0.0227 - accuracy: 0.9937 - val_loss: 0.0595 - val_accuracy: 0.9887
Epoch 7/10
563/563 [==============================] - 129s 230ms/step - loss: 0.0160 - accuracy: 0.9949 - val_loss: 0.0536 - val_accuracy: 0.9946
Epoch 8/10
81/563 [===>..........................] - ETA: 1:39 - loss: 0.0222 - accuracy: 0.9945
Training with the distribution strategy:
563/563 [==============================] - 119s 211ms/step - loss: 1.0535 - accuracy: 0.5099 - val_loss: 1.0735 - val_accuracy: 0.6682
Epoch 2/10
563/563 [==============================] - 95s 169ms/step - loss: 1.0123 - accuracy: 0.5277 - val_loss: 1.0721 - val_accuracy: 0.6682
Epoch 3/10
563/563 [==============================] - 95s 169ms/step - loss: 1.0121 - accuracy: 0.5277 - val_loss: 1.0709 - val_accuracy: 0.6682
Epoch 4/10
563/563 [==============================] - 95s 169ms/step - loss: 1.0124 - accuracy: 0.5277 - val_loss: 1.0667 - val_accuracy: 0.6682
Epoch 5/10
563/563 [==============================] - 95s 169ms/step - loss: 1.0121 - accuracy: 0.5277 - val_loss: 1.0687 - val_accuracy: 0.6682
Epoch 6/10
563/563 [==============================] - 95s 168ms/step - loss: 1.0125 - accuracy: 0.5277 - val_loss: 1.0638 - val_accuracy: 0.6682
Epoch 7/10
563/563 [==============================] - 94s 167ms/step - loss: 1.0125 - accuracy: 0.5277 - val_loss: 1.0639 - val_accuracy: 0.6682
Epoch 8/10
400/563 [====================>.........] - ETA: 24s - loss: 1.0135 - accuracy: 0.5268
Tags: python tensorflow deep-learning tf.keras