【Posted】:2020-12-21 16:18:55
【Question】:
There are a few reports of this problem, but I have still not found an answer. Here is a short code snippet that reproduces it:
import tensorflow as tf
from tensorflow.keras import layers
print(tf.__version__)
# 2.3.1
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
    model.compile(loss='mse', optimizer='sgd')
dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(100).batch(10)
model.fit(dataset, epochs=4)
After running it I get:
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
Epoch 1/4
INFO:tensorflow:batch_all_reduce: 2 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 2 all-reduces with algorithm = nccl, num_packs = 1
10/10 [==============================] - 1s 93ms/step - loss: 807385211185512087331799040.0000
Epoch 2/4
10/10 [==============================] - 1s 93ms/step - loss: nan
Epoch 3/4
10/10 [==============================] - 1s 93ms/step - loss: nan
Epoch 4/4
10/10 [==============================] - 1s 93ms/step - loss: nan
10/10 [==============================] - 0s 48ms/step - loss: nan
Without the strategy, the output looks normal and the loss is computed correctly:
Epoch 1/4
10/10 [==============================] - 0s 2ms/step - loss: 4.2581
Epoch 2/4
10/10 [==============================] - 0s 2ms/step - loss: 1.8821
Epoch 3/4
10/10 [==============================] - 0s 2ms/step - loss: 0.8319
Epoch 4/4
10/10 [==============================] - 0s 2ms/step - loss: 0.3677
10/10 [==============================] - 0s 1ms/step - loss: 0.2284
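As a sanity check on the single-device numbers above, here is a minimal plain-Python sketch of the same training problem (one weight, one bias, every sample x=1, y=1, MSE loss, plain SGD). It assumes a learning rate of 0.01, which is the Keras default for optimizer='sgd', and a hypothetical starting weight w0, since the real kernel is initialized randomly. The function name simulate_epoch_losses is made up for illustration.

```python
# Plain-Python re-derivation of the single-device run; no TensorFlow needed.
# Assumptions (not from the post): lr=0.01 (Keras default for 'sgd') and an
# arbitrary initial weight w0, because the real initial kernel is random.

def simulate_epoch_losses(w0=-1.0, b0=0.0, lr=0.01, steps_per_epoch=10, epochs=4):
    w, b = w0, b0
    epoch_losses = []
    for _ in range(epochs):
        batch_losses = []
        for _ in range(steps_per_epoch):
            err = w * 1.0 + b - 1.0        # every sample is x=1, y=1
            batch_losses.append(err ** 2)  # MSE of a batch of identical samples
            grad = 2.0 * err               # dL/dw == dL/db because x == 1
            w -= lr * grad
            b -= lr * grad
        # Keras reports the mean of the batch losses seen during the epoch.
        epoch_losses.append(sum(batch_losses) / len(batch_losses))
    return epoch_losses

losses = simulate_epoch_losses()
ratios = [later / earlier for earlier, later in zip(losses, losses[1:])]
print(losses)
print(ratios)  # each ratio is approximately (1 - 4 * lr) ** 20, about 0.442
```

Under these assumptions the epoch loss shrinks by the same factor every epoch regardless of w0, and the log above shows the same factor (1.8821 / 4.2581 is about 0.442). So the single-device run behaves the way this update rule predicts, while a first-epoch loss around 8e26 followed by nan is not something this update rule produces on its own, which suggests looking at the multi-GPU path rather than at the model.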
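As the runtime environment I use the TensorFlow container nvcr.io/nvidia/tensorflow:20.10-tf2-py3 from Nvidia GPU Cloud, so it is up to date and compatible with all the relevant drivers. I have also tried the newer version, 20.12-tf2-py3.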
【Comments】:
Tags: gpu tensorflow2.0 multi-gpu