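【Posted】: 2020-03-18 20:19:03
【Question】:
I am trying to run an LRCN Keras model on a TPU with TensorFlow 2.0. The model and the generator work on CPU/GPU, but I include them here for reference. I also initialize the TPU and it is visible; everything looks fine until I call .fit():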
def frame_generator(self, batch_size, train_test, data_type):
    """Return a generator that we can use to train on. There are
    a couple different things we can return:
    data_type: 'features', 'images'
    """
    # Get the right dataset for the generator.
    train, test = self.split_train_test()
    data = train if train_test == 'train' else test
    #print("Creating %s generator with %d samples." % (train_test, len(data)))
    while 1:
        X, y = [], []
        # Generate batch_size samples.
        for _ in range(batch_size):
            if random.random() < .5:
                # real
                while True:
                    # Get a random sample.
                    sample = random.choice(data)
                    # Get the sequence from disk.
                    (_x,_y) = self.get_extracted_sequence(data_type, sample)
                    if _y==[0,1]:
                        break
            else:
                # fake
                while True:
                    # Get a random sample.
                    sample = random.choice(data)
                    # Get the sequence from disk.
                    (_x,_y) = self.get_extracted_sequence(data_type, sample)
                    if _y==[1,0]:
                        break
            if _x is None:
                raise ValueError("Can't find sequence. Did you generate them?", sample)
            X.append(_x)
            y.append(_y)
        #yield [np.array(X), np.array(y)], np.array(y)
        yield np.array(X), np.array(y)
train_generator = data.frame_generator(batch_size, 'train', 'images')
val_generator = data.frame_generator(batch_size, 'test', 'images')
optimizer = Adam(lr=1e-5)
with tpu_strategy.scope():
    model = lrcn()
    model.add(tf.keras.layers.Dense(2, activation='softmax'))
    model.compile(loss='binary_crossentropy',
                  optimizer=optimizer,
                  metrics=['accuracy', tf.compat.v1.losses.log_loss])
    model.summary()
train_data = tf.data.Dataset.from_generator(lambda: next(train_generator),
                                            (tf.float32, tf.int64),
                                            ([4, 32, 299, 299, 3], [4, 2])
                                            )
val_data = tf.data.Dataset.from_generator(lambda: next(val_generator),
                                          (tf.float32, tf.int64),
                                          ([4, 32, 299, 299, 3], [4, 2])
                                          )
model.fit(x=train_data, steps_per_epoch=train_steps, validation_steps=test_steps,
          validation_data=val_data,
          epochs=30,
          callbacks=callbacks,
          verbose=1)
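A side note on the two from_generator calls, separate from the TPU errors: per the tf.data documentation, the first argument must be a callable that returns an object supporting iter(). A lambda that calls next() on the generator instead returns a single (X, y) batch tuple, so iterating over the result walks the two members of that tuple rather than a stream of batches. The plain-Python sketch below illustrates only that difference. batch_generator and its string placeholders are hypothetical stand-ins for frame_generator and its arrays, and no TensorFlow is involved.

```python
import itertools


def batch_generator():
    """Hypothetical stand-in for frame_generator: yields (X, y) batches forever."""
    for step in itertools.count():
        yield ("X%d" % step, "y%d" % step)


gen = batch_generator()

# Pattern from the question: the callable returns ONE batch tuple.
as_written = lambda: next(gen)

# Alternative: the callable returns an iterable of batches.
as_iterable = lambda: batch_generator()

# Iterate over whatever each callable returns and take the first two items.
first_as_written = list(itertools.islice(iter(as_written()), 2))
first_as_iterable = list(itertools.islice(iter(as_iterable()), 2))

print(first_as_written)   # the two members of a single batch
print(first_as_iterable)  # two complete (X, y) batches
```

Handing tf.data the generator itself (for example a lambda that returns train_generator) would give it an iterable of batches. Whether that changes the behaviour seen on the TPU is not something this sketch can show.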
Calling model.fit on the TPU produces:
Train for 6421.0 steps, validate for 1605.0 steps
Epoch 1/30
UnavailableError                          Traceback (most recent call last)
in <module>()
     15 epochs=30,
     16 callbacks=callbacks,
---> 17 verbose=1)
11 frames
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)
UnavailableError: channel is in state TRANSIENT_FAILURE
Additional GRPC error information:
{"created":"@1584561754.347859160","description":"channel is in state TRANSIENT_FAILURE","file":"external/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":2294,"grpc_status":14} [Op:__inference_distributed_function_24182]
channel is in state TRANSIENT_FAILURE","file":"external/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":2294,"grpc_status":14} [Op:__inference_distributed_function_10577]
Any ideas on how to fix this? It looks like the problem is on Google's network side.
Update:
Part of the solution is that you should not install tensorflow 2.1 with pip in a Colab notebook. Instead, run the following in its own cell before "import tensorflow":
%tensorflow_version 2.x
This changes the TPU version from 1.15 to >=2.1.
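A quick sanity check after switching is to confirm the reported TensorFlow version before connecting to the TPU. The helper below is a minimal sketch in plain Python. version_at_least is a hypothetical name, and in the notebook the string passed in would be tf.__version__. It only checks the client-side package and does not query the TPU runtime itself.

```python
import re


def version_at_least(version, minimum=(2, 1)):
    """Return True if a version string such as '2.1.0' or '2.2.0-rc1'
    is at least `minimum`, comparing only the major and minor numbers."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("Unrecognised version string: %r" % (version,))
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= tuple(minimum)


# In the notebook (assumption: tensorflow has already been imported as tf):
#     assert version_at_least(tf.__version__), tf.__version__
```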
Now when I run the notebook I get more details:
Train for 6902.0 steps, validate for 1725.0 steps
Epoch 1/30
1/6902 [..............................] - ETA: 20:04:55
NotFoundError                             Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py in on_epoch(self, epoch, mode)
    766 try:
--> 767 yield epoch_logs
    768 finally:
18 frames
NotFoundError: {{function_node __inference_distributed_function_20824}} No registered 'PyFunc' OpKernel for 'CPU' devices compatible with node {{node PyFunc}}. Registered:
    [[PyFunc]]
    [[MultiDeviceIteratorGetNextFromShard]]
    [[RemoteCall]]
    [[IteratorGetNextAsOptional]]
During handling of the above exception, another exception occurred:
KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/callbacks.py in _get_file_path(self, epoch, logs)
   1053 if not self.model._in_multi_worker_mode(
   1054 ) or multi_worker_util.should_save_checkpoint():
-> 1055 return self.filepath.format(epoch=epoch + 1, **logs)
   1056 else:
   1057 # If this is multi-worker training, and this worker should not
KeyError: 'val_accuracy'
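The closing KeyError looks like a follow-on failure rather than a separate bug. The traceback shows the checkpoint callback calling self.filepath.format(epoch=epoch + 1, **logs), and since the epoch aborted before validation ran, logs presumably had no val_accuracy entry. The sketch below reproduces that mechanism in plain Python. The filepath template and the logs contents are hypothetical, because the callbacks list is not shown in the question.

```python
# Hypothetical checkpoint path template; the real one is not shown in the post.
filepath = "weights.{epoch:03d}-{val_accuracy:.3f}.hdf5"


def format_checkpoint_path(template, epoch, logs):
    """Mirror the call in the traceback: template.format(epoch=epoch + 1, **logs)."""
    return template.format(epoch=epoch + 1, **logs)


# Logs from an epoch that completed validation (values are made up).
complete_logs = {"loss": 0.69, "accuracy": 0.5, "val_loss": 0.7, "val_accuracy": 0.5}
ok_path = format_checkpoint_path(filepath, 0, complete_logs)

# Logs from an epoch that aborted before validation: no 'val_accuracy' key.
partial_logs = {"loss": 0.69, "accuracy": 0.5}
try:
    format_checkpoint_path(filepath, 0, partial_logs)
    missing_key = None
except KeyError as err:
    # str.format raises KeyError naming the placeholder it could not fill.
    missing_key = err.args[0]

print(ok_path)
print(missing_key)
```

If this reading is right, the KeyError should disappear once the PyFunc error is resolved and validation metrics are actually produced.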
【Discussion】:
Tags: tensorflow keras google-colaboratory tpu google-cloud-tpu