TRANSIENT_ERROR for TPU in Google Colab

[Question title]: TRANSIENT_ERROR for TPU in Google Colab
[Posted]: 2020-03-18 20:19:03
[Question description]:

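I am trying to run an LRCN Keras model on a TPU with TensorFlow 2.0. The model and the generator work on CPU/GPU, but I include them here for reference. I also initialize the TPU, and it is visible. Everything looks fine until I call .fit():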

def frame_generator(self, batch_size, train_test, data_type):
    """Return a generator that we can use to train on. There are
    a couple different things we can return:
    data_type: 'features', 'images'
    """
    # Get the right dataset for the generator.
    train, test = self.split_train_test()
    data = train if train_test == 'train' else test

    #print("Creating %s generator with %d samples." % (train_test, len(data)))

    while 1:
        X, y = [], []

        # Generate batch_size samples.
        for _ in range(batch_size):
            if random.random() < .5:
                # real
                while True:
                    # Get a random sample.
                    sample = random.choice(data)

                    # Get the sequence from disk.
                    (_x,_y) = self.get_extracted_sequence(data_type, sample)

                    if _y==[0,1]:
                        break
            else:
                # fake
                while True:
                    # Get a random sample.
                    sample = random.choice(data)

                    # Get the sequence from disk.
                    (_x,_y) = self.get_extracted_sequence(data_type, sample)

                    if _y==[1,0]:
                        break

            if _x is None:
                raise ValueError("Can't find sequence. Did you generate them?", sample)

            X.append(_x)
            y.append(_y)

        #yield [np.array(X), np.array(y)], np.array(y)
        yield np.array(X), np.array(y)

train_generator = data.frame_generator(batch_size, 'train', 'images')
val_generator = data.frame_generator(batch_size, 'test', 'images')

optimizer = Adam(lr=1e-5)

with tpu_strategy.scope():
  model = lrcn()
  model.add(tf.keras.layers.Dense(2, activation='softmax'))

  model.compile(loss='binary_crossentropy',
      optimizer=optimizer,
      metrics=['accuracy', tf.compat.v1.losses.log_loss])
  model.summary() 

train_data = tf.data.Dataset.from_generator(lambda:next(train_generator),
                                        (tf.float32, tf.int64),
                                        ([4, 32,299,299,3], [4,2])     
                                      )

val_data = tf.data.Dataset.from_generator(lambda:next(val_generator),
                                        (tf.float32, tf.int64),
                                      ([4, 32,299,299,3], [4,2]) 
                                      )


model.fit(x=train_data, steps_per_epoch=train_steps, validation_steps=test_steps,
      validation_data=val_data,
        epochs=30,
        callbacks=callbacks,
        verbose=1)

On model.fit I get:

Train for 6421.0 steps, validate for 1605.0 steps

Epoch 1/30

UnavailableError Traceback (most recent call last) in ()
     15 epochs=30,
     16 callbacks=callbacks,
---> 17 verbose=1)

11 frames /usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)

UnavailableError: channel is in state TRANSIENT_FAILURE Additional GRPC error information: {"created":"@1584561754.347859160","description":"channel is in state TRANSIENT_FAILURE","file":"external/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":2294,"grpc_status":14} [Op:__inference_distributed_function_24182]

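Any ideas on how to fix this? It looks like the problem is on Google's network side.

Update:

Part of the solution is that you should not install tensorflow 2.1 with pip in a Colab notebook. Instead, run the following in its own cell before "import tensorflow":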

%tensorflow_version 2.x

This changes the TPU version from 1.15 to >=2.1.

Now when I run the notebook, I get more detail:

Train for 6902.0 steps, validate for 1725.0 steps
Epoch 1/30

1/6902 [.......................] - ETA: 20:04:55

NotFoundError Traceback (most recent call last) /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py in on_epoch(self, epoch, mode)
    766 try:
--> 767 yield epoch_logs
    768 finally:

18 frames
NotFoundError: {{function_node __inference_distributed_function_20824}} No registered 'PyFunc' OpKernel for 'CPU' devices compatible with node {{node PyFunc}} . Registered:

 [[PyFunc]]
 [[MultiDeviceIteratorGetNextFromShard]]
 [[RemoteCall]]
 [[IteratorGetNextAsOptional]]

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last) /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/callbacks.py in _get_file_path(self, epoch, logs)
   1053 if not self.model._in_multi_worker_mode(
   1054 ) or multi_worker_util.should_save_checkpoint():
-> 1055 return self.filepath.format(epoch=epoch + 1, **logs)
   1056 else:
   1057 # If this is multi-worker training, and this worker should not

KeyError: 'val_accuracy'

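[Question comments]:

Tags: tensorflow keras google-colaboratory tpu google-cloud-tpu

[Solution 1]:

TL;DR

You need to install a newer version so that the Python function is executed before it is sent to the TPU. Load the newer version with: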

import requests
import os
url = 'http://' + os.environ['COLAB_TPU_ADDR'].split(':')[0] + ':8475/requestversion/2.2.0-dev20200311'
resp = requests.post(url)
print(resp)
%pip install tf-nightly==2.2.0-dev20200311

From https://github.com/tensorflow/tensorflow/issues/34346
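The URL construction in the snippet above can be pulled into a small helper so it is easier to check. This is only a sketch: the function name `build_version_request_url` and the example address `10.0.0.2:8470` are made up for illustration, while the port 8475 and the `/requestversion/` path are copied from the snippet above rather than from any documented API.

```python
def build_version_request_url(tpu_addr, version, port=8475):
    """Build the URL used above to ask the Colab TPU host for a TensorFlow version.

    tpu_addr has the form "host:grpc_port", like the COLAB_TPU_ADDR variable.
    The port default and URL path are taken from the answer's snippet.
    """
    host = tpu_addr.split(':')[0]  # keep only the host, drop the gRPC port
    return 'http://{}:{}/requestversion/{}'.format(host, port, version)


# Hypothetical address, used only to show the resulting URL.
url = build_version_request_url('10.0.0.2:8470', '2.2.0-dev20200311')
print(url)
```

In a Colab notebook you would pass `os.environ['COLAB_TPU_ADDR']` as `tpu_addr` and send the result with `requests.post(url)`, as in the original snippet.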

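When you use Dataset.from_generator (or pass a generator to Keras, which calls it under the hood), the Dataset embeds the generator in a PyFunc op in its graph. Every time that op is invoked, it calls next on the generator and fetches the resulting bytes. (It basically treats Python as a black box.)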
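To make the "black box" idea concrete, here is a toy stand-in written in plain Python. It is not TensorFlow's implementation: the class `PyFuncLikeOp` and the generator `batches` are hypothetical names, and the only point is that the "op" holds a handle to a live Python iterator and calls next on it each time it runs.

```python
class PyFuncLikeOp:
    """Toy stand-in for a graph op that wraps a local Python generator."""

    def __init__(self, generator_fn):
        # The op keeps nothing but a handle to a live Python iterator.
        self._iterator = iter(generator_fn())

    def __call__(self):
        # Every invocation hands control back to the Python interpreter.
        return next(self._iterator)


def batches():
    """Endless generator of placeholder batches."""
    batch_id = 0
    while True:
        yield 'batch-{}'.format(batch_id)
        batch_id += 1


op = PyFuncLikeOp(batches)
first, second = op(), op()
print(first, second)
```

The op only works where the iterator lives, which is the interpreter that created it.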

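That is fine when everything runs on the same machine. The problem is that with a TPU, a separate machine controls the TPU (imaginatively called the TPU host controller ^^), and you run on the TPU by sending it a TensorFlow graph to execute. So the graph containing that PyFunc is sent to the TPU host, which cannot execute it because there is no Python on the TPU host. (Even if there were, it would not be the same interpreter with the same state as your local machine.) So it fails and tells you that it cannot execute the PyFunc op, although unfortunately the message is not very clear.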
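The point about interpreter state can be illustrated in plain Python. This is an analogy, not how TensorFlow ships graphs: it only shows that a running generator cannot be serialized with the standard `pickle` module, so its state cannot simply be copied to another machine. The generator name `frame_batches` is hypothetical.

```python
import pickle


def frame_batches():
    """Endless generator standing in for the question's frame_generator."""
    step = 0
    while True:
        yield step
        step += 1


gen = frame_batches()
next(gen)  # advance once: the generator now carries state local to this interpreter

try:
    pickle.dumps(gen)
    shipped = True
except TypeError as err:
    # Generator objects are not picklable, so the attempt ends up here.
    shipped = False
    print('cannot serialize generator:', err)
```

Whatever runs remotely therefore has to be expressed as graph ops, or the Python code has to be executed locally before the data is sent.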

[Comments]:

This still isn't fixed in Colab; now I get "GRPC ERROR: FAILED TO PICK SUBCHANNEL"