TensorFlow loading weights on CPU when model is trained on GPU

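【Question title】: Tensorflow loading weights on CPU when model is trained on GPU
【Posted】: 2021-06-29 10:54:57
【Question】:

I wrote a BERT model in Colab, trained it on a GPU, and downloaded the weights for later inference. I don't need a GPU for prediction, so I am testing on a local machine that has none. Loading the weights on my local PC raises the error below, while Colab raises no error. I don't know how to proceed.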

File "/home/akash/anaconda3/lib/python3.7/site-packages/tensorflow/python/saved_model/load.py", line 909, in load_internal
    str(err) + "\n If trying to load on a different device from the "
FileNotFoundError: Op type not registered 'CaseFoldUTF8' in binary running on akash. Make sure
the Op and Kernel are registered in the binary running in this process.
Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) 
`tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered 
when the module is first accessed.

I load the weights like this:

self.classifier_model = self.build_classifier_model()
self.classifier_model.load_weights(BERT_HEADING)

Output of pip list | grep 'tensorflow':

tensorflow                         2.5.0
tensorflow-addons                  0.13.0
tensorflow-datasets                4.3.0
tensorflow-estimator               2.5.0
tensorflow-hub                     0.12.0
tensorflow-metadata                1.1.0
tensorflow-model-optimization      0.6.0
tensorflow-text                    2.5.0

My model:

bert_model_name = 'small_bert/bert_en_uncased_L-8_H-512_A-8'
tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1'
tfhub_handle_preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)
bert_model = hub.KerasLayer(tfhub_handle_encoder)

def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
  encoder_inputs = preprocessing_layer(text_input)
  encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=True, name='BERT_encoder')
  outputs = encoder(encoder_inputs)
  net = outputs['pooled_output']
  net = tf.keras.layers.Dropout(0.1)(net)
  net = tf.keras.layers.Dense(updated_data_frame['heading'].nunique(), activation='softmax', name='classifier')(net)
  return tf.keras.Model(text_input, net)

classifier_model = build_classifier_model()

epochs = 5
steps_per_epoch = tf.data.experimental.cardinality(train_ds).numpy()
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1*num_train_steps)

init_lr = 3e-5
optimizer = optimization.create_optimizer(init_lr=init_lr,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')
classifier_model.compile(optimizer=optimizer,
                         loss=loss,
                         metrics=['CategoricalAccuracy'])
print(f'Training model with {tfhub_handle_encoder}')
history = classifier_model.fit(x=train_ds,
                               validation_data=val_ds,
                               epochs=5)

saved_model_path = 'resume_headings.h5'
classifier_model.save_weights(saved_model_path)

reloaded_model= build_classifier_model() # <-- This was working fine on Colab but giving an error (detailed desc above)
reloaded_model.load_weights(saved_model_path)

【Question comments】:

Tags: python tensorflow keras nlp tf.keras

【Answer 1】:

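The error you are seeing is most likely because tensorflow-text is not installed locally (or is not installed in the environment where you are running BERT).

I say this because I can see you use both pip and conda as package managers, and it is easy to fall into the trap of not knowing which environment or virtual environment is actually active.

For example, you could have one environment with TF 2.3 and pandas 1.0.0 and another with TF 2.5 and pandas 1.2.0. If you run in the first environment, you cannot use features introduced after pandas 1.0.0, because dependencies are resolved per environment. I hope that makes it clearer.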
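To check which environment a script actually runs in, a minimal sketch like the following prints the interpreter path and the package versions visible to it. The helper name installed_version is made up for illustration, and importlib.metadata assumes Python 3.8 or newer (on 3.7 the importlib_metadata backport offers the same calls).

```python
import sys
from importlib import metadata  # Python 3.8+; on 3.7 use the importlib_metadata backport


def installed_version(package):
    """Return the version of `package` visible to this interpreter, or None."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# The interpreter path shows whether Anaconda, a venv, or the system Python is running.
print("interpreter:", sys.executable)

# A value of None means the package is not installed in *this* environment,
# even if pip reports it as installed in another one.
for name in ("tensorflow", "tensorflow-text", "tensorflow-hub"):
    print(name, installed_version(name))
```

Running the same script on Colab and on the local machine gives two listings that can be compared line by line.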

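Also make sure the TensorFlow version you have locally is exactly the same as the one you trained the BERT model with.

(Install it with !pip install tensorflow-text in a notebook; in a shell, drop the leading !.)

You can also see a very similar error here: https://github.com/tensorflow/hub/issues/705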
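Having the package installed may not be enough on its own: ops such as CaseFoldUTF8 come from tensorflow_text and are only registered once that module has been imported in the process that builds or loads the model. The sketch below assumes that is the cause here; try_register_ops is a hypothetical helper, not part of any library.

```python
import importlib


def try_register_ops(module_name="tensorflow_text"):
    """Import a module for its op-registration side effect.

    Returns (ok, detail) so that a failure is reported instead of raised.
    """
    try:
        importlib.import_module(module_name)
    except Exception as err:  # missing package, or a binary built for another TF version
        return False, f"{type(err).__name__}: {err}"
    return True, "imported"


ok, detail = try_register_ops()
print("tensorflow_text:", detail)

# Only once the import succeeds does it make sense to rebuild the model and
# load the weights, e.g. with the names used in the question:
#   reloaded_model = build_classifier_model()
#   reloaded_model.load_weights(saved_model_path)
```

In a plain script the same effect is usually achieved with a top-level import tensorflow_text line placed before the hub.KerasLayer calls.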

Separately, if you want to make sure your model is loaded on the GPU, use this snippet/logic:

with tf.device('/GPU:0'):
    model = load_model()
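For the case in the question (a machine with no GPU), the same idea can be sketched with a CPU fallback. As far as I know, weights written by save_weights are not tied to the device they were trained on, so explicit placement should not be required; pick_device is a hypothetical helper and the tensor is only a stand-in for the real model.

```python
def pick_device(gpu_count):
    """Return a TensorFlow device string: first GPU if any exist, else the CPU."""
    return "/GPU:0" if gpu_count > 0 else "/CPU:0"


try:
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    device = pick_device(len(gpus))
    with tf.device(device):
        # Stand-in for building the model and calling load_weights().
        doubled = tf.constant([1.0, 2.0]) * 2.0
    print("placed on", device, "->", doubled.numpy())
except ImportError:
    # TensorFlow is not installed in this environment.
    print("TensorFlow not available; would fall back to", pick_device(0))
```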

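【Comments】:

I already have tensorflow-text installed on my local machine, and I have added the output of pip list | grep 'tensorflow' to the post above. The Colab I used to train the model on GPU also has tensorflow-text installed, at the same version. I still can't understand what the problem is. I want to load the weights on my CPU rather than a GPU; I have tried different approaches, but none worked.
load_model() seems to point to Anaconda, not pip. If the Anaconda environment you are building the model from cannot "see" tensorflow-text, it doesn't matter that you installed it with pip. You can have several environments on your local machine, and it is quite possible that some have, say, TF 2.2.0 and others 2.5.0. There are of course differences between them.
My suggestion is to create a virtual environment from scratch (one that does not depend on Anaconda) and redo the steps (installing all the dependencies, of course).
I tried installing all the packages in a new venv, but the same error persists; only the traceback path has changed. It now comes from /home/akash/bert/lib/python3.7/site-packages/tensorflow/python/saved_model/load.py
So does every dependency have the same version as on Colab? (Including the Python version?)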