【Posted】: 2020-03-08 15:59:18
【Problem Description】:
I am trying to run the TensorFlow Hub version of Albert on multiple GPUs on the same machine. The model runs perfectly on a single GPU.
Here is my code structure:
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))  # it prints 2 .. correct

if __name__ == "__main__":
    with strategy.scope():
        run()
Inside the run() function I read the data, build the model, and fit it. Roughly, it looks like the sketch below.
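(A simplified sketch; load_data() is a placeholder name, while build_model and bert_max_seq_length appear in the traceback below:)

def run():
    train_x, train_y = load_data()  # placeholder for the data-reading step
    model = build_model(bert_max_seq_length)  # builds and compiles the Keras model
    model.fit(train_x, train_y, epochs=3, batch_size=32)  # assumed hyperparameters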
I get this error:
Traceback (most recent call last):
  File "Albert.py", line 130, in <module>
    run()
  File "Albert.py", line 88, in run
    model = build_model(bert_max_seq_length)
  File "Albert.py", line 55, in build_model
    model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
  File "/home/****/py_transformers/lib/python3.5/site-packages/tensorflow_core/python/training/tracking/base.py", line 457, in _method_wrapper
    result = method(self, *args, **kwargs)
  File "/home/bighanem/py_transformers/lib/python3.5/site-packages/tensorflow_core/python/keras/engine/training.py", line 471, in compile
    ' model.compile(...)'% (v, strategy))
ValueError: Variable (<tf.Variable 'bert/embeddings/word_embeddings:0' shape=(30000, 128) dtype=float32>) was not created in the distribution strategy scope of (<tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x7f62e399df60>). It is most likely due to not all layers or the model or optimizer being created outside the distribution strategy scope. Try to make sure your code looks similar to the following.
with strategy.scope():
  model=_create_model()
  model.compile(...)
Could it be because the Albert model was prepared (built and compiled) beforehand by the TensorFlow team?
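For reference, my reading of what the error message asks for, as a minimal sketch: the model (including the hub layer it contains) is both created and compiled inside the scope. optimizer="adam" is a placeholder, not my actual optimizer:

with strategy.scope():
    model = build_model(bert_max_seq_length)  # hub.KerasLayer instantiated in here
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])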
Edited:
To be precise, the TensorFlow version is 2.1.
Also, this is how I load the Albert pretrained model:
import tensorflow_hub as hub

features = {"input_ids": in_id, "input_mask": in_mask, "segment_ids": in_segment}
albert = hub.KerasLayer(
    "https://tfhub.dev/google/albert_xxlarge/3",
    trainable=False, signature="tokens", output_key="pooled_output",
)
x = albert(features)
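(in_id, in_mask, and in_segment are not shown above; a hypothetical definition, assuming they are Keras Input tensors of length bert_max_seq_length:)

import tensorflow as tf

bert_max_seq_length = 128  # assumed value

in_id = tf.keras.layers.Input(shape=(bert_max_seq_length,), dtype=tf.int32, name="input_ids")
in_mask = tf.keras.layers.Input(shape=(bert_max_seq_length,), dtype=tf.int32, name="input_mask")
in_segment = tf.keras.layers.Input(shape=(bert_max_seq_length,), dtype=tf.int32, name="segment_ids")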
Tags: tensorflow tf.keras multi-gpu pre-trained-model tensorflow-hub