tensorflow_hub to pull BERT embedding on windows machine - extending to albert

[Question title]: tensorflow_hub to pull BERT embedding on windows machine - extending to albert
[Posted]: 2020-03-28 10:13:49
[Question description]:

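Recently I posted this question and tried to solve my problem. My questions are:

1. Is my approach correct?
2. My example sentences have lengths 7 and 6 (['New Delhi is the capital of India', 'The capital of India is Delhi']). Even if I add the cls and sep tokens, the lengths are 9 and 8. The max_seq_len parameter is 10, so why are the last rows of x1 and x2 not the same?
3. How do I get embeddings when I have a paragraph with more than 2 sentences? Do I have to pass one sentence at a time? But in that case, won't I lose information because I am not passing all the sentences together?
   - I did some additional research, and it seems I can pass the whole paragraph as a single sentence, using segment_ids of 0 for all the words in the paragraph. Is that correct?
4. How do I get ALBERT embeddings? I see that ALBERT also has a tokenization.py file, but I don't see a vocab.txt. I do see a file named 30k-clean.vocab. Can I use 30k-clean.vocab in place of vocab.txt?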

[Question comments]:

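Point no. 2: sentence 1 has length 7 and sentence 2 has length 6
I have fixed that part
1. Your approach seems correct
2. Could you check the tokenization of sentences 1 and 2 with the tokenizer? That should show whether either sentence contains extra word pieces
In general, wordpiece tokenization splits a word when it is not in the vocabulary, which produces more tokens than there are input words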

Tags: python windows tensorflow tensorflow-hub

[Solution 1]:

@user2543622, you can refer to the official code here. In your case, you can do something like this:

import tensorflow as tf  # TF 1.x: hub.Module and tf.Session are TF1 APIs
import tensorflow_hub as hub

albert_module = hub.Module("https://tfhub.dev/google/albert_base/2", trainable=True)
print(albert_module.get_signature_names()) # should output ['tokens', 'tokenization_info', 'mlm']
# then fetch the tokenizer assets that the module exposes
tokenization_info = albert_module(signature="tokenization_info",
                                  as_dict=True)
with tf.Session() as sess:
  vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"],
                                        tokenization_info["do_lower_case"]])
print(vocab_file) # output b'/var/folders/v6/vnz79w0d2dn95fj0mtnqs27m0000gn/T/tfhub_modules/098d91f064a4f53dffc7633d00c3d8e87f3a4716/assets/30k-clean.model'

I guess this vocab_file is a binary sentencepiece model file, so you should use it for tokenization as follows, rather than using 30k-clean.vocab.

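# you still need the tokenization.py code from the ALBERT repo to perform full tokenization
# (assumption: its FullTokenizer accepts spm_model_file; BERT's tokenization.py does not)
import tokenization

# the asset returned by the hub module is the sentencepiece model, so pass it as spm_model_file
tokenizer = tokenization.FullTokenizer(
  vocab_file=None, do_lower_case=do_lower_case,
  spm_model_file=vocab_file)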

If you only need the values of the embedding matrix, take a look at albert_module.variable_map, for example:

print(albert_module.variable_map['bert/embeddings/word_embeddings'])
# <tf.Variable 'module/bert/embeddings/word_embeddings:0' shape=(30000, 128) dtype=float32>
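
To make it clearer what that matrix gives you, here is a minimal sketch of a lookup in a table of the same shape (30000, 128). The table is filled with random numbers as a stand-in for the real variable, and the token ids and the `lookup` helper are hypothetical. Note that rows taken straight from this matrix are context-free word-piece vectors, not the contextual outputs of the model.

```python
import numpy as np

# Stand-in for the (30000, 128) word_embeddings variable; random values, for illustration only.
rng = np.random.default_rng(0)
word_embeddings = rng.normal(size=(30000, 128)).astype(np.float32)

def lookup(token_ids, table):
    """Return one embedding row per token id (a plain, context-free lookup)."""
    return table[np.asarray(token_ids, dtype=np.int64)]

token_ids = [2, 13, 1, 3]  # hypothetical ids, not real ALBERT vocabulary ids
vectors = lookup(token_ids, word_embeddings)
print(vectors.shape)  # (4, 128)
```

With the real module you would first evaluate the variable in a session to get a NumPy array, then index it the same way.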

[Comments]:

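I don't want to connect to the internet (https://tfhub.dev/google/albert_base/2) every time, because it slows the process down. That is why I developed the solution above, in which I have already downloaded the model to my machine. Please check my approach
Right, I understand your concern. In fact it only downloads the model the first time; after that the URL is only used as a key to retrieve the model from disk, rather than downloading it from the internet again. I have been using this feature for a while. I assume you have to download it at least once to use it, right? If you have already downloaded the model before, you can even avoid downloading it again by setting os.environ['TFHUB_CACHE_DIR'] = 'your model location'
Could you answer questions 2, 3, 4? And/or could you compute the embeddings for these two sentences and paste them here, so that I can check them against the embeddings I get?
I put a simple demo here, not sure if it is what you want.
I am getting an error: spm_model_file b'C:\\Users\\nnn\\AppData\\Local\\Temp\\tfhub_modules\\098d91f064a4f53dffc7633d00c3d8e87f3a4716\\assets\\30k-clean.model' Traceback (most recent call last): File "<ipython-input-13-51ee36f48688>", line 20, in <module> spm_model_file=spm_model_file) TypeError: __init__() got an unexpected keyword argument 'spm_model_file'. Do I have to download tokenization.py? Could you also look at my questions 1, 2, 3, especially 2?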
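
On the caching point raised in the comments above, here is a rough sketch of how the cache location can be set and how the cached folder name appears to be derived. It rests on an assumption: that tensorflow_hub names the folder after the SHA-1 hex digest of the handle string. That would fit the 40-character folder name in the paths above, but it is not verified here. The helper `expected_cache_dir` and the cache location are hypothetical.

```python
import hashlib
import os

def expected_cache_dir(handle, cache_root):
    """Guess the on-disk folder for a hub handle (assumed: SHA-1 hex digest of the handle)."""
    digest = hashlib.sha1(handle.encode("utf8")).hexdigest()
    return os.path.join(cache_root, digest)

# Hypothetical cache location; set it before the module is loaded.
os.environ["TFHUB_CACHE_DIR"] = os.path.join(os.path.expanduser("~"), "tfhub_modules")

handle = "https://tfhub.dev/google/albert_base/2"
module_dir = expected_cache_dir(handle, os.environ["TFHUB_CACHE_DIR"])
# If the folder already exists, the module should load from disk without a download.
print(module_dir, os.path.isdir(module_dir))
```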

[Solution 2]:

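Your approach seems correct
Could you check the tokenization of sentences 1 and 2 with the tokenizer? That should show whether either sentence contains extra word pieces. It can be checked as follows: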

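import tokenization
# replace the placeholder string with the path to your vocab file
tokenizer = tokenization.FullTokenizer(vocab_file="<PATH to Vocab file>", do_lower_case=True)
tokens = tokenizer.tokenize(example.text_a)  # example.text_a is the raw sentence string
print(tokens)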

This should give you the list of wordpiece tokens, without the [CLS] and [SEP] tokens.

In general, wordpiece tokenization splits a word when it is not in the vocabulary, which produces more tokens than there are input words.
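
To illustrate the splitting described above, here is a minimal sketch of BERT-style greedy longest-match-first wordpiece splitting. The toy vocabulary and the `wordpiece` helper are made up for this example and are not the real tokenizer. ALBERT itself uses a sentencepiece model, so its pieces will look different.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary, for illustration only.
toy_vocab = {"the", "capital", "of", "em", "##bed", "##ding", "##s"}
sentence = "the capital of embeddings"
tokens = [p for w in sentence.split() for p in wordpiece(w, toy_vocab)]
print(tokens)  # ['the', 'capital', 'of', 'em', '##bed', '##ding', '##s']
```

Here 4 input words become 7 tokens, which is the kind of length mismatch worth ruling out when comparing padded rows.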

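You can pass both sentences together, provided the length after wordpiece tokenization does not exceed the max_sequence length.
The vocab file for ALBERT is at ./data/vocab.txt, provided you got the ALBERT code from here. If you got the model from tf-hub, the file is 2/assets/30k-clean.vocab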
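
As a rough illustration of passing one or two sentences together, the sketch below builds the token list, input mask and segment ids the BERT way: [CLS] a [SEP] b [SEP], with segment id 0 for the first text and 1 for the second. The `build_features` helper is hypothetical and uses whitespace splitting in place of the real tokenizer. It raises an error when the sequence is too long, whereas the reference pipelines typically truncate.

```python
def build_features(tokens_a, tokens_b=None, max_seq_len=10):
    """Build BERT-style tokens, input mask and segment ids, padded to max_seq_len."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    if len(tokens) > max_seq_len:
        raise ValueError("sequence of %d tokens exceeds max_seq_len=%d"
                         % (len(tokens), max_seq_len))
    pad = max_seq_len - len(tokens)
    input_mask = [1] * len(tokens) + [0] * pad  # 1 = real token, 0 = padding
    return tokens + ["[PAD]"] * pad, input_mask, segment_ids + [0] * pad

a = "New Delhi is the capital of India".split()
b = "The capital of India is Delhi".split()

# Single sentence: 7 words + [CLS] + [SEP] = 9 real tokens, 1 padding position.
tokens, mask, segs = build_features(a, max_seq_len=10)
print(sum(mask))  # 9

# Sentence pair: 1 + 7 + 1 + 6 + 1 = 16 real tokens.
tokens, mask, segs = build_features(a, b, max_seq_len=20)
print(sum(mask), segs)
```

A whole paragraph passed as a single text would simply take the first branch, with every segment id set to 0.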

[Comments]:

I got the results below: tokens = tokenizer.tokenize('New Delhi is the capital of India') print(tokens) ['New', 'Delhi', 'is', 'the', 'capital', 'of', 'India'] tokens = tokenizer.tokenize('The capital of India is Delhi') print(tokens) ['The', 'capital', 'of', 'India', 'is', 'Delhi'] It seems there are no extra word pieces :( So I am still not sure: why are the last rows of x1 and x2 not the same?
Because your first sentence has 7 words and the second has 8!!
No, the second sentence has 6 words: ['The', 'capital', 'of', 'India', 'is', 'Delhi']
I changed the max_seq_len parameter to 20, but the values of the last element are still different. convert_sentences_to_features(sentences, tokenizer, 20) # max_seq_len parameter