The inputs into BERT are token IDs. How do I get the corresponding input token vectors into BERT? Answers

【Question title】: The inputs into BERT are token IDs. How do I get the corresponding input token vectors into BERT?
【Posted】: 2021-12-16 12:52:52
【Question】:

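I am new to this and am learning about Transformers.

In many BERT tutorials, I see that the input is just the token IDs of the words. But surely each token ID has to be converted into a vector representation (it could be a one-hot encoding, or any initial vector representation for each token ID) so that the model can use it.

My question is: where can I find the initial vector representation for each token?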

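【Comments】:

Hi, in its current state, I believe this question might get a (theoretically sound) answer on Cross Validated. Otherwise, feel free to add a more specific piece of code so that we have a rough idea of which specific model you are referring to.

Tags: nlp huggingface-transformers bert-language-model word-embedding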

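【Solution 1】:

With these TF Hub models, the input is the string itself. A preprocessing model converts it into tokens (token IDs), and the BERT encoder then creates the vectors. Let's look at an example: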

import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 (registers the ops the preprocessing model needs)

prep_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
enc_url = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4'
bert_preprocess = hub.KerasLayer(prep_url)
bert_encoder = hub.KerasLayer(enc_url)

text = ["Hello I'm new to stack overflow"]

# First, preprocess (tokenize) the raw strings.
preprocessed_text = bert_preprocess(text)
# This returns a dict with the keys input_word_ids (the token IDs),
# input_mask and input_type_ids.

encoded = bert_encoder(preprocessed_text)
# This returns a dict of outputs. encoded['pooled_output'] is the (1, 768)
# vector that summarizes the whole text in context.

# You can explore both dicts by printing their keys().
print(preprocessed_text.keys())
print(encoded.keys())

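I recommend visiting the two links above and doing some research. To recap, this BERT setup takes strings as input and then tokenizes them (with its own tokenizer!). If you want to tokenize to the same values yourself, you need the same vocabulary file, but for someone just starting out like you, this should be enough.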
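To make the role of the vocabulary file concrete, here is a minimal sketch of how a WordPiece-style tokenizer could map a string to token IDs. The ten-entry vocabulary and the helper names `wordpiece` and `encode` are made up for illustration; a real BERT vocabulary file has roughly 30,000 entries, and the real tokenizer also handles punctuation and text normalization, which this sketch skips.

```python
# Toy vocabulary (hypothetical). In a real vocab file there is one token per
# line, and the position of the line in the file is the token ID.
VOCAB = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "hello", "i", "am",
         "learn", "##ing", "bert"]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(VOCAB)}


def wordpiece(word):
    """Split one lowercase word into pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in TOKEN_TO_ID:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text):
    """Return (tokens, token IDs), wrapped in the reserved [CLS]/[SEP] tokens."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word))
    tokens.append("[SEP]")
    return tokens, [TOKEN_TO_ID[t] for t in tokens]


tokens, ids = encode("Hello I am learning BERT")
print(tokens)
print(ids)
```

With this toy vocabulary, "learning" is split into `learn` and `##ing`, which is why two tokenizers only produce the same IDs when they share the same vocabulary file.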

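【Comments】:

Thanks! So it looks like the input really is token IDs. Is there some benefit to using token IDs rather than other methods (such as bag of words, one-hot encoding, etc.)? That is what confuses me. Why use token IDs at all? It looks like an ordinal encoding scheme in which you represent words as IDs.
@woowz The encoder input is not just the token IDs. It has other inputs as well (input_mask and input_type_ids); I suggest you follow my code and look at them. But yes, the most important one is the token IDs. Also, the vocabulary is like a bag of words, but deeper. It has word pieces (e.g. ##ing) and reserved tokens such as [CLS]. All of these are contained in the vocabulary file.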
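On the question of why token IDs are used: inside the model, a token ID is not treated as an ordinal feature but as a row index into a learned embedding table, and that lookup gives the same result as multiplying a one-hot vector by the table. The sketch below illustrates this with a small random table in plain Python; the sizes, the random values, and the names `embedding_table`, `lookup` and `one_hot_matmul` are assumptions for illustration, not BERT's actual weights or code. In Hugging Face Transformers, the real table can be obtained with `model.get_input_embeddings()`, and BERT additionally adds position and segment embeddings to these vectors.

```python
import random

VOCAB_SIZE = 10   # toy size (assumption); BERT-base has roughly 30,000 tokens
HIDDEN_SIZE = 4   # toy size (assumption); BERT-base uses 768

random.seed(0)
# One row per token ID. In a real model these values are learned in training;
# here they are random placeholders.
embedding_table = [
    [random.gauss(0.0, 0.02) for _ in range(HIDDEN_SIZE)]
    for _ in range(VOCAB_SIZE)
]


def lookup(token_id):
    """Initial vector of a token: simply the row at index token_id."""
    return embedding_table[token_id]


def one_hot_matmul(token_id):
    """Same vector, computed as one_hot(token_id) times the embedding table."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(VOCAB_SIZE)]
    return [
        sum(one_hot[i] * embedding_table[i][j] for i in range(VOCAB_SIZE))
        for j in range(HIDDEN_SIZE)
    ]


token_ids = [2, 4, 7, 8, 3]  # e.g. the IDs a tokenizer produced
vectors = [lookup(t) for t in token_ids]

# The lookup and the one-hot product agree for every token.
same = all(
    abs(a - b) < 1e-12
    for t in token_ids
    for a, b in zip(lookup(t), one_hot_matmul(t))
)
print(len(vectors), len(vectors[0]))
print(same)
```

So passing token IDs is just a compact way of passing one-hot vectors: the ID selects a row, and the numeric order of the IDs carries no meaning for the model.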