BERT: Is it possible to filter the predicted tokens in masked language modelling? (Answers)

【Question title】: BERT: Is it possible to filter the predicted tokens in masked language modelling?
【Posted】: 2021-09-30 18:42:01
【Question description】:

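I trained a masked language model on my own dataset, which contains sentences with emojis (20,000 training entries).

Now, when I make predictions, I want emojis to appear in the output. However, most of the predicted tokens are words, so I think the emojis are somewhere near the bottom of the list, since they must be less frequent tokens than the words.

This is my output so far. You can see that one emoji has been predicted, but the rest of the predictions are words: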

mask_filler("I am so good today, <mask>", top_k=5)

[{'score': 0.2953376770019531,
  'sequence': 'I am so good today, friend',
  'token': 72,
  'token_str': 'friend'},
 {'score': 0.18523386120796204,
  'sequence': 'I am so good today ????',
  'token': 328,
  'token_str': '????'},
 {'score': 0.1431082785129547,
  'sequence': 'I am so good today, mate',
  'token': 2901,
  'token_str': 'mate'},
 {'score': 0.13269349932670593,
  'sequence': 'I am so good today, father',
  'token': 4,
  'token_str': 'father'},
 {'score': 0.030341114848852158,
  'sequence': 'I am so good today, mother',
  'token': 44660,
  'token_str': 'mother'},

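So I would like to know whether there is any code or function that can filter the predictions so that only emojis appear in the output, removing every predicted token that is a word.

One emoji does show up in my output, but I think the other emojis are less common tokens, so they do not appear near the top when I make predictions.

So, is it possible to filter out the word tokens and keep only the emojis?

Thank you.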

【Question discussion】:

Tags: python machine-learning bert-language-model huggingface-transformers huggingface-tokenizers

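【Solution 1】:

Yes, it should be possible. Give it a try; I am only writing a hint here.

If the output contains no alphabetic characters:

print(output)

or

You can also use a regular expression to build a pattern that matches emojis and filter the predictions with it. Please take a look at this, it may help you: removing emojis from a string in Python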
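
The regular-expression hint above could be sketched roughly as follows. This is only an illustration, not the answerer's code: `EMOJI_RE`, `is_emoji_token` and `keep_emoji_predictions` are hypothetical names, and the pattern covers a few common Unicode blocks rather than the full emoji specification. It assumes the predictions are a list of dicts with a `token_str` key, as in the output shown in the question.

```python
import re

# Rough emoji pattern (an assumption): a few common Unicode blocks only,
# not a complete definition of "emoji".
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, newer symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "]"
)


def is_emoji_token(token_str):
    """Return True if the token, ignoring surrounding whitespace,
    consists only of characters matched by EMOJI_RE."""
    text = token_str.strip()
    # Drop zero-width joiners and variation selectors that glue emojis together.
    text = text.replace("\u200d", "").replace("\ufe0f", "")
    return bool(text) and all(EMOJI_RE.fullmatch(ch) for ch in text)


def keep_emoji_predictions(predictions, top_k=5):
    """Keep only predictions whose token_str is an emoji.

    The input order (highest score first) is preserved.
    """
    emoji_only = [p for p in predictions if is_emoji_token(p["token_str"])]
    return emoji_only[:top_k]
```

With the pipeline from the question, one possible usage is `keep_emoji_predictions(mask_filler("I am so good today, <mask>", top_k=1000))`. The value 1000 is arbitrary; asking for many more candidates than are finally needed gives low-ranked emojis a chance to survive the filter. Because the function only inspects the pipeline's output, it does not depend on how the model was trained.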

【Discussion】:

Hi, thanks for your reply. I am using the Hugging Face library, though, so would this still apply to a trained model?
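
On the Hugging Face point: one way the idea might carry over to a trained model is to scan the tokenizer's vocabulary once and collect the ids whose decoded text looks like an emoji. The sketch below is an assumption-laden illustration, not part of the thread: `looks_like_emoji` and `emoji_token_ids` are hypothetical helpers, and the Unicode-category test is a heuristic that also accepts some non-emoji symbols (for example the copyright sign). `decode` stands for any callable that maps a list of token ids to a string, such as `tokenizer.decode`.

```python
import unicodedata

# Unicode general categories tolerated inside an emoji token (heuristic):
# So = other symbol, Sk = modifier symbol (skin tones),
# Mn = non-spacing mark (variation selectors), Cf = format (zero-width joiner).
ALLOWED_CATEGORIES = {"So", "Sk", "Mn", "Cf"}


def looks_like_emoji(text):
    """Heuristic check: at least one 'So' character and nothing outside
    ALLOWED_CATEGORIES. The replacement character U+FFFD, which appears when
    a byte-level token decodes to an incomplete character, is rejected."""
    text = text.strip()
    if not text or "\ufffd" in text:
        return False
    categories = [unicodedata.category(ch) for ch in text]
    return "So" in categories and all(c in ALLOWED_CATEGORIES for c in categories)


def emoji_token_ids(decode, vocab_size):
    """Return the token ids in range(vocab_size) whose decoded text looks like an emoji."""
    return [
        token_id
        for token_id in range(vocab_size)
        if looks_like_emoji(decode([token_id]))
    ]
```

With a real tokenizer this might be called as `emoji_token_ids(tokenizer.decode, len(tokenizer))`. The resulting ids could then be used to restrict which predictions are kept, for instance by comparing them with the `token` field of each pipeline prediction. The fill-mask pipeline also accepts a `targets` argument that limits scoring to given candidate tokens, which may be worth trying with the decoded emoji strings.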