HuggingFace Tokenizer: how to get a token for Unicode strings?

【Question title】: HuggingFace Tokenizer: how to get a token for Unicode strings?
【Posted】: 2021-10-19 04:52:13
【Question】:

The following code produces no token for the Unicode string '\uf0b7':

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
test_words = ['crazy', 'character', '\uf0b7']
input_ids = tokenizer(test_words, is_split_into_words=True)
print(f'token ids: {input_ids["input_ids"]}')
# token ids: [101, 4689, 2839, 102]  # <- where is the token for the third word?

print(f'word ids:  {input_ids.word_ids()}')
# word ids:  [None, 0, 1, None]   # <- where is the third word (index 2)?

Is there a way to tell the tokenizer to give the Unicode word a token (for example the unknown token [UNK], or anything else)?

I tried adding a normalizer, but the output is the same:

from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
normalizer = normalizers.Sequence([NFD(), StripAccents()])
tokenizer.normalizer = normalizer
input_ids = tokenizer(test_words, is_split_into_words=True)
print(f'token ids: {input_ids["input_ids"]}')
# token ids: [101, 4689, 2839, 102]

【Comments】:

Tags: python nlp huggingface-tokenizers

【Solution 1】:

Add the required Unicode character as a special token?

    special_tokens_dict = {'additional_special_tokens': ['\uf0b7']}
    num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
    test_words = ['crazy', 'character', '\uf0b7']
    tokenizer(test_words, is_split_into_words=True)

Output:

{'input_ids': [101, 4689, 2839, 30522, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}
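
A plausible explanation for why the character vanishes unless it is registered as a token: U+F0B7 lies in the Unicode private-use area (general category `Co`), and the BERT text-cleaning step in `transformers` appears to drop every character whose category starts with `C`, apart from tab, newline, and carriage return. The sketch below re-creates that check with the standard library only. `looks_like_control` is a hypothetical helper name, not a library function, and it only approximates what the real normalizer does.

```python
import unicodedata


def looks_like_control(ch: str) -> bool:
    """Approximate BERT's clean-text rule (an assumption, not the library code)."""
    if ch in ("\t", "\n", "\r"):
        return False  # counted as whitespace rather than removed
    # Categories Cc, Cf, Cn, Co and Cs all start with "C".
    return unicodedata.category(ch).startswith("C")


# Compare an ordinary letter, the problem character, and an accented letter.
for ch in ["a", "\uf0b7", "\u00e9"]:
    category = unicodedata.category(ch)
    print(f"U+{ord(ch):04X} category={category} dropped={looks_like_control(ch)}")
```

If this reading is right, the third word is emptied before WordPiece ever sees it, so there is nothing left to map to [UNK]. That would also explain why it gets neither a token id nor a word id.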

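【Discussion】:

Your solution works. The only "drawback" is that you need to go through the whole corpus and find every Unicode character that the original tokenizer has no token for, but I can live with that.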
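
The corpus scan mentioned in the comment could be sketched roughly as follows. This is a heuristic, not the tokenizer's own logic: it assumes that the characters silently dropped are those in a Unicode "C" (control, format, private-use, unassigned) category, which seems to match the BERT cleaning step. `find_dropped_chars` and the toy `corpus` are made up for illustration.

```python
import unicodedata
from collections import Counter


def find_dropped_chars(texts):
    """Count characters that a BERT-style cleaner would probably strip (heuristic)."""
    counts = Counter()
    for text in texts:
        for ch in text:
            if ch in "\t\n\r":
                continue  # treated as whitespace, so not reported
            if unicodedata.category(ch).startswith("C"):
                counts[ch] += 1
    return counts


# Toy corpus, for illustration only.
corpus = ["crazy character \uf0b7", "\uf0b7 bullet one", "plain text"]
candidates = sorted(find_dropped_chars(corpus))
print([f"U+{ord(c):04X}" for c in candidates])
```

The resulting `candidates` list can then be passed as `additional_special_tokens` to `tokenizer.add_special_tokens`, as in the solution above. If the tokenizer feeds a model, the embedding matrix most likely has to grow as well, for example with `model.resize_token_embeddings(len(tokenizer))`.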