Posted: 2021-10-19 04:52:13
Question:
The following code does not produce a token for the unicode string '\uf0b7':
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)
test_words = ['crazy', 'character', '\uf0b7']
input_ids = tokenizer(test_words, is_split_into_words=True)
print(f'token ids: {input_ids["input_ids"]}')
# token ids: [101, 4689, 2839, 102] # <- where is the token for the third word?
print(f'word ids: {input_ids.word_ids()}')
# word ids: [None, 0, 1, None] # <- where is the third word (index 2)?
Is there a way to tell the tokenizer to assign the unicode word a token (for example the unknown token [UNK], or anything else)?
I tried adding a normalizer, but the output is the same:
from transformers import AutoTokenizer
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
normalizer = normalizers.Sequence([NFD(), StripAccents()])
tokenizer.normalizer = normalizer
input_ids = tokenizer(test_words, is_split_into_words=True)
print(f'token ids: {input_ids["input_ids"]}')
# token ids: [101, 4689, 2839, 102]
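A likely explanation for the disappearing character (an editorial note, not part of the original question): BERT's text-cleanup step deletes any character whose Unicode general category starts with "C" (control, format, private use, etc.), and '\uf0b7' lies in the Private Use Area, category Co, so it is removed before tokenization even starts. A minimal stdlib sketch mirroring that check:

```python
import unicodedata

def is_bert_control(char):
    # Mirrors the check in BERT's original tokenization code:
    # tab/newline/CR are kept as whitespace, but any other
    # character whose Unicode category starts with "C" is
    # deleted during text cleanup.
    if char in ("\t", "\n", "\r"):
        return False
    return unicodedata.category(char).startswith("C")

print(unicodedata.category("\uf0b7"))  # Co (private use)
print(is_bert_control("\uf0b7"))       # True -> stripped before tokenization
```

Because the character is deleted during cleanup rather than looked up in the vocabulary, it never reaches the step that would map an out-of-vocabulary word to [UNK].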
Tags: python nlp huggingface-tokenizers