TensorFlow text tokenizer tokenizes incorrectly

【Question title】: Tensorflow text tokenizer incorrect tokenization
【Posted】: 2021-08-25 02:27:56
【Question】:

I am trying to use the TF Tokenizer for an NLP model:

from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=200, split=" ")
sample_text = ["This is a sample sentence1 created by sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ", 
               "This is another sample sentence1 created by another sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]

tokenizer.fit_on_texts(sample_text)

print (tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))

Output:

[[1, 7, 8, 9]]

Word index:

print(tokenizer.index_word[8])  ===> 'ab'
print(tokenizer.index_word[9])  ===> 'cdefghijklmnopqrstuvwxyz'

The problem is that in this case the tokenizer also splits on the "." character. I passed split = " " to the Tokenizer, so I expected the following output:

[[1,7,8]], where tokenizer.index_word[8] should be 'ab.cdefghijklmnopqrstuvwxyz'

because I want the tokenizer to create words based on spaces (" ") only, not on any special characters.

How can I make the tokenizer create tokens only on spaces?

【Comments】:

Tags: tensorflow keras text tensorflow2.0

【Solution 1】:

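The Tokenizer takes another parameter called filters, which currently defaults to all ASCII punctuation plus tab and newline (filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'). During tokenization, every character contained in filters is replaced by the specified split string.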

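If you look at the source code of the Tokenizer, specifically at the fit_on_texts method, you will see that it uses the function text_to_word_sequence, which receives the filters characters and treats them the same as the split string it also receives: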

def text_to_word_sequence(... ):
    ...
    translate_dict = {c: split for c in filters}
    translate_map = maketrans(translate_dict)
    text = text.translate(translate_map)

    seq = text.split(split)
    return [i for i in seq if i]

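So, to split only on the specified split string, just pass an empty string to the filters parameter, for example Tokenizer(num_words=200, split=" ", filters='').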
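The effect of the filter characters can be reproduced without TensorFlow. The following is a minimal sketch that mimics the excerpt quoted in this answer; split_words and DEFAULT_FILTERS are hypothetical names introduced for illustration, not part of the Keras API, and the lowercasing step is assumed to match the Tokenizer default (lower=True).

```python
# Hypothetical stand-in for the helper quoted above; this is NOT the Keras source,
# only a sketch of the same translate-then-split idea.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Replace every character in `filters` with `split`, then split on `split`."""
    if lower:
        # Assumption: mirrors the Tokenizer default of lowercasing the input.
        text = text.lower()
    # Map each filter character to the split string.
    # An empty `filters` string yields an empty map, so nothing is replaced.
    translate_map = str.maketrans({c: split for c in filters})
    text = text.translate(translate_map)
    # Drop the empty strings produced by consecutive separators.
    return [word for word in text.split(split) if word]


sentence = "sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"

# With the default filters, "." is turned into a space, so the last word is cut in two.
default_tokens = split_words(sentence)
print(default_tokens)
# ['sample', 'person', 'ab', 'cdefghijklmnopqrstuvwxyz']

# With filters="", only spaces separate the words.
space_only_tokens = split_words(sentence, filters="")
print(space_only_tokens)
# ['sample', 'person', 'ab.cdefghijklmnopqrstuvwxyz']
```

With an empty filters string, no punctuation is removed at all, so trailing commas or periods stay attached to words. If only some characters should survive, a filters string with just those characters removed may be a better fit.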

【Comments】: