[Posted]: 2021-08-25 02:27:56
[Question]:
I am trying to use the TF Tokenizer for an NLP model:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=200, split=" ")
sample_text = ["This is a sample sentence1 created by sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ",
               "This is another sample sentence1 created by another sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]
tokenizer.fit_on_texts(sample_text)
print(tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))
Output:
[[1, 7, 8, 9]]
Word_Index:
print(tokenizer.index_word[8]) ===> 'ab'
print(tokenizer.index_word[9]) ===> 'cdefghijklmnopqrstuvwxyz'
The problem is that the tokenizer creates tokens based on '.' in this case. I passed split=" " to the Tokenizer, so I expected the following output:
[[1,7,8]], where tokenizer.index_word[8] should be 'ab.cdefghijklmnopqrstuvwxyz'
since I want the tokenizer to create words based on the space (" ") character only, not on any special characters.
How do I get the tokenizer to create tokens only on spaces?
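(Not part of the original post: the behavior comes from the Tokenizer's filters argument rather than from split. The default filters string, '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', includes the period, and every filtered character is replaced with the split character before the text is split, which is why '.' breaks the token. A minimal sketch of a workaround, assuming TensorFlow 2.x, is to pass a filters string with the period removed; no_period_filters below is just an illustrative name.)

from tensorflow.keras.preprocessing.text import Tokenizer

# Keras' default filter set minus the '.' character, so periods survive tokenization.
no_period_filters = '!"#$%&()*+,-/:;<=>?@[\\]^_`{|}~\t\n'

tokenizer = Tokenizer(num_words=200, split=" ", filters=no_period_filters)
sample_text = ["This is a sample sentence1 created by sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ",
               "This is another sample sentence1 created by another sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]
tokenizer.fit_on_texts(sample_text)
print(tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))
# Prints [[1, 7, 8]], with tokenizer.index_word[8] == 'ab.cdefghijklmnopqrstuvwxyz'

Passing filters='' would disable character filtering entirely, so tokens are split only on the split character; the trade-off is that all other punctuation then also stays attached to words.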
[Discussion]:
Tags: tensorflow keras text tensorflow2.0