【Posted on】: 2019-07-03 15:32:25
【Problem description】:
If I split a sentence with nltk.tokenize.word_tokenize() and then rejoin the tokens with ' '.join(), the result is not identical to the original, because punctuation attached to words gets split off into separate tokens.
How can I programmatically rejoin the tokens so the sentence reads the way it did before?
from nltk import word_tokenize
sentence = "Story: I wish my dog's hair was fluffier, and he ate better"
print(sentence)
=> Story: I wish my dog's hair was fluffier, and he ate better
tokens = word_tokenize(sentence)
print(tokens)
=> ['Story', ':', 'I', 'wish', 'my', 'dog', "'s", 'hair', 'was', 'fluffier', ',', 'and', 'he', 'ate', 'better']
sentence = ' '.join(tokens)
print(sentence)
=> Story : I wish my dog 's hair was fluffier , and he ate better
Note that ':' and "'s" are no longer attached to the preceding words as they were in the original.
【Discussion】:
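One possible approach (assuming a reasonably recent NLTK, which ships TreebankWordDetokenizer) is to reverse the Treebank tokenization rules instead of plain ' '.join(). A minimal sketch using the token list shown above:

```python
from nltk.tokenize.treebank import TreebankWordDetokenizer

# The token list produced by word_tokenize() in the question above.
tokens = ['Story', ':', 'I', 'wish', 'my', 'dog', "'s", 'hair', 'was',
          'fluffier', ',', 'and', 'he', 'ate', 'better']

# TreebankWordDetokenizer applies regex rules that re-attach punctuation
# (':', ',') and clitics such as "'s" to the preceding word.
detok = TreebankWordDetokenizer().detokenize(tokens)
print(detok)
```

Note that detokenization is rule-based, so a perfect round trip is not guaranteed for every input (unusual quoting or spacing may not be restored exactly), but for sentences like this one the punctuation is rejoined as in the original.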