【Question Title】: Rejoin sentence like original after tokenizing with nltk word_tokenize
【Posted】: 2019-07-03 15:32:25
【Question】:

If I split a sentence with nltk.tokenize.word_tokenize() and then rejoin it with ' '.join(), the result is not identical to the original, because words with attached punctuation are split into separate tokens.

How can I programmatically rejoin the tokens so the sentence reads exactly as before?

from nltk import word_tokenize

sentence = "Story: I wish my dog's hair was fluffier, and he ate better"
print(sentence)
=> Story: I wish my dog's hair was fluffier, and he ate better

tokens = word_tokenize(sentence)
print(tokens)
=> ['Story', ':', 'I', 'wish', 'my', 'dog', "'s", 'hair', 'was', 'fluffier', ',', 'and', 'he', 'ate', 'better']

sentence = ' '.join(tokens)
print(sentence)
=> Story : I wish my dog 's hair was fluffier , and he ate better

Note: 's is spaced differently from the original.

【Comments】:

    Tags: python nltk tokenize


    【Solution 1】:

    From this answer: you can use MosesDetokenizer to solve this.

    Remember to download the required NLTK sub-package first: nltk.download('perluniprops')

    >>> import nltk
    >>> sentence = "Story: I wish my dog's hair was fluffier, and he ate better"
    >>> tokens = nltk.word_tokenize(sentence)
    >>> tokens
    ['Story', ':', 'I', 'wish', 'my', 'dog', "'s", 'hair', 'was', 'fluffier', ',', 'and', 'he', 'ate', 'better']
    >>> from nltk.tokenize.moses import MosesDetokenizer
    >>> detokens = MosesDetokenizer().detokenize(tokens, return_str=True)
    >>> detokens
    "Story: I wish my dog's hair was fluffier, and he ate better"
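Note that `nltk.tokenize.moses` was removed in later NLTK releases (the Moses tools now live in the separate `sacremoses` package), so the import above may fail on a current install. As a sketch of an alternative, NLTK's built-in `TreebankWordDetokenizer` reverses `word_tokenize` without any extra downloads:

```python
# Detokenize with NLTK's built-in Treebank detokenizer; no perluniprops
# download is needed, only the nltk package itself.
from nltk.tokenize.treebank import TreebankWordDetokenizer

tokens = ['Story', ':', 'I', 'wish', 'my', 'dog', "'s", 'hair',
          'was', 'fluffier', ',', 'and', 'he', 'ate', 'better']
detok = TreebankWordDetokenizer().detokenize(tokens)
print(detok)
```

On this sentence the detokenizer reattaches the clitic and the punctuation, so the output should match the original closely.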
    

    【Discussion】:

      【Solution 2】:

      After joining, you can use the replace function (note it only fixes the cases you handle explicitly; the comma below still keeps its leading space):

       sentence.replace(" '", "'").replace(" : ", ": ")
       # output:
       # Story: I wish my dog's hair was fluffier , and he ate better
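The chained-replace idea generalizes with regular expressions. Below is a minimal sketch (not from the original answers) that strips the space before common punctuation and English contraction tokens in one pass each:

```python
import re

tokens = ['Story', ':', 'I', 'wish', 'my', 'dog', "'s", 'hair',
          'was', 'fluffier', ',', 'and', 'he', 'ate', 'better']

sentence = ' '.join(tokens)
# Reattach punctuation tokens to the preceding word.
sentence = re.sub(r"\s+([:;,.!?])", r"\1", sentence)
# Reattach contraction/clitic tokens produced by word_tokenize.
sentence = re.sub(r"\s+('s|'re|'ve|n't|'ll|'d|'m)\b", r"\1", sentence)
print(sentence)
# → Story: I wish my dog's hair was fluffier, and he ate better
```

This handles the comma as well as the colon and 's, but like any rule-based detokenizer it will miss cases the patterns don't cover (quotes, parentheses, etc.).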
      

      【Discussion】:
