【Question title】: How to tokenize sentences using NLP
【Posted】: 2019-04-08 18:16:51
【Question】:

I am new to NLP. I am trying to split text into sentences with NLTK on Python 3.7, so I used the following code:

import nltk
text4 = "This is the first sentence.A gallon of milk in the U.S. cost $2.99.Is this the third sentence?Yes,it is!"
x = nltk.sent_tokenize(text4)
x[0]

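I expected x[0] to return the first sentence, but I got the whole string back:

Out[4]: 'This is the first sentence.A gallon of milk in the U.S. cost $2.99.Is this the third sentence?Yes,it is!'

Am I doing something wrong?

【Comments】:

  • There are no spaces after the punctuation marks, so the tokenizer does not see these as separate sentences.

Tags: python nlp tokenize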


【Solution 1】:

The text needs proper spacing after sentence-ending punctuation for the tokenizer to split it correctly:

import nltk

text4 = "This is a sentence. This is another sentence."
nltk.sent_tokenize(text4)

# ['This is a sentence.', 'This is another sentence.']

# Versus what you had before

nltk.sent_tokenize("This is a sentence.This is another sentence.")

# ['This is a sentence.This is another sentence.']
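
If the input cannot be fixed at the source, one option is to restore the missing spaces before tokenizing. The following is a minimal sketch, not part of NLTK. It assumes a sentence boundary is a '.', '?' or '!' that follows a lowercase letter or digit and is immediately followed by an uppercase letter. The helper name `add_missing_spaces` is hypothetical.

```python
import re

# Zero-width position: just after [.?!] that follows a lowercase letter or digit,
# and just before an uppercase letter. Periods inside "U.S." are left alone
# because they follow uppercase letters; the one inside "2.99" is left alone
# because a digit, not an uppercase letter, comes after it.
_MISSING_SPACE = re.compile(r"(?<=[a-z0-9][.?!])(?=[A-Z])")


def add_missing_spaces(text):
    """Insert a space at sentence boundaries that lack one (heuristic only)."""
    return _MISSING_SPACE.sub(" ", text)


raw = "This is the first sentence.A gallon of milk in the U.S. cost $2.99.Is this the third sentence?Yes,it is!"
fixed = add_missing_spaces(raw)
print(fixed)
```

The repaired string can then be passed to nltk.sent_tokenize. This is only a heuristic: it will also insert a space into tokens such as "obj.Method", where a lowercase letter, a period, and an uppercase letter appear together for other reasons.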


    【Solution 2】:

    NLTK's sent_tokenize does not handle malformed text well. If you provide proper spacing, it works.

    import nltk
    nltk.download('punkt')  # sentence tokenizer data; newer NLTK releases may ask for 'punkt_tab' instead
    text4 = "This is the first sentence. A gallon of milk in the U.S. cost $2.99. Is this the third sentence? Yes, it is"
    x = nltk.sent_tokenize(text4)
    x[0]
    

    Or you can split with a regular expression instead:

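    import re
    text4 = "This is the first sentence. A gallon of milk in the U.S. cost 2.99. Is this the third sentence? Yes it is"
    # Caveat: this pattern splits at every '.', '?' or '!', so "U.S." and "2.99"
    # are broken apart too, and the punctuation itself is dropped from the result.
    sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text4)
    sentences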
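
The regular expression above splits at every period, including the ones inside "U.S." and "2.99". A slightly stricter variant, sketched here under the assumption that a new sentence always starts with an uppercase letter after whitespace, splits only at such positions and keeps the punctuation. The name `split_sentences` is hypothetical, and this is not how NLTK works internally.

```python
import re

# Split on whitespace that follows '.', '?' or '!' and precedes an uppercase letter.
# The lookbehind leaves the punctuation attached to the sentence it ends.
_BOUNDARY = re.compile(r"(?<=[.?!])\s+(?=[A-Z])")


def split_sentences(text):
    """Heuristic sentence splitter; not a replacement for a trained tokenizer."""
    return _BOUNDARY.split(text.strip())


text4 = "This is the first sentence. A gallon of milk in the U.S. cost 2.99. Is this the third sentence? Yes it is"
for sentence in split_sentences(text4):
    print(sentence)
```

This still misfires when an abbreviation is followed by a capitalized word ("Dr. Smith", "U.S. Government"), which is the kind of case a trained tokenizer such as NLTK's is meant to handle.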
    
