Consequences of abusing nltk's word_tokenize(sent): answers

【Question title】: Consequences of abusing nltk's word_tokenize(sent)
【Posted】: 2013-10-22 18:56:57
【Question description】:

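I'm trying to split a paragraph into words. I have the lovely nltk.tokenize.word_tokenize(sent) at hand, but help(word_tokenize) says, "This tokenizer is designed to work on a sentence at a time."

Does anyone know what happens if you use it on a paragraph, i.e. at most 5 sentences, instead? I've tried it on a few short paragraphs myself and it seems to work, but that's hardly conclusive proof.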

【Question comments】:

nltk.word_tokenize() now works on text containing multiple sentences.

Tags: python nltk

【Solution 1】:

Try this kind of hack:

>>> from string import punctuation as punct
>>> sent = "Mr President, Mr President-in-Office, indeed we know that the MED-TV channel and the newspaper Özgür Politika provide very in-depth information. And we know the subject matter. Does the Council in fact plan also to use these channels to provide information to the Kurds who live in our countries? My second question is this: what means are currently being applied to integrate the Kurds in Europe?"
>>> # Add spaces around each punctuation character (one pass per character).
>>> for ch in punct:
...     sent = sent.replace(ch, " " + ch + " ")
...
>>> # Collapse the repeated spaces introduced by the padding.
>>> sent = " ".join(sent.split())

In that case, the following code is most likely what you need to count the frequencies as well =)

>>> from nltk.tokenize import word_tokenize
>>> from nltk.probability import FreqDist
>>> fdist = FreqDist(word.lower() for word in word_tokenize(sent))
>>> for i in fdist:
...     print(i, fdist[i])
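
For readers who want to try the idea without NLTK installed, here is a minimal standard-library sketch of the same hack. It swaps FreqDist for collections.Counter and word_tokenize for a plain whitespace split, so it is an approximation and not what NLTK does internally. The names pad_punctuation, word_frequencies and sample are made up for illustration.

```python
from collections import Counter
from string import punctuation


def pad_punctuation(text):
    """Surround every ASCII punctuation character with spaces."""
    for ch in punctuation:
        text = text.replace(ch, " " + ch + " ")
    # Collapse the runs of whitespace introduced by the padding.
    return " ".join(text.split())


def word_frequencies(text):
    """Count lowercased, whitespace-separated tokens after padding punctuation."""
    return Counter(token.lower() for token in pad_punctuation(text).split())


# Hypothetical sample text, shortened from the sentence used above.
sample = "And we know the subject matter. Does the Council know?"
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

Note that this padding also breaks hyphenated words and contractions apart ("in-depth" becomes three tokens), which may or may not be what you want for frequency counts.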

【Comments】:

Great hack! I'll give it a try!

【Solution 2】:

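nltk.tokenize.word_tokenize(text) is just a thin wrapper function that calls the tokenize method of an instance of the TreebankWordTokenizer class, which apparently uses simple regular expressions to parse a sentence.

The documentation for that class states:

This tokenizer assumes that the text has already been segmented into sentences. Any periods, apart from those at the end of a string, are assumed to be part of the word they are attached to (e.g. for abbreviations, etc.), and are not separately tokenized.

The underlying tokenize method itself is quite simple: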

def tokenize(self, text):
    for regexp in self.CONTRACTIONS2:
        text = regexp.sub(r'\1 \2', text)
    for regexp in self.CONTRACTIONS3:
        text = regexp.sub(r'\1 \2 \3', text)

    # Separate most punctuation
    text = re.sub(r"([^\w\.\'\-\/,&])", r' \1 ', text)

    # Separate commas if they're followed by space.
    # (E.g., don't separate 2,500)
    text = re.sub(r"(,\s)", r' \1', text)

    # Separate single quotes if they're followed by a space.
    text = re.sub(r"('\s)", r' \1', text)

    # Separate periods that come before newline or end of string.
    text = re.sub('\. *(\n|$)', ' . ', text)

    return text.split()

Basically, what the method does is tokenize a period as a separate token only if it falls at the end of the string:

>>> nltk.tokenize.word_tokenize("Hello, world.")
['Hello', ',', 'world', '.']

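Any period that falls inside the string is tokenized as part of the word it is attached to, on the assumption that it marks an abbreviation: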

>>> nltk.tokenize.word_tokenize("Hello, world. How are you?") 
['Hello', ',', 'world.', 'How', 'are', 'you', '?']

As long as that behavior is acceptable, you should be fine.
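
If it is not acceptable, the difference can be seen with a small standard-library sketch. The function below applies only the punctuation regexes from the method quoted above (the contraction handling is left out), and the per-sentence wrapper uses a naive regex split as a stand-in for a real sentence tokenizer. Both function names are hypothetical, and the splitter is an assumption that will break on abbreviations such as "Mr. Smith".

```python
import re


def tokenize_like_excerpt(text):
    """Apply the punctuation regexes from the quoted method, minus contractions."""
    # Separate most punctuation.
    text = re.sub(r"([^\w\.\'\-\/,&])", r" \1 ", text)
    # Separate commas if they are followed by whitespace.
    text = re.sub(r"(,\s)", r" \1", text)
    # Separate single quotes if they are followed by whitespace.
    text = re.sub(r"('\s)", r" \1", text)
    # Separate periods that come before a newline or the end of the string.
    text = re.sub(r"\. *(\n|$)", " . ", text)
    return text.split()


def tokenize_per_sentence(paragraph):
    """Split into sentences first, then tokenize each sentence on its own."""
    # Assumption: a naive split after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [tok for sent in sentences for tok in tokenize_like_excerpt(sent)]


paragraph = "Hello, world. How are you?"
whole = tokenize_like_excerpt(paragraph)
split_first = tokenize_per_sentence(paragraph)
print(whole)
print(split_first)
```

Tokenizing the paragraph in one go leaves 'world.' as a single token, while splitting into sentences first yields 'world' and '.' separately, which matters if you are counting word frequencies.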

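【Comments】:

Aha, that behavior is not acceptable, since I'm using word frequencies for text classification. What a thorough answer, thank you!
This advice is now outdated. nltk.word_tokenize() now uses the Punkt sentence tokenizer to split the text into sentences before determining tokens.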