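【Question title】: Extracting sentences from a POS-tagged corpus with certain word, tag combos
【Posted】: 2014-11-22 00:58:12
【Question description】:

I am using the Brown corpus, specifically the tagged sentences in the "news" category. I found that "to" is the word with the most ambiguous tags (TO, IN, TO-HL, IN-HL, IN-TL, NPS). I am trying to write code that prints one sentence from the corpus for each tag associated with "to". The sentences do not need to be "cleaned" of tags; they only need to contain "to" with each of the associated POS tags.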

import nltk

brown_sents = nltk.corpus.brown.tagged_sents(categories="news")
for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == "IN"):
            print sent

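I tried the code above with just one of the POS tags to see whether it works, but it prints every instance. I need it to print only the first sentence it finds that matches the word, tag combination, and then stop. I tried this: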

for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'IN'):
            print sent
        if (word != 'to' and tag != 'IN'):
            break

This works for this POS tag because it is the first one associated with "to", but if I use:

for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'TO-HL'):
            print sent
        if (word != 'to' and tag != 'TO-HL'):
            break

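it returns nothing. I think I am close; would anyone care to help?

【Discussion】:

I added a supplementary answer after you changed your question. Hope it helps.

Tags: python loops nltk pos-tagger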

【Solution 1】:

You could keep adding to your current code, but it does not account for the following:

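What happens if "to" occurs more than once in a sentence, with the same or different POS tags?
If "to" with the same POS occurs twice in a sentence, do you want the sentence printed twice?
What happens if "to" is the first word of the sentence and is capitalized?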
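
To make these three cases concrete, here is a small sketch on a hand-made tagged sentence. The sentence and the helper name count_to_hits are invented for illustration and are not taken from the Brown corpus.

```python
# Hypothetical tagged sentence, invented to show the edge cases;
# it is not taken from the Brown corpus.
toy_sent = [('To', 'TO'), ('go', 'VB'), ('to', 'IN'), ('town', 'NN'),
            ('to', 'TO'), ('shop', 'VB'), ('.', '.')]


def count_to_hits(sent, pos, ignore_case=False):
    """Count the tokens in `sent` that are 'to' tagged with `pos`."""
    hits = 0
    for word, tag in sent:
        if ignore_case:
            word = word.lower()
        if word == 'to' and tag == pos:
            hits += 1
    return hits


# Exact matching skips the capitalized, sentence-initial 'To'.
print(count_to_hits(toy_sent, 'TO'))
# Lowercasing first finds two 'TO' hits in the same sentence, so a loop
# that prints on every hit would print this sentence twice.
print(count_to_hits(toy_sent, 'TO', ignore_case=True))
# The same sentence also contains 'to' with a different tag.
print(count_to_hits(toy_sent, 'IN'))
```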

If you want to stick with your code, try this:

from nltk.corpus import brown

brown_sents = brown.tagged_sents(categories="news")

def to_pos_sent(pos):
    for sent in brown_sents:
        for word, tag in sent:
            if word == 'to' and tag == pos:
                yield sent

for sent in to_pos_sent('TO'):
    print sent

for sent in to_pos_sent('IN'):
    print sent
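
Because to_pos_sent is a generator, one possible way to get only the first matching sentence is to ask it for a single item with next(). The following is a minimal sketch under some assumptions: toy_sents is a small invented list standing in for the Brown sentences so that it runs without the corpus, and the function takes the sentences as a parameter, which differs slightly from the version above.

```python
# Hypothetical stand-in for brown.tagged_sents(categories="news").
toy_sents = [
    [('He', 'PPS'), ('went', 'VBD'), ('to', 'IN'), ('town', 'NN')],
    [('She', 'PPS'), ('wants', 'VBZ'), ('to', 'TO'), ('go', 'VB')],
    [('Try', 'VB'), ('to', 'TO'), ('listen', 'VB'), ('to', 'IN'), ('him', 'PPO')],
]


def to_pos_sent(sents, pos):
    """Yield each sentence that contains 'to' tagged with `pos`."""
    for sent in sents:
        for word, tag in sent:
            if word == 'to' and tag == pos:
                yield sent
                break  # one hit is enough; move on to the next sentence


# next() pulls a single item; the second argument is returned
# when the generator has nothing to give.
first_TO = next(to_pos_sent(toy_sents, 'TO'), None)
first_TO_HL = next(to_pos_sent(toy_sents, 'TO-HL'), None)
print(first_TO)
print(first_TO_HL)
```

With the real corpus, the same call would stop scanning as soon as the first match is found, since the generator is only advanced once.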

I suggest storing the sentences in a defaultdict(list); then you can retrieve them at any time.

from nltk.corpus import brown
from collections import Counter, defaultdict

sents_with_to = defaultdict(list)

to_counts = Counter()

for i, sent in enumerate(brown.tagged_sents(categories='news')):
    # Check if 'to' is in sentence.
    uniq_words = dict(sent)
    if 'to' in uniq_words or 'To' in uniq_words:
        # Iterate through the sentence to find 'to'
        for word, pos in sent:
            if word.lower()=='to':
                # Store the tagged sentence under this POS tag
                sents_with_to[pos].append(sent)
                to_counts[pos]+=1


for pos in sents_with_to:
    for sent in sents_with_to[pos]:
        print pos, sent

To access the sentences for a specific POS:

for sent in sents_with_to['TO']:
    print sent

You will notice that if "to" with a particular POS occurs twice in a sentence, the sentence is recorded twice in sents_with_to[pos]. To remove the duplicates, try:

sents_with_to_and_TO = set(" ".join("#".join([word, pos]) for word, pos in sent) for sent in sents_with_to['TO'])
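
Since the original goal was one sentence per tag, a variant that keeps only the first sentence seen for each tag may be enough. This is a sketch, not the answer's own code: first_sent_per_tag and toy_sents are hypothetical names, and the toy data is invented so the example runs without the corpus.

```python
# Hypothetical stand-in for brown.tagged_sents(categories="news").
toy_sents = [
    [('He', 'PPS'), ('went', 'VBD'), ('to', 'IN'), ('town', 'NN')],
    [('She', 'PPS'), ('wants', 'VBZ'), ('to', 'TO'), ('go', 'VB')],
    [('Back', 'RB'), ('to', 'IN'), ('work', 'NN')],
    [('How', 'WRB-HL'), ('to', 'TO-HL'), ('vote', 'VB-HL')],
]


def first_sent_per_tag(sents, target='to'):
    """Map each tag seen on `target` to the first sentence carrying it."""
    examples = {}
    for sent in sents:
        for word, tag in sent:
            # Only record a tag the first time it is seen.
            if word.lower() == target and tag not in examples:
                examples[tag] = sent
    return examples


examples = first_sent_per_tag(toy_sents)
for tag in sorted(examples):
    print(tag)
    print(examples[tag])
```

Each tag maps to exactly one sentence, so there are no duplicates to strip afterwards.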

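【Discussion】:

Thanks @alvas, but is there some operator I can add to the end of my existing code so that it prints only the first example it encounters? The code you wrote works, but I know there is some simple thing I could add to my existing code. It is driving me crazy!
As an efficient way to loop, use yield to return one sentence at a time rather than return to return all sentences at once.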
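
On the question in the comment: a plain break only leaves the inner loop, so the outer loop over sentences keeps running. One common way to stop at the first hit is to wrap the nested loops in a function and return from it. A minimal sketch, where print_first_match and toy_sents are hypothetical names and the data is invented:

```python
# Hypothetical stand-in for brown.tagged_sents(categories="news").
toy_sents = [
    [('He', 'PPS'), ('went', 'VBD'), ('to', 'IN'), ('town', 'NN')],
    [('She', 'PPS'), ('wants', 'VBZ'), ('to', 'TO'), ('go', 'VB')],
    [('They', 'PPSS'), ('hope', 'VB'), ('to', 'TO'), ('win', 'VB')],
]


def print_first_match(sents, target_word, target_tag):
    """Print and return the first sentence with the word, tag pair."""
    for sent in sents:
        for word, tag in sent:
            if word == target_word and tag == target_tag:
                print(sent)
                return sent  # leaves both loops at once
    return None  # no sentence matched


found = print_first_match(toy_sents, 'to', 'TO')
missing = print_first_match(toy_sents, 'to', 'TO-HL')
```

Only the second toy sentence is printed; the third also matches but is never reached.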

【Solution 2】:

As for why this does not work:

for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'TO-HL'):
            print sent
        if (word != 'to' and tag != 'TO-HL'):
            break

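Before explaining: your code is not really close to the output you want, because your if statements do not do what you need.

First you need to understand what the multiple conditions (i.e. the 'if's) are doing.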

# Loop through the sentences
for sent in brown_sents:
    # Loop through each word with its POS
    for word, tag in sent:
        # For each word, check whether it is 'to' with the tag 'TO-HL':
        if word == 'to' and tag == 'TO-HL':
            print sent # If the condition is true, print sent
        # After checking the first if, you continue to check the second if:
        # if word is not 'to' and tag is not 'TO-HL',
        # you break out of the sentence. Note that you are still
        # in the same iteration as the previous condition.
        if word != 'to' and tag != 'TO-HL':
            break

Now let's start with some basic if-else statements:

>>> from nltk.corpus import brown
>>> first_sent = brown.tagged_sents()[0]
>>> first_sent
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')]
>>> for word, pos in first_sent:
...     if word != 'to' and pos != 'TO-HL':
...             break
...     else:
...             print 'say hi'
... 
>>>

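In the example above, we loop through every word + POS pair in the sentence, and at EVERY pair the if condition checks whether the word is not 'to' and the POS is not 'TO-HL'. If so, it breaks. Here the very first pair, ('The', 'AT'), already satisfies the condition, so the loop breaks immediately and never says hi to you.

So if you keep these conditions in your code, you will always break without continuing the loop, because 'to' is not the first word of the sentence and the POS does not match.

In effect, your if conditions are trying to check whether EVERY word is 'to' with the POS tag 'TO-HL'.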
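
To see how early the break fires, the inner loop can be replayed with a counter. This is only an illustrative sketch: tokens_examined is a hypothetical helper, first_sent holds the first few tokens of the Brown sentence shown above, and toy_sent is an invented sentence.

```python
# First few tokens of the Brown sentence shown above.
first_sent = [('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'),
              ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD')]
# Invented sentence in which the token before 'to' carries the target tag.
toy_sent = [('According', 'IN'), ('to', 'IN'), ('him', 'PPO')]


def tokens_examined(sent, target_tag):
    """Replay the question's inner loop.

    Returns (number of tokens looked at before the break, whether the
    word, tag match was ever reached).
    """
    examined = 0
    matched = False
    for word, tag in sent:
        examined += 1
        if word == 'to' and tag == target_tag:
            matched = True  # the original code prints the sentence here
        if word != 'to' and tag != target_tag:
            break
    return examined, matched


print(tokens_examined(first_sent, 'TO-HL'))
print(tokens_examined(toy_sent, 'IN'))
print(tokens_examined(toy_sent, 'TO-HL'))
```

The match is only reachable when every token before it is either 'to' or already carries the target tag, which is why the loop can appear to work for a frequent tag such as 'IN' and find nothing for 'TO-HL'.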

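What you want to check is:

whether there is a 'to' in the sentence, rather than whether every word is 'to', and then
whether the 'to' in the sentence carries the POS tag you are looking for

So the if condition needed for condition (1) is: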

>>> from nltk.corpus import brown
>>> three_sents = brown.tagged_sents()[:3]
>>> for sent in three_sents:
...     if 'to' in dict(sent):
...             print sent
... 
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]

Now you know that if 'to' in dict(sent) checks whether 'to' is in the sentence.

Then, to check condition (2):

>>> for sent in three_sents:
...     if 'to' in dict(sent):
...             if dict(sent)['to'] == 'TO':
...                     print sent
... 
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]
>>> for sent in three_sents:
...     if 'to' in dict(sent):
...             if dict(sent)['to'] == 'TO-HL':
...                     print sent
... 
>>>

Now you see that if dict(sent)['to'] == 'TO-HL', placed AFTER you have checked if 'to' in dict(sent), is the condition that enforces the POS restriction.
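
The two checks can also be folded into a single test over the word, tag pairs with any(), which avoids building a dict for every sentence. A minimal sketch on invented data; has_to_with_tag and toy_sents are hypothetical names, not part of NLTK.

```python
# Hypothetical stand-in for a slice of brown.tagged_sents().
toy_sents = [
    [('The', 'AT'), ('jury', 'NN'), ('said', 'VBD'), ('.', '.')],
    [('She', 'PPS'), ('wants', 'VBZ'), ('to', 'TO'), ('go', 'VB')],
    [('He', 'PPS'), ('went', 'VBD'), ('to', 'IN'), ('town', 'NN')],
]


def has_to_with_tag(sent, pos):
    """True if some token in `sent` is 'to' tagged with `pos`."""
    return any(word == 'to' and tag == pos for word, tag in sent)


# Keep only the sentences that pass both conditions at once.
matches = [sent for sent in toy_sents if has_to_with_tag(sent, 'TO')]
print(matches)
```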

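But you will realize that if a sentence contains 2 'to's, dict(sent)['to'] only captures the POS of the last 'to'. That is why you need the defaultdict(list) suggested in the previous answer.

There is really no clean way to perform the check, and the most efficient way is the one described in the previous answer, sigh.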
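
The overwrite is easy to reproduce on a small invented sentence. The sketch below also collects every tag per word in a defaultdict(list); note that this groups tags by word, which is a variation on the previous answer (there the dictionary is keyed by POS and holds sentences). The names toy_sent and tags_by_word are hypothetical.

```python
from collections import defaultdict

# Invented sentence with two differently tagged occurrences of 'to'.
toy_sent = [('Try', 'VB'), ('to', 'TO'), ('listen', 'VB'),
            ('to', 'IN'), ('him', 'PPO')]

# dict() keeps one value per key, so the later ('to', 'IN') pair
# overwrites the earlier ('to', 'TO') pair.
last_tag = dict(toy_sent)['to']
print(last_tag)

# Appending to a list per word keeps every tag, in order of appearance.
tags_by_word = defaultdict(list)
for word, tag in toy_sent:
    tags_by_word[word].append(tag)
print(tags_by_word['to'])
```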
