As for why this doesn't work:
for sent in brown_sents:
    for word, tag in sent:
        if (word == 'to' and tag == 'TO-HL'):
            print sent
        if (word != 'to' and tag != 'TO-HL'):
            break
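Before explaining: your code isn't really close to producing the output you want, because your if-else statements don't do what you need.
First you need to understand what the multiple conditions (i.e. the 'if's) are doing.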
# Loop through the sentences
for sent in brown_sents:
    # Loop through each word with its POS tag
    for word, tag in sent:
        # For each (word, tag) pair, check whether it is 'to' tagged 'TO-HL'
        if word == 'to' and tag == 'TO-HL':
            print sent  # If the condition is true, print sent
        # After checking the first if, you go on to check the second if:
        # if word is not 'to' and tag is not 'TO-HL',
        # you break out of the sentence. Note that you are still
        # in the same iteration as the previous condition.
        if word != 'to' and tag != 'TO-HL':
            break
Now let's start with some basic if-else statements:
>>> from nltk.corpus import brown
>>> first_sent = brown.tagged_sents()[0]
>>> first_sent
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), (u'County', u'NN-TL'), (u'Grand', u'JJ-TL'), (u'Jury', u'NN-TL'), (u'said', u'VBD'), (u'Friday', u'NR'), (u'an', u'AT'), (u'investigation', u'NN'), (u'of', u'IN'), (u"Atlanta's", u'NP$'), (u'recent', u'JJ'), (u'primary', u'NN'), (u'election', u'NN'), (u'produced', u'VBD'), (u'``', u'``'), (u'no', u'AT'), (u'evidence', u'NN'), (u"''", u"''"), (u'that', u'CS'), (u'any', u'DTI'), (u'irregularities', u'NNS'), (u'took', u'VBD'), (u'place', u'NN'), (u'.', u'.')]
>>> for word, pos in first_sent:
...     if word != 'to' and pos != 'TO-HL':
...         break
...     else:
...         print 'say hi'
...
>>>
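In the example above, we loop through every word+POS pair in the sentence, and at EVERY word-POS pair the if condition checks whether the word is not 'to' and the POS is not 'TO-HL'. If that is the case, it breaks and never says hi to you.
So if you keep your code with these if-else conditions, you will ALWAYS break without continuing the loop, because 'to' is not the first word in the sentence and the tag doesn't match either.
In effect, your if condition is trying to check whether EVERY word is 'to' and whether its POS tag is 'TO-HL'.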
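To make the early exit concrete, here is a minimal sketch that records which pairs the loop actually looks at before it breaks. The toy sentence and the helper name `tokens_visited` are made up for illustration; they are not from the Brown corpus or from the question.

```python
# Toy tagged sentence (hypothetical data, not from the Brown corpus).
toy_sent = [('He', 'PPS'), ('wants', 'VBZ'), ('to', 'TO'), ('go', 'VB')]

def tokens_visited(sent):
    """Return the (word, tag) pairs the loop inspects before it breaks."""
    visited = []
    for word, tag in sent:
        visited.append((word, tag))
        # Same condition as in the question: true for almost every token.
        if word != 'to' and tag != 'TO-HL':
            break
    return visited

print(tokens_visited(toy_sent))  # [('He', 'PPS')]
```

The loop stops at the very first pair, so the 'to' further along in the sentence is never reached.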
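What you want to do is check:
1. whether there is a 'to' in the sentence, rather than whether every word is 'to', and then
2. whether the 'to' in the sentence carries the POS tag you are looking for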
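As a rough sketch of these two checks taken together, one option is to test each (word, tag) pair with the built-in `any()`. This is only one possible formulation, not the approach walked through in this answer; the helper name `has_tagged_word` and the toy sentences are assumptions made up for illustration.

```python
def has_tagged_word(sent, target_word, target_tag):
    """True if at least one (word, tag) pair in sent matches both targets."""
    return any(word == target_word and tag == target_tag
               for word, tag in sent)

# Hypothetical toy data standing in for brown.tagged_sents().
toy_sents = [
    [('He', 'PPS'), ('wants', 'VBZ'), ('to', 'TO'), ('go', 'VB')],
    [('How', 'WRB-HL'), ('to', 'TO-HL'), ('vote', 'VB-HL')],
    [('No', 'AT'), ('evidence', 'NN')],
]

matches = [s for s in toy_sents if has_tagged_word(s, 'to', 'TO-HL')]
print(matches)  # only the second toy sentence
```

Because the word and the tag are compared on the same pair, a sentence with several 'to' tokens is handled without any extra bookkeeping.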
So the if condition needed for condition (1) is:
>>> from nltk.corpus import brown
>>> three_sents = brown.tagged_sents()[:3]
>>> for sent in three_sents:
...     if 'to' in dict(sent):
...         print sent
...
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]
Now you know that if 'to' in dict(sent) checks whether 'to' is in the sentence.
Then, to check condition (2):
>>> for sent in three_sents:
...     if 'to' in dict(sent):
...         if dict(sent)['to'] == 'TO':
...             print sent
...
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]
>>> for sent in three_sents:
...     if 'to' in dict(sent):
...         if dict(sent)['to'] == 'TO-HL':
...             print sent
...
>>>
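Now you see that if dict(sent)['to'] == 'TO-HL', placed AFTER you have checked if 'to' in dict(sent), is the condition that enforces the POS restriction.
But notice that if a sentence contains two 'to's, dict(sent)['to'] only captures the POS of the last 'to'. That is why you need the defaultdict(list) suggested in the previous answer.
There really is no clean way to perform the check, and the most efficient way is the one described in the previous answer, sigh.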
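The difference can be seen on a small made-up sentence. The sketch below is only a guess at the idea behind the defaultdict(list) suggestion, not a copy of the previous answer; the toy data and variable names are assumptions.

```python
from collections import defaultdict

# Hypothetical toy sentence with two 'to' tokens carrying different tags.
toy_sent = [('to', 'TO-HL'), ('go', 'VB'), ('to', 'IN'), ('town', 'NN')]

# dict() keeps one value per key, so the later 'to' overwrites the earlier one.
last_tag_only = dict(toy_sent)

# defaultdict(list) collects every tag seen for each word.
tags_by_word = defaultdict(list)
for word, tag in toy_sent:
    tags_by_word[word].append(tag)

print(last_tag_only['to'])            # IN
print(tags_by_word['to'])             # ['TO-HL', 'IN']
print('TO-HL' in tags_by_word['to'])  # True
```

With the plain dict the 'TO-HL' tag is lost, while the list-valued mapping still lets you test for it.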