Return a list of matches for a given phrase: answers

【Question title】: Return a list of matches by given phrase
【Posted】: 2014-11-25 10:15:58
【Question】:

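I am trying to create a method that checks whether a given phrase matches at least one item in a list of phrases, and returns the matching items. The inputs are the phrase, a list of phrases, and a dictionary of synonym lists. The point is to make it general.

Here is an example: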

phrase = 'This is a little house'
dictSyns = {'little':['small','tiny','little'],
            'house':['cottage','house']}
listPhrases = ['This is a tiny house','This is a small cottage','This is a small building','I need advice']

For this particular example I can write code that does the check and returns a bool:

# Hard-coded check: build every synonym variant and compare it with the phrase.
if any('This is a ' + x + ' ' + y == phrase
       for x in dictSyns['little'] for y in dictSyns['house']):
    print('match')

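The first point is that I need the function to be general rather than hard-coded for one phrase. The second is that I want the function to return the list of matching phrases.

Could you suggest how to write the method so that, in this case, it returns ['This is a tiny house','This is a small cottage']?

The output would look like this: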

>>> getMatches(phrase, dictSyns, listPhrases)
['This is a tiny house','This is a small cottage']

【Comments on the question】:

Tags: python nlp text-processing synonym

【Solution 1】:

I would approach it like this:

import itertools

def new_phrases(phrase, syns):
    """Generate new phrases from a base phrase and synonyms."""
    words = [syns.get(word, [word]) for word in phrase.split(' ')]
    for t in itertools.product(*words):
        yield ' '.join(t)

def get_matches(phrase, syns, phrases):
    """Generate acceptable new phrases based on a whitelist."""
    phrases = set(phrases)
    for new_phrase in new_phrases(phrase, syns):
        if new_phrase in phrases:
            yield new_phrase

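The core of the code is the assignment of words in new_phrases, which turns phrase and syns into a more useful form: a list in which each element is the list of acceptable choices for that word: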

>>> [syns.get(word, [word]) for word in phrase.split(' ')]
[['This'], ['is'], ['a'], ['small', 'tiny', 'little'], ['cottage', 'house']]

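Note the following:

the use of generators to handle a large number of combinations more efficiently (the whole list is not built at once);
the use of a set for efficient membership testing (O(1), versus O(n) for a list);
the use of itertools.product to generate the possible combinations of phrase based on syns (in Python 2 you could also use itertools.ifilter in implementing this; in Python 3 the built-in filter plays that role); and
compliance with the style guide (PEP 8).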
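
As an illustration of the points about generators and itertools.product, here is a minimal sketch that counts how many candidate phrases would be produced and then pulls only the first two lazily. The helper count_combinations is a hypothetical name introduced for this example, and the sample data is taken from the question.

```python
import itertools
from functools import reduce
from operator import mul

def new_phrases(phrase, syns):
    """Lazily yield every phrase obtainable by swapping in synonyms."""
    words = [syns.get(word, [word]) for word in phrase.split(' ')]
    for combo in itertools.product(*words):
        yield ' '.join(combo)

def count_combinations(phrase, syns):
    """Hypothetical helper: how many phrases new_phrases would yield."""
    sizes = [len(syns.get(word, [word])) for word in phrase.split(' ')]
    return reduce(mul, sizes, 1)

syns = {'little': ['small', 'tiny', 'little'], 'house': ['cottage', 'house']}
phrase = 'This is a little house'

total = count_combinations(phrase, syns)  # 1 * 1 * 1 * 3 * 2
# islice stops after two items, so the remaining combinations are never built.
first_two = list(itertools.islice(new_phrases(phrase, syns), 2))
print(total)
print(first_two)
```

The count is the product of the synonym-list lengths, so it grows quickly for long phrases with many synonyms; that growth is why generating lazily matters.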

In use:

>>> list(get_matches(phrase, syns, phrases))
['This is a small cottage', 'This is a tiny house']

Things to think about:

How should character case be handled (e.g. how should "House of Commons" be treated)?
What about punctuation?

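【Discussion】:

Thanks, it helped a lot. Very nice approach. Case: I would change this line to: if new_phrase.lower() in [x.lower() for x in phrases]. Punctuation (commas and dots): I would use .strip(', ').strip('. ')
@Milan Note that your lowercase approach is very inefficient: it reprocesses phrases for every new_phrase, doesn't use a set, and doesn't lowercase when generating the new phrases. You will also have to think carefully about at which step to strip (note that you can just use strip(",.")).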
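
The points raised in this exchange could be sketched roughly as follows: normalise the candidate phrases once, keep them in a hash-based lookup, and lowercase the synonyms before generating combinations. This is only one possible reading of the advice, not code from either commenter; normalize and get_matches_normalized are hypothetical names, and the sketch assumes punctuation appears only at the ends of a phrase.

```python
import itertools

def normalize(text):
    """Strip leading/trailing spaces, commas and dots, then lowercase."""
    return text.strip(' ,.').lower()

def get_matches_normalized(phrase, syns, phrases):
    """Yield the original phrases that match, ignoring case and edge punctuation."""
    # Normalise the candidate phrases once, remembering the original spelling.
    lookup = {normalize(p): p for p in phrases}
    # Lowercase the synonym table so it agrees with the normalised phrase.
    lowered = {key.lower(): [s.lower() for s in values]
               for key, values in syns.items()}
    words = [lowered.get(word, [word]) for word in normalize(phrase).split(' ')]
    for combo in itertools.product(*words):
        candidate = ' '.join(combo)
        if candidate in lookup:  # O(1) dictionary lookup
            yield lookup[candidate]

syns = {'little': ['small', 'tiny', 'little'], 'house': ['cottage', 'house']}
phrases = ['This is a Tiny house.', 'this is a small cottage', 'I need advice']
matches = list(get_matches_normalized('This is a little house', syns, phrases))
print(matches)
```

Punctuation inside a phrase is left untouched here, and if two input phrases normalise to the same text only the last one is kept in the lookup; both choices would need revisiting for real data.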

【Solution 2】:

This is how I approached it:

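# Build the list of acceptable words: the phrase's own words plus every synonym.
acceptable = phrase.split()
for key in dictSyns:
    acceptable = acceptable + dictSyns[key]

for each_phrase in listPhrases:
    # Print the phrase only if every one of its words is acceptable.
    if all(word in acceptable for word in each_phrase.split()):
        print(each_phrase)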

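It is probably not very efficient. It builds a list of acceptable words, then compares every word of each string against that list; if a string contains no unacceptable words, it prints that phrase.

Edit: I have also realised that this does not check whether the result makes grammatical sense. For example, the phrase "little little this a" would still be reported as a match, because the code only checks each word individually. I will leave it here as a monument to my shame.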
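
One way to address the word-order problem, without generating every combination, is to compare the candidate with the base phrase position by position. The sketch below is a different technique from either answer, offered under the assumption that every synonym is a single word; matches_in_order and get_matches_in_order are hypothetical names.

```python
def matches_in_order(phrase, syns, candidate):
    """Return True if candidate equals phrase word for word, allowing synonyms."""
    base_words = phrase.split()
    cand_words = candidate.split()
    if len(base_words) != len(cand_words):
        return False
    # Each candidate word must be an accepted choice for the word in the same position.
    return all(cand in syns.get(base, [base])
               for base, cand in zip(base_words, cand_words))

def get_matches_in_order(phrase, syns, phrases):
    """Hypothetical helper: keep only the phrases that match in order."""
    return [p for p in phrases if matches_in_order(phrase, syns, p)]

phrase = 'This is a little house'
dictSyns = {'little': ['small', 'tiny', 'little'],
            'house': ['cottage', 'house']}
listPhrases = ['This is a tiny house', 'This is a small cottage',
               'This is a small building', 'I need advice',
               'house little a is This']
result = get_matches_in_order(phrase, dictSyns, listPhrases)
print(result)
```

Because each candidate is checked directly, the cost grows with the number of candidate phrases rather than with the number of synonym combinations, and the scrambled phrase is rejected.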

【Discussion】: