Python: How to find the string with most matches in a list of strings (answers)

【Question title】: Python: How to find the string with most matches in a list of strings
【Posted】: 2012-03-14 18:57:46
【Question】:

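I'll try to explain what I need in as much detail as I can:

I'm using feedparser to parse an RSS feed in Python. The feed of course has a list of items, each with a title, a link, and a description, like any normal RSS feed.

On the other hand, I have a list of strings containing some keywords that I need to find in the item descriptions.

What I need to do is find the item with the most keyword matches.

Example:

RSS feed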

<channel>
    <item>
        <title>Lion</title>
        <link>...</link>
        <description>
            The lion (Panthera leo) is one of the four big cats in the genus 
            Panthera, and a member of the family Felidae.
        </description>
    </item>
    <item>
        <title>Panthera</title>
        <link>...</link>
        <description>
            Panthera is a genus of the Felidae (cats), which contains 
            four well-known living species: the tiger, the lion, the jaguar, and the leopard.
        </description>
    </item>
    <item>
        <title>Cat</title>
        <link>...</link>
        <description>
            The domestic cat is a small, usually furry, domesticated, 
            carnivorous mammal. It is often called the housecat, or simply the 
            cat when there is no need to distinguish it from other felids and felines.
        </description>
    </item>
</channel>

Keyword list

['cat', 'lion', 'panthera', 'family']

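So in this case, the item with the most (unique) matches is the first one, because it contains all 4 keywords (it doesn't matter that it says 'cats' rather than 'cat'; I only need to find the literal keyword inside the string).

Let me clarify: even if some description contained the keyword 'cat' 100 times (and none of the other keywords), it would not be the winner, because I'm looking for the item that contains the most distinct keywords, not the most occurrences of a single keyword.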
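
The distinction drawn above, distinct keywords versus total occurrences, can be sketched as follows. This is a minimal illustration, not the asker's code: the function names `total_occurrences` and `distinct_keywords` and the two sample strings are assumptions made up for the example.

```python
keywords = ['cat', 'lion', 'panthera', 'family']

def total_occurrences(text):
    # Counts every occurrence, so a single keyword repeated many times scores high.
    lowered = text.lower()
    return sum(lowered.count(word) for word in keywords)

def distinct_keywords(text):
    # Counts each keyword at most once, however often it appears.
    lowered = text.lower()
    return sum(1 for word in keywords if word in lowered)

spam = "cat " * 100           # one keyword, repeated 100 times
mixed = "The lion is a cat."  # two different keywords, once each

print(total_occurrences(spam), distinct_keywords(spam))    # 100 1
print(total_occurrences(mixed), distinct_keywords(mixed))  # 2 2
```

Ranking by `distinct_keywords` puts `mixed` ahead of `spam`, which is the behavior the question asks for.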

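Right now I'm looping over the RSS items and doing it "manually", counting the number of times a keyword appears (but I run into the problem mentioned in the previous paragraph).

I'm new to Python and I come from a different language (C#), so I apologize if this is trivial.

How would you approach this problem?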

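【Comments】:

The answers below are all great, but watch out for partial matches (does concatenate count as an occurrence of cat?) and capitalization (does Cat count as a match?).
Yes, for now 'concatenate' counts as an occurrence of 'cat', and the match does not need to be case-sensitive. Thanks for the warning.

Tags: python string list rss string-matching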

【Solution 1】:

texts = [ "The lion (Panthera leo) ...", "Panthera ...", "..." ]
keywords  = ['cat', 'lion', 'panthera', 'family']

# gives the count of `word in text`
def matches(text):
    return sum(word in text.lower() for word in keywords)

# or inline that helper function as a lambda:
# matches = lambda text:sum(word in text.lower() for word in keywords)

# print the one with the highest count of matches
print(max(texts, key=matches))

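【Comments】:

Amazing solution. Would you mind explaining how this code works? I'm reading up on lambdas in Python right now.
The lowercase conversion is missing.
@NiklasB. Yes, you're right. I just replaced the 'texts' argument of the max function with: [x.lower() for x in texts]
@emzero: You could also use sum(word in text.lower() for word in keywords)
@NiklasB. Right, that's better, because it prints the original string rather than the lowercased one. Thanks.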
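
For readers asking how the `sum(...)` and `max(..., key=...)` combination works, here is a rough expansion into explicit loops. It is a sketch for explanation only: the sample `texts` are the three descriptions from the question (the third one shortened), and the variables `best` and `best_score` are names introduced for this illustration.

```python
texts = [
    "The lion (Panthera leo) is one of the four big cats in the genus "
    "Panthera, and a member of the family Felidae.",
    "Panthera is a genus of the Felidae (cats), which contains four "
    "well-known living species: the tiger, the lion, the jaguar, and the leopard.",
    "The domestic cat is a small, usually furry, domesticated, carnivorous mammal.",
]
keywords = ['cat', 'lion', 'panthera', 'family']

def matches(text):
    # `word in lowered` is True or False; summing booleans counts the Trues,
    # so each keyword contributes at most 1 regardless of how often it occurs.
    lowered = text.lower()
    count = 0
    for word in keywords:
        if word in lowered:
            count += 1
    return count

# max(texts, key=matches) calls matches() on every text and returns the
# text with the largest result (the first one, if there is a tie).
best, best_score = None, -1
for text in texts:
    score = matches(text)
    if score > best_score:
        best, best_score = text, score

print([matches(t) for t in texts])  # [4, 3, 1]
print(best_score)                   # 4
```

Because the key function only computes the score, `max` still returns the original, non-lowercased string.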

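【Solution 2】:

The other answers are very elegant, but may be too simple for the real world. Some of the ways they might break include:

Partial word matches - should 'cat' match 'concatenate'? What about 'cats'?
Case sensitivity - should 'cat' match 'CAT'? What about 'Cat'?

My solution below allows for both of these cases.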

import re

test_text = """
Cat?

The domestic cat is a small, usually furry, domesticated, 
carnivorous mammal. It is often called the housecat, or simply the 
cat when there is no need to distinguish it from other felids and felines.
"""

wordlist = ['cat','lion','feline']
# Construct regexp like r'\W(cat|lion|feline)(s?)\W'
# Matches cat, lion or feline as a whole word ('cat' matches, 'concatenate'
# does not match)
# also allow for an optional trailing 's', so that both 'cat' and 'cats' will
# match.
wordlist_re = r'\W(' + '|'.join(wordlist) + r')(s?)\W'

# Get list of all matches from text. re.I means "case insensitive".
matches = re.findall(wordlist_re, test_text, re.I)

# Build list of matched words. the `[0]` means first capture group of the regexp
matched_words = [ match[0].lower() for match in matches]

# See which words occurred
unique_matched_words = [word for word in wordlist if word in matched_words]

# Count unique words
num_unique_matched_words = len(unique_matched_words)

The output looks like this:

>>> wordlist_re
'\\W(cat|lion|feline)(s?)\\W'
>>> matches
[('Cat', ''), ('cat', ''), ('cat', ''), ('feline', 's')]
>>> matched_words
['cat', 'cat', 'cat', 'feline']
>>> unique_matched_words
['cat', 'feline']
>>> num_unique_matched_words
2
>>>
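
To rank several texts with this whole-word approach, the steps above could be wrapped in a scoring function and passed to `max`. The sketch below is a variation, not the answer's exact code: it swaps `\W` for the zero-width `\b` word boundary (so a keyword at the very start or end of the text is also found), escapes the keywords with `re.escape`, and collects matches in a set. The function name `count_unique_keywords` and the sample texts are assumptions for illustration.

```python
import re

def count_unique_keywords(text, wordlist):
    # Whole-word match with an optional trailing 's'; with a single capture
    # group, re.findall returns a list of the captured keyword strings.
    pattern = r'\b(' + '|'.join(re.escape(w) for w in wordlist) + r')s?\b'
    found = {m.lower() for m in re.findall(pattern, text, re.I)}
    return len(found)

texts = [
    "Cats and lions are felines.",
    "Please concatenate the strings.",
    "The housecat sat near a CAT.",
]
wordlist = ['cat', 'lion', 'feline']

scores = [count_unique_keywords(t, wordlist) for t in texts]
print(scores)  # [3, 0, 1]

best = max(texts, key=lambda t: count_unique_keywords(t, wordlist))
print(best)    # Cats and lions are felines.
```

Note that 'concatenate' and 'housecat' score nothing here, which is stricter than what the asker said they wanted.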

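【Comments】:

According to the question, partial matches are fine and the search should be case-insensitive.
Side note: case-insensitive regexes are sometimes a bad idea (they can be slow because of backtracking). You could call lower() on the whole string first, but be careful with unicode strings (what is 'Главное в новостях'.lower()?)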
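
As a small check of the Unicode concern, the sketch below assumes Python 3, where `str` is a Unicode string; a Python 2 byte string may behave differently, which is likely what the comment is warning about. The `'Straße'` example is an addition for illustration and does not come from the thread.

```python
import re

headline = 'Главное в новостях'

# In Python 3, lower() lowercases Cyrillic letters as well as ASCII ones.
print(headline.lower())     # главное в новостях

# casefold() is a more aggressive lowercasing intended for caseless comparison.
print('Straße'.casefold())  # strasse

# With str patterns, re.I also ignores case for non-ASCII letters.
print(bool(re.search('новостях', 'ГЛАВНОЕ В НОВОСТЯХ', re.I)))  # True
```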