Finding all the concepts in an input string against a large list of concept strings using python: answers

【Question title】: Finding all the concepts in an input string against a large list of concept strings using python
【Posted】: 2018-12-08 02:01:56
【Question description】:

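I have a large number (say 30 million) of concept strings in a database, each at most 13 words long. Given an input string (up to 3 sentences), I want to find all the concepts from the database that occur in the input string.

I am using python for this. I load all the concepts from the database into a list, loop through the list, and check whether each concept occurs in the input string. Because the search is sequential, it takes a long time, and I have to repeat it for hundreds of input strings.

To prune some iterations, I tokenize the input string and load only the concepts that contain at least one of the tokens and whose length is less than or equal to the length of the input string. This requires an SQL query to load the shortlisted concepts into a list. The list may still contain 20 million concepts, so the process is still not fast.

Any idea how to make this process more efficient?

For better visualization, here is a small Pythonic example: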

inputString = "The cow is a domestic animal. It has four legs, one tail, two eyes"

# Load the concepts from the database that contain any of the words in the
# input string (after removing stop words). Assume the list is as follows.
concepts = ["cow", "domestic animal", "domestic bird", "domestic cat", "domestic dog", "one eye", "two eyes", "two legs", "four legs", "two ears"]

for c in concepts:
    if c in inputString:
        print('found ' + c + ' in ' + inputString)

It would be great if you could give me some suggestions to make it more efficient.

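【Comments】:

This may not be the answer you are looking for, but print statements are very expensive. Remove the print statement, save the results to a list, and print the list at the end. It will be noticeably faster.
Thanks for your input. Much appreciated. I only showed it for the sake of the example, but I will keep it in mind.
No problem. I ran into a similar issue with a large dataset, and when I removed the print statements it ran about 5-10 times faster.

Tags: python performance list search substring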

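【Solution 1】:

You should use sets, which are much faster than lists or full-text search for looking up items: a set membership test takes constant time on average, while scanning a list takes time proportional to its length.

Put the concepts into a dict of sets, indexed by the number of words. Then split inputString into a list of words and, for each word count, slide a rolling window of that many words over the list and test whether the joined words exist in the set indexed by the same word count.

So given the following initialization: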

from collections import defaultdict
import re
inputString = "The cow is a domestic animal. It has four legs, one tail, two eyes"
concepts = ["cow", "domestic animal", "domestic bird", "forever and ever", "practice makes perfect", "i will be back"]

we break concepts into a dict of sets, indexed by the number of words in the concepts each set contains:

concept_sets = defaultdict(set)
for concept in concepts:
    concept_sets[len(concept.split())].add(concept)

so that concept_sets becomes:

{1: {'cow'}, 2: {'domestic bird', 'domestic animal'}, 3: {'practice makes perfect', 'forever and ever'}, 4: {'i will be back'}}
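One thing to watch: the input is lowercased and stripped of punctuation before matching, so a concept stored as "Cow" or with doubled spaces would never match. Below is a minimal sketch of normalizing the concepts the same way while building the index. The names `WORD_RE`, `normalize`, and `build_concept_sets` are hypothetical helpers, not part of the original answer.

```python
import re
from collections import defaultdict

# Same word pattern the answer applies to the input string.
WORD_RE = re.compile(r'[a-z]+', re.IGNORECASE)

def normalize(text):
    """Lowercase the text and join its words with single spaces."""
    return ' '.join(word.lower() for word in WORD_RE.findall(text))

def build_concept_sets(concepts):
    """Group normalized concepts into sets keyed by their word count."""
    concept_sets = defaultdict(set)
    for concept in concepts:
        normalized = normalize(concept)
        if normalized:  # skip concepts that contain no recognizable words
            concept_sets[len(normalized.split())].add(normalized)
    return concept_sets

concept_sets = build_concept_sets(["Cow", "domestic  animal", "I will be back!"])
```

If the original spelling of a concept is needed in the output, each set could be replaced by a dict mapping the normalized form back to the original string.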

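We then convert inputString into a list of lowercase words so that matching is case-insensitive. Note that you may want to refine the regex here so that it accepts certain other characters as part of a "word".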

input_words = list(map(str.lower, re.findall(r'[a-z]+', inputString, re.IGNORECASE)))
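As one possible refinement of that regex, here is a sketch that also treats digits as word characters and keeps apostrophes and hyphens that sit inside a word. The pattern and the `tokenize` helper are assumptions for illustration; which characters count as part of a word depends on your concepts, and the concepts would have to be split with the same pattern.

```python
import re

# Letters and digits, optionally joined by internal apostrophes or hyphens.
WORD_RE = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")

def tokenize(text):
    """Return the lowercase words found in text."""
    return WORD_RE.findall(text.lower())

tokens = tokenize("The cow's well-known COVID-19 test, it's 4 legs.")
print(tokens)
```

The group is non-capturing, so `findall` returns whole words rather than just the last suffix.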

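Finally, we loop through each concept set in concept_sets together with its word count, slide a rolling window of the same number of words over the list of input words, and test whether the joined words exist in the set.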

for num_words, concept_set in concept_sets.items():
    for i in range(len(input_words) - num_words + 1):
        words = ' '.join(input_words[i: i + num_words])
        if words in concept_set:
            print("found '%s' in '%s'" % (words, inputString))

This outputs:

found 'cow' in 'The cow is a domestic animal. It has four legs, one tail, two eyes'
found 'domestic animal' in 'The cow is a domestic animal. It has four legs, one tail, two eyes'
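Since the search has to run for hundreds of input strings, it may help to wrap the steps above in a function that returns the matches instead of printing them. This is only a sketch of the same rolling-window lookup; `find_concepts` is a hypothetical name, and it assumes the concepts in the sets are already lowercase and separated by single spaces.

```python
import re
from collections import defaultdict

def find_concepts(text, concept_sets):
    """Return every concept from concept_sets that occurs in text."""
    words = [w.lower() for w in re.findall(r'[a-z]+', text, re.IGNORECASE)]
    found = []
    for num_words, concept_set in concept_sets.items():
        # Windows longer than the input produce an empty range and are skipped.
        for i in range(len(words) - num_words + 1):
            candidate = ' '.join(words[i:i + num_words])
            if candidate in concept_set:
                found.append(candidate)
    return found

concepts = ["cow", "domestic animal", "domestic bird", "forever and ever",
            "practice makes perfect", "i will be back"]
concept_sets = defaultdict(set)
for concept in concepts:
    concept_sets[len(concept.split())].add(concept)

matches = find_concepts(
    "The cow is a domestic animal. It has four legs, one tail, two eyes",
    concept_sets)
print(matches)
```

The number of set lookups is roughly the number of distinct concept lengths (at most 13 here) times the number of words in the input, so it does not grow with the number of concepts.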

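【Comments】:

Thanks for the detailed answer. My main scenario is actually a bit different: as part of it, I also remove all the spaces so that the algorithm can catch cases where two words were merged because of a typo. But your idea looks interesting; I will try to map my approach onto it and check the performance. Thanks again.
You're welcome. If you find it a good solution in your tests, please mark the answer as accepted. I would also like to know how many times faster this solution is than your original one. Cheers.
I have now implemented this solution with a dict of sets, and removed the SQL query by building the dict ahead of time and saving it with cPickle. Later I load the dict from cPickle into main memory. I used some pruning techniques to reduce the dict accesses. The initial load takes time, but searching is much faster. Next I will work on reducing the initial cPickle load time. I have seen some ways to handle the loading (stackoverflow.com/questions/26860051/…), but that is out of scope for now. Thanks!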
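For readers on Python 3, where the separate cPickle module no longer exists and the standard `pickle` module uses its C implementation automatically, the save-once, load-later step described in the last comment might look like the sketch below. The helper names and the file location are hypothetical, and this is not the commenter's actual code.

```python
import os
import pickle
import tempfile
from collections import defaultdict

def save_concept_sets(concept_sets, path):
    """Write the dict of sets to disk using the newest pickle protocol."""
    with open(path, 'wb') as f:
        # Store a plain dict so later lookups cannot create empty entries.
        pickle.dump(dict(concept_sets), f, protocol=pickle.HIGHEST_PROTOCOL)

def load_concept_sets(path):
    """Read the dict of sets back into memory."""
    with open(path, 'rb') as f:
        return pickle.load(f)

concept_sets = defaultdict(set)
for concept in ["cow", "domestic animal", "domestic bird"]:
    concept_sets[len(concept.split())].add(concept)

# Hypothetical location; any writable path works.
path = os.path.join(tempfile.gettempdir(), 'concept_sets.pkl')
save_concept_sets(concept_sets, path)
loaded = load_concept_sets(path)
```

The loaded object is a plain dict, so the rolling-window loop over `.items()` works on it unchanged.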