How to extract word combinations of a string from a list in Python: answers

【Question title】: How to extract word combinations of a string from a list in Python
【Posted】: 2020-05-14 22:17:57
【Question description】:

I have a string like this:

my_string = "Hello, I need to find php, software-engineering, html, security and safety things or even Oracle in your dataset. #C should be another opetion, databases and queries"

And a list like this:

my_list = ['C#', 'Django' 'Software-Engineering', 'C', 'PHP', 'Oracle Cload', 'React', 'Flask', 'IT-Security market', 'Databases and Queries']

I want to extract from my_list every item that matches, fully or partially, the words in my_string.

This is what I expect:

['PHP', 'Software-Engineering', 'C', 'Oracle Cload', 'IT-Security market', 'Databases and Queries']

This is what I have tried:

import re
try:
    user_inps = re.findall(r'\w+', my_string)
    extracted_inputs = set()
    for user_inp in user_inps:
        if user_inp.lower() in set(map(lambda x: x.lower(), my_list)):
            extracted_inputs.add(user_inp)
except Exception:
    extracted_inputs = set()

But I get:

['php', 'C']

Efficiency is also a concern for me. Any help would be appreciated.

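【Question comments】:

Should the match be case-sensitive?
No, it doesn't matter. And it is not too large: a list with a few thousand elements (probably not considered huge).
When you say efficiency, which performance figures are we concerned with? Do we have a huge input string and a huge array to match against?
I have just edited my answer.
In your expected output you include words that are not elements of the list. They do, however, appear inside an element of the list, e.g. Oracle Cloud. Do you want to match any value that partially matches an element of the list?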

Tags: python python-3.x string list

【Solution 1】:

Since the solution needs to be efficient and we are starting with a few thousand items, I suggest using a Bloom filter implementation.

TL;DR

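A Bloom filter is a data structure designed to tell you, quickly and memory-efficiently, whether an element is present in a set. It can report false positives but never false negatives.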
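To make the idea concrete, here is a minimal sketch of a Bloom filter using only the standard library. The class name `TinyBloom`, its parameters, and the SHA-256 based hashing scheme are assumptions chosen for illustration; this is not how the `bloom-filter` package is implemented.

```python
import hashlib


class TinyBloom:
    """Illustrative Bloom filter: k hash positions over an m-slot array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per slot, for clarity

    def _positions(self, item):
        # Derive k positions by salting the item with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


bloom = TinyBloom()
for phrase in ["php", "oraclecloud", "databasesandqueries"]:
    bloom.add(phrase)

print("php" in bloom)  # True: added items are always reported as present
```

A larger array or more hash functions lowers the false-positive rate at the cost of memory and time per lookup.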

Code:

from bloom_filter import BloomFilter  # pip install bloom-filter
from nltk.util import ngrams
import re


def clean(s):
    s = s.replace(",", " ").replace("-", " ").replace(".", " ").lower()
    return re.sub(r'\s+', ' ', s)


def clean_wo_space(s):
    s = s.replace(",", " ").replace("-", " ").replace(".", " ").lower()
    return re.sub(r'\s+', '', s)


def _initialize_bloom(phrases: list):
    bloom = BloomFilter(max_elements=1000, error_rate=0.1)
    for phrase in phrases:
        bloom.add(clean_wo_space(phrase))
    return bloom


def main():
    phrases_repo = ['C#', 'Django', 'Software-Engineering', 'C', 'PHP', 'Oracle Cloud', 'React', 'Flask',
                    'IT-Security market', 'Databases and Queries']

    input_string = "Hello, I need to find php, software-engineering, html, security and safety things or even Oracle in your dataset. C# should be another opetion, databases and queries"

    initialized_bloom = _initialize_bloom(phrases_repo)

    n_grams = set([' '.join(gram) for n in range(1, 4)
                   for gram in ngrams(clean(input_string).split(), n)])

    matches = [i for i in n_grams if clean_wo_space(i) in initialized_bloom]

    print(matches)  # e.g. ['c#', 'databases and queries', 'php', 'software engineering'] (order may vary)


if __name__ == '__main__':
    main()

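Approach:

On application start, iterate over your repository array of to_match keywords and pass each one through a normalization method: lowercase the words, remove special characters, and so on.
Train a bloom filter object, which stores your normalized_to_match values as hashes.
Now that the bloom filter is ready, take the input string and pass it through the same normalization method (so that both sides have the same format).
Convert the normalized input into n-grams, where n is the maximum number of words in any phrase you want to match.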

to_match = ["hello", "world", "Foo Bar", "Hey there it's me"] # n would be 4
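The step above gives you every sequence of consecutive words that could be present.
Now just iterate over your n_grams_array and check each entry against the bloom filter. If it returns true, the phrase is present (subject to the filter's false-positive rate).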
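The steps above can be sketched without any third-party package. Note that this sketch swaps the Bloom filter for a plain dict, which gives exact lookups and is usually fine for a few thousand phrases. The helper names `normalize` and `word_ngrams` and the sample sentence are assumptions made up for illustration.

```python
import re


def normalize(text):
    # Lowercase, then turn commas, periods and hyphens into spaces.
    return re.sub(r"[,.\-]+", " ", text.lower()).split()


def word_ngrams(tokens, max_n):
    # Yield every run of 1..max_n consecutive tokens.
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


to_match = ["hello", "world", "Foo Bar", "Hey there it's me"]

# Map each normalized phrase back to its original spelling.
lookup = {tuple(normalize(phrase)): phrase for phrase in to_match}
max_n = max(len(key) for key in lookup)  # 4 for this list

text = "Well, hello there. Foo bar said hey there it's me."
found = {lookup[gram] for gram in word_ngrams(normalize(text), max_n)
         if gram in lookup}

print(sorted(found))  # ['Foo Bar', "Hey there it's me", 'hello']
```

Capping n at the longest phrase keeps the number of candidates roughly proportional to the input length times max_n.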

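Advantages of this approach:

Bloom filter lookups are very fast, especially on large data sets.
Some flexibility for fuzziness (not true fuzzy matching): you can configure a lower confidence, that is, a higher error_rate, to get looser matches, which are really false positives.

【Comments】:

【Solution 2】:

If you want to avoid re, you can do most of this in pure Python. For a list of a few thousand words, this will be plenty fast.

Basic plan: strip the punctuation, tokenize everything, and use sets for matching. For a small application, you could adjust the tokens taken from the keywords to leave out things like a lookup on "and".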

my_string = "Hello, I need to find php, software-engineering, html, security and safety things or even Oracle in your dataset. #C should be another opetion, databases and queries"
my_list = ['C#', 'Django', 'Software-Engineering', 'C', 'PHP', 'Oracle Cload', 'React', 'Flask', 'IT-Security market', 'Databases and Queries']

# make table of tokens : phrases
keywords = {}
for word in my_list:
    # split each word into tokens
    tokens = {w.lower() for w in word.replace('-',' ').split()}
    for t in tokens:
        keywords[t] = word


# tokenize the string my_string
# note:  this is specifically tailored to your input with commas and hyphens, you may need to
#        make this more universal
my_string_tokens = {t.lower() for t in my_string.replace(',','').replace('-',' ').split()}

# now you can just intersect the sets, which is much more efficient than nested looping
matches = my_string_tokens & set(keywords.keys())

for match in matches:  # do what you want here...
    print(f'token: {match:20s}->  {keywords[match]}')

Output:

token: queries             ->  Databases and Queries
token: php                 ->  PHP
token: oracle              ->  Oracle Cload
token: engineering         ->  Software-Engineering
token: databases           ->  Databases and Queries
token: software            ->  Software-Engineering
token: and                 ->  Databases and Queries
token: security            ->  IT-Security market
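The output above includes a match on "and". One way to leave such tokens out, as the plan suggests, is a small stop-word set. The contents of `STOP_WORDS` below are an assumption to tune for your own data, and this variant also keeps every phrase that shares a token instead of only the last one seen.

```python
my_string = "Hello, I need to find php, software-engineering, html, security and safety things or even Oracle in your dataset. #C should be another opetion, databases and queries"
my_list = ['C#', 'Django', 'Software-Engineering', 'C', 'PHP', 'Oracle Cload', 'React', 'Flask', 'IT-Security market', 'Databases and Queries']

# Hypothetical stop-word list: tokens too common to count as a match.
STOP_WORDS = {"and", "or", "the", "it"}

# Map each token to every phrase that contains it.
keywords = {}
for phrase in my_list:
    for token in phrase.replace('-', ' ').lower().split():
        if token not in STOP_WORDS:
            keywords.setdefault(token, set()).add(phrase)

# Same tokenization of the input as in the answer above.
my_string_tokens = set(my_string.replace(',', '').replace('-', ' ').lower().split())

# keywords.keys() is set-like, so it can be intersected directly.
matched = set()
for token in my_string_tokens & keywords.keys():
    matched |= keywords[token]

print(sorted(matched))
# ['Databases and Queries', 'IT-Security market', 'Oracle Cload', 'PHP', 'Software-Engineering']
```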

【Comments】:

【Solution 3】:

import re

my_string = """
  Hello, I need to find php, software-engineering,
  html, security and safety things or even Oracle in your
  dataset. #C should be another opetion, databases and queries
"""

my_list = [
  'C#', 'Django', 'Software-Engineering',
  'C', 'PHP', 'Oracle Cload', 'React',
  'Flask', 'IT-Security market',
  'Databases and Queries'
]

result = set()

for list_item in my_list:
  # re.escape keeps characters such as '+' or '.' from being read as regex syntax
  if re.search(re.escape(list_item), my_string, re.IGNORECASE):
    result.add(list_item)

print(result)

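【Comments】:

【Solution 4】:

You should consider cleaning up your my_list, since it contains common words such as "and" and "IT", and it also packs several keywords into one entry, as in "Databases and Queries".

The problem with your code is that it looks for exact matches in "my_list". If you want to find words that "match", you need to iterate over each substring of the entries in my_list.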

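import re

extracted_inputs = set()

for list_keyword in my_list:
  # treat hyphenated keywords as separate words
  keywords = list_keyword.replace("-", " ")
  for item in keywords.split():
    # escape each word so regex metacharacters are matched literally
    if re.search(re.escape(item), my_string, re.IGNORECASE):
      extracted_inputs.add(list_keyword)
      break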

Result:

{'C', 'IT-Security market', 'PHP', 'Software-Engineering', 'Databases and Queries', 'Oracle Cload'}

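【Comments】:

【Solution 5】:

We can split the keywords in your list and search for each element in your string.lower(). Since there are hyphens, we also need to check for them and split on them.

I also assume you forgot to add a , after Django in the list.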

my_string = "Hello, I need to find php, software-engineering, html, security and safety things or even Oracle in your dataset. #C should be another opetion, databases and queries"
my_list = ['C#', 'Django', 'Software-Engineering', 'C', 'PHP', 'Oracle Cload', 'React', 'Flask', 'IT-Security market', 'Databases and Queries']

result =[]

for idx, keyword in enumerate(my_list):
    if '-' in keyword:
        keyword = keyword.split('-')
    else:
        keyword = keyword.split()
    for word in keyword:
        if word.lower() in my_string.lower() and my_list[idx] not in result and len(word) > 1:
            result.append(my_list[idx])


result
['Software-Engineering', 'PHP', 'Oracle Cload', 'IT-Security market', 'Databases and Queries']

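【Comments】:

Thanks for your answer. However, I found a bug on my first try. For example, if you add me to the string and add 'T-Shirts & Merchandise' to the list, then 'T-Shirts & Merchandise' is returned in the output, but it should not be, because me is not equal to **Merchandise**.
You can fix that by adding 'and len(word) > 1', since we don't want to search for single letters. I have updated the code in my answer.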
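The false match discussed in these comments comes from substring search: a short word such as "T" or "IT" is found inside longer words. A possible alternative to the length check is to match whole words only. The sketch below is an assumption-laden variant, not the answer's method: the helper names `contains_word` and `match_keywords` and the sample text are invented for illustration. It uses lookarounds instead of `\b` so that tokens such as `C#` still work.

```python
import re


def contains_word(word, text):
    # Match only when the word is not glued to other word characters.
    pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
    return re.search(pattern, text, re.IGNORECASE) is not None


def match_keywords(keywords, text):
    result = []
    for keyword in keywords:
        # Split on hyphens and whitespace, as in the answer above.
        parts = keyword.replace("-", " ").split()
        if any(contains_word(part, text) for part in parts):
            result.append(keyword)
    return result


text = "Send me the php and security docs"
print(match_keywords(['T-Shirts & Merchandise', 'PHP', 'IT-Security market'], text))
# ['PHP', 'IT-Security market']
```

Here 'T-Shirts & Merchandise' is no longer returned, because neither "T" nor "Merchandise" occurs as a whole word, while a single-letter keyword such as 'C' can still match a standalone "C".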