How to efficiently search for a similar substring in a large text in Python? Answers

【Question title】: How to efficiently search for similar substring in a large text python?
【Posted】: 2022-12-18 09:11:29
【Question】:

Let me try to explain my problem with an example. I have a large corpus and a substring, as shown below:

corpus = """very quick service, polite workers(cory, i think that's his name), i basically just drove there and got a quote(which seems to be very fair priced), then dropped off my car 4 days later(because they were fully booked until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now."""

substring = """until then then i dropped off my car on my appointment day then the same day the shop called me and notified me that the the job is done i can go pickup my car when i go checked out my car i was amazed by the job they ve done to it and they even gave that dirty car a wash prob even waxed it or coated it cuz it was shiny as hell tires shine mats were vacuumed too i gave them a dirty broken car they gave me back a what seems like a brand new car i m happy with the result and i will def have all my car s work done by this place from now"""

The substring and the corpus text are very similar, but not identical.

If I do something like this,

import re
re.search(substring, corpus, flags=re.I) # this will fail substring is not exact but rather very similar

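In the corpus, the substring looks like the text below, which differs slightly from the substring I have, so the regular expression search fails. Can someone suggest a good alternative for similar-substring lookup?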

until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now

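I did try the difflib library, but it did not satisfy my use case.

Some background:

The substring I have now was obtained some time ago from the pre-processed corpus, using the regex re.sub("[^a-zA-Z]", " ", corpus).

But now I need to use that substring to do a reverse lookup in the corpus text and find its start and end indices in the corpus.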

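【Comments】:

If they differ only in special characters, you can remove those and match afterwards - reduced_string = re.sub("[^A-Z]", "", corpus, 0, re.IGNORECASE)
@Chris My use case is that I need to find the substring in the corpus without removing the special characters from the corpus text. The substring I have was obtained from the pre-processed corpus with the regex re.sub("[^a-zA-Z]", " ", corpus); what I need is a reverse lookup.
You don't need to remove the special characters. You can build a map of those characters and their indices, replace them the same way you did when you obtained the substring, search for the substring, get the start and end indices, and then put the special characters back from the map.
@IgorMoraru Can you show with an example, using my data, how to implement this?
@user_12 I updated my answer for your edited question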

Tags: python python-3.x string

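【Solution 1】:

If the strings differ by even a single character, you will not find an exact match, but you can find a similar string.

So here I use the built-in difflib SequenceMatcher to check the similarity of two different strings.

If you need the index of where the substring starts in the corpus, that can easily be added. If you have any questions, please comment.

Hope it helps. - Adapted to your edited question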

import re
from difflib import SequenceMatcher


def similarity(a, b) -> float:
    """Return similarity between 2 strings"""
    return SequenceMatcher(None, a, b).ratio()


def find_similar_match(a, b, threshold=0.7) -> list:
    """Find string b in a - while the strings being different"""
    # Note: threshold is accepted but not used by this implementation.
    corpus_lst = a.split()
    substring_lst = b.split()
    # Candidate boundaries: corpus words that equal the first/last substring
    # word once every non-letter character has been stripped.
    start_indices = [i for i, x in enumerate(corpus_lst) if re.sub("[^a-zA-Z]", "", x) == substring_lst[0]]
    end_indices = [i for i, x in enumerate(corpus_lst) if re.sub("[^a-zA-Z]", "", x) == substring_lst[-1]]

    max_sim = 0
    result = []  # stays empty when no candidate pair scores above 0
    for start_idx in start_indices:
        for end_idx in end_indices:
            # The slice stops before end_idx, so the word at end_idx is excluded.
            corpus_search_string = " ".join(
                corpus_lst[start_idx: end_idx])
            sim = similarity(corpus_search_string, " ".join(substring_lst))
            if sim > max_sim:
                max_sim = sim  # remember the best score seen so far
                print(f"Found a match with similarity : {sim}")
                print([start_idx, end_idx])
                result = [start_idx, end_idx]

    return result

Result of calling find_similar_match(corpus, substring):

Found a match with similarity : 0.8429752066115702
[38, 156]
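
The indices above are word positions in corpus.split(), not character offsets. The index-mapping idea raised in the question comments could be sketched roughly as follows. This is only a sketch under an assumption taken from the question: the substring was produced by re.sub("[^a-zA-Z]", " ", corpus), apparently followed by whitespace collapsing. The function name find_span and its internals are hypothetical, not part of any answer.

```python
import re


def find_span(corpus: str, substring: str):
    """Return (start, end) character offsets into corpus, or None.

    Hypothetical helper: normalizes the corpus the same way the substring
    was assumed to be produced, while remembering original offsets.
    """
    norm_chars = []  # normalized corpus, built character by character
    index_map = []   # index_map[k] = offset in corpus of norm_chars[k]
    for i, ch in enumerate(corpus):
        if ch.isascii() and ch.isalpha():
            norm_chars.append(ch.lower())
            index_map.append(i)
        elif norm_chars and norm_chars[-1] != " ":
            # Collapse any run of non-letters into a single space.
            norm_chars.append(" ")
            index_map.append(i)
    normalized = "".join(norm_chars).rstrip()

    # Normalize the needle the same way (letters only, single spaces).
    needle = " ".join(re.sub("[^a-zA-Z]", " ", substring).lower().split())
    if not needle:
        return None

    # Pad with spaces so the match has to start and end on word boundaries.
    pos = (" " + normalized + " ").find(" " + needle + " ")
    if pos == -1:
        return None
    start = index_map[pos]
    end = index_map[pos + len(needle) - 1] + 1  # exclusive end offset
    return start, end
```

With the data from the question, corpus[start:end] would then be the original text including its punctuation. This only works when the substring is an exact match after normalization; it is a single linear pass plus one str.find, so it should scale to large texts better than comparing candidate windows.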

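【Comments】:

A small performance note: for large inputs, repeatedly looking up the cached compiled regex (at the Python layer), rather than precompiling and using the compiled regex (C accelerated), can make a noticeable difference in cost. You may want to do nonalpha = re.compile(r"[^a-zA-Z]") at the top of the function, then replace re.sub("[^a-zA-Z]", "", x) with nonalpha.sub("", x). You will also want to hoist " ".join(substring_lst) out of the loop (it never changes, but you may rebuild it many times).
@ShadowRanger Thanks.
@Chris Thanks a lot. This seems to work for my example; I am not sure how efficient it is on larger data, and I still have to test whether it fails in any case. I will keep this question open for now, in case a more efficient approach comes up.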
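
The suggestions in the performance comment above might look roughly like this when applied. It is a sketch, not the answerer's code: the name find_similar_match_fast is hypothetical, each corpus word is cleaned only once, and (unlike the answer) the end index is treated as inclusive.

```python
import re
from difflib import SequenceMatcher

NONALPHA = re.compile(r"[^a-zA-Z]")  # compiled once, reused for every word


def find_similar_match_fast(corpus, substring):
    """Hypothetical variant: return ((start, end), similarity) or (None, 0.0).

    start and end are word indices into corpus.split(); end is inclusive.
    """
    corpus_words = corpus.split()
    cleaned = [NONALPHA.sub("", w) for w in corpus_words]  # strip each word once
    sub_words = substring.split()
    if not sub_words:
        return None, 0.0
    target = " ".join(sub_words)  # built once, outside the loops
    starts = [i for i, w in enumerate(cleaned) if w == sub_words[0]]
    ends = [i for i, w in enumerate(cleaned) if w == sub_words[-1]]

    best_span, best_sim = None, 0.0
    for s in starts:
        for e in ends:
            if e < s:
                continue  # an end before the start cannot be a match
            candidate = " ".join(corpus_words[s:e + 1])
            sim = SequenceMatcher(None, candidate, target).ratio()
            if sim > best_sim:
                best_span, best_sim = (s, e), sim
    return best_span, best_sim
```

The number of SequenceMatcher comparisons still grows with the product of start and end candidates, so common first or last words can make this slow on a large corpus.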

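【Solution 2】:

Not exactly the best solution, but this might help.

from difflib import SequenceMatcher

# find_longest_match returns the longest block that is identical in both strings
match = SequenceMatcher(None, corpus, substring).find_longest_match(0, len(corpus), 0, len(substring))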

print(match)
print(corpus[match.a:match.a + match.size])
print(substring[match.b:match.b + match.size])
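
find_longest_match only reports a single exact block. One possible extension, sketched here under assumptions, is to take all matching blocks and use the first and last sufficiently long ones as an approximate span in the corpus. The helper name approximate_span and the min_block cutoff are hypothetical; autojunk=False switches off difflib's heuristic that ignores very frequent elements in sequences of 200 or more items.

```python
from difflib import SequenceMatcher


def approximate_span(corpus, substring, min_block=3):
    """Hypothetical helper: return approximate (start, end) offsets or None."""
    matcher = SequenceMatcher(None, corpus, substring, autojunk=False)
    # Keep only blocks of at least min_block characters; this also drops
    # the zero-length sentinel that get_matching_blocks appends at the end.
    blocks = [b for b in matcher.get_matching_blocks() if b.size >= min_block]
    if not blocks:
        return None
    start = blocks[0].a                    # corpus offset of the first block
    end = blocks[-1].a + blocks[-1].size   # exclusive end of the last block
    return start, end
```

This is a heuristic: stray short blocks elsewhere in a long corpus can stretch the span, and SequenceMatcher can take quadratic time in the worst case, so it is probably best applied to a pre-selected window rather than the whole text.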

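【Comments】:

Not exactly what I was looking for; I tried that. I want to find the start and end indices of the substring in my corpus. But you can't use re.search, since it's not an exact match but a similar-substring search.
Yes, @Chris has a better solution based on the updated question.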

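【Solution 3】:

This might help you gauge the similarity of the two strings based on the percentage of words in the corpus that also appear in the substring.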

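The code below is intended to:

use the substring as a bag of words
find those words in the corpus (and, if found, turn them into uppercase)
display the modifications in the corpus
calculate the percentage of modified words in the corpus
display the number of substring words that are not in the corpus

This way you can see which substring words were matched in the corpus, and then determine the percentage similarity by words (though not necessarily in the right order).

Code: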

import re
corpus = """very quick service, polite workers(cory, i think that's his name), i basically just drove there and got a quote(which seems to be very fair priced), then dropped off my car 4 days later(because they were fully booked until then), then i dropped off my car on my appointment day, then the same day the shop called me and notified me that the the job is done i can go pickup my car. when i go checked out my car i was amazed by the job they've done to it, and they even gave that dirty car a wash( prob even waxed it or coated it, cuz it was shiny as hell), tires shine, mats were vacuumed too. i gave them a dirty, broken car, they gave me back a what seems like a brand new car. i'm happy with the result, and i will def have all my car's work done by this place from now."""

substring = """until then then i dropped off my car on my appointment day then the same day the shop called me and notified me that the the job is done i can go pickup my car when i go checked out my car i was amazed by the job they ve done to it and they even gave that dirty car a wash prob even waxed it or coated it cuz it was shiny as hell tires shine mats were vacuumed too i gave them a dirty broken car they gave me back a what seems like a brand new car i m happy with the result and i will def have all my car s work done by this place from now"""

sub_list = set(substring.split(" "))
unused_words = []
for word in sub_list:
    if word in corpus:
        # Match the word only as a whole word, then uppercase it in the corpus
        r = r"\b" + re.escape(word) + r"\b"
        ru = f"{word.upper()}"
        corpus = re.sub(r, ru, corpus)
    else:
        unused_words.append(word)

print(corpus)

# Count the lowercase (unmatched) and uppercase (matched) runs in the corpus
lower_strings = len(re.findall("[a-z']+", corpus))
upper_strings = len(re.findall("[A-Z']+", corpus))
print(f"\nWords Matched = {(upper_strings)/(upper_strings + lower_strings)*100:.1f}%")
print(f"Unused Substring words: {len(unused_words)}")

Output:

very quick service, polite workers(cory, I think THAT'S his name), I
basically just drove there AND got A quote(which SEEMS TO be very fair
priced), THEN DROPPED OFF MY CAR 4 days later(because THEY WERE fully
booked UNTIL THEN), THEN I DROPPED OFF MY CAR ON MY APPOINTMENT DAY, THEN
THE SAME DAY THE SHOP CALLED ME AND NOTIFIED ME THAT THE THE JOB IS DONE I
CAN GO PICKUP MY CAR. WHEN I GO CHECKED OUT MY CAR I WAS AMAZED BY THE JOB
THEY'VE DONE TO IT, AND THEY EVEN GAVE THAT DIRTY CAR A WASH( PROB EVEN
WAXED IT OR COATED IT, CUZ IT WAS SHINY AS HELL), TIRES SHINE, MATS WERE 
VACUUMED TOO. I GAVE THEM A DIRTY, BROKEN CAR, THEY GAVE ME BACK A WHAT 
SEEMS LIKE A BRAND NEW CAR. I'M HAPPY WITH THE RESULT, AND I WILL DEF HAVE 
ALL MY CAR'S WORK DONE BY THIS PLACE FROM NOW.

Words Matched = 82.1%
Unused Substring words: 0

【Comments】: