Scalable solution to check if a string is a substring of a large number of strings: answers

【Question title】: Scalable solution to check if a string is a substring of a large number of strings
【Posted】: 2021-03-14 04:17:51
【Question】:

How can I efficiently check whether a string is a substring of any string in a given collection?

The naive approach is:

def is_substring_of_any(phrase, known_phrases):
    for known_phrase in known_phrases:
        if phrase in known_phrase:
            return True
    return False

However, it scales poorly to a large number of strings:

known_phrases = [phrase_generator(15) for _ in tqdm(range(10 ** 6), desc="Generating known phrases")]
phrases_to_check = [phrase_generator(7) for _ in tqdm(range(10 ** 5), desc="Generating phrases to check")]
for phrase in tqdm(phrases_to_check):
    is_substring_of_any(phrase, known_phrases)

It takes more than an hour to process them:

Generating known phrases: 100%|██████████| 1000000/1000000 [00:13<00:00, 73370.36it/s]
Generating phrases to check: 100%|██████████| 100000/100000 [00:00<00:00, 137534.76it/s]
  6%|▌         | 5991/100000 [04:23<1:11:20, 21.96it/s]

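Is there a way to make this run faster without setting up additional infrastructure?

【Comments】:

Can't you just do `if phrase in known_phrases`? I understand you have a for loop outside the is_substring_of_any function?
Are you sure the bottleneck is the search, and not tqdm or phrase_generator? What are you optimizing for: does known_phrases change often, or do you only need to check each phrase against a static, large set of known_phrases? What is the reason for not "setting up additional infrastructure"?
@marxmacher That would be an exact match. What I'm looking for is "bottle" matching any of the words in ["bottleneck", "cartoon", "xyz"]. "bottle" matches "bottleneck" because it is a substring of it.
My bad, then. Iterating over the list of strings and searching each one for the substring seems like the fastest way... so the bottleneck must be something else.
Oh, by "infrastructure" I thought you meant to include data structures. Is it OK if there is a large one-time setup cost to build a lookup structure? (The close vote is mine, because more information is needed to give the best answer.)

Tags: python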

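【Solution 1】:

Boyer-Moore string-search algorithm

The Boyer-Moore string-search algorithm is an efficient string-searching algorithm that serves as the standard benchmark in the practical string-search literature.

The Boyer-Moore algorithm searches for occurrences of a pattern P in a text T by performing explicit character comparisons at different alignments. Instead of a brute-force search of all alignments, Boyer-Moore uses information gained by preprocessing P to skip as many alignments as possible.

The key insight of the algorithm is that if the end of the pattern is compared to the text, jumps along the text can be made rather than checking every character of the text. This works because, when the pattern is lined up against the text, the last character of the pattern is compared to the corresponding character in the text. If the characters do not match, there is no need to continue searching backwards along the text. If the character in the text does not match any character in the pattern, the next character in the text to check is located n characters farther along, where n is the length of the pattern. If the character in the text does occur in the pattern, the pattern is shifted partially along the text to line up with the matching character, and the process is repeated. Jumping along the text to make comparisons, rather than checking every character in the text, reduces the number of comparisons that have to be made, which is the key to the algorithm's efficiency.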
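To make the jumping idea concrete, here is a minimal sketch of the Boyer-Moore-Horspool simplification, which uses only a last-occurrence shift table and none of the good-suffix machinery of the full algorithm. The function name `horspool_find` is an assumption made for illustration and is not part of the answer's code.

```python
def horspool_find(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1.

    Simplified Boyer-Moore-Horspool variant: compare from the end of the
    pattern and, after each attempt, jump by an amount that depends on the
    text character under the pattern's last position.
    """
    n, m = len(pattern), len(text)
    if n == 0:
        return 0
    if n > m:
        return -1
    # Distance from the rightmost occurrence of each character (excluding the
    # final pattern position) to the end of the pattern.
    shift = {ch: n - 1 - i for i, ch in enumerate(pattern[:-1])}
    k = n - 1  # index in text aligned with the last pattern character
    while k < m:
        i, h = n - 1, k
        while i >= 0 and pattern[i] == text[h]:  # compare right to left
            i -= 1
            h -= 1
        if i < 0:
            return k - n + 1
        # A character that never occurs in the pattern lets us jump a full n.
        k += shift.get(text[k], n)
    return -1


print(horspool_find("neck", "bottleneck"))
```

In this run the search inspects only text positions 3 and 7 before aligning on the match at index 6, rather than trying every alignment.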

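More formally, the algorithm begins at alignment k = n, so that the start of P is aligned with the start of T. Characters in P and T are then compared starting at index n in P and index k in T, moving backward: the strings are matched from the end of P to the start of P. The comparisons continue until either the beginning of P is reached (which means there is a match) or a mismatch occurs, upon which the alignment is shifted forward (to the right) by the maximum amount permitted by a number of rules. The comparisons are performed again at the new alignment, and the process repeats until the alignment is shifted past the end of T, which means no further matches will be found.

The shift rules are implemented as constant-time table lookups, using tables generated during the preprocessing of P.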
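As an illustration of such a lookup, the sketch below builds a bad-character table keyed by character and shows how a shift is read from it. It assumes a dict-based layout rather than the fixed 26-letter array used in the implementation that follows, and the names `build_bad_char_table` and `bad_char_shift` are hypothetical.

```python
from typing import Dict, List


def build_bad_char_table(pattern: str) -> Dict[str, List[int]]:
    """table[c][i] is the rightmost index j < i with pattern[j] == c, or -1."""
    table: Dict[str, List[int]] = {c: [-1] for c in set(pattern)}
    last = {c: -1 for c in table}
    for i, ch in enumerate(pattern):
        last[ch] = i
        for c in table:  # record the current rightmost position of every character
            table[c].append(last[c])
    return table


def bad_char_shift(table: Dict[str, List[int]], text_char: str, i: int) -> int:
    """Shift suggested by the bad-character rule for a mismatch at pattern index i."""
    row = table.get(text_char)
    previous = row[i] if row is not None else -1  # -1: char absent left of i
    return i - previous


table = build_bad_char_table("abcab")
print(table["a"])                      # [-1, 0, 0, 0, 3, 3]
print(bad_char_shift(table, "c", 4))   # 2: line up the 'c' at index 2
print(bad_char_shift(table, "z", 4))   # 5: 'z' is not in the pattern, skip past it
```

The table costs O(len(pattern) × distinct characters) memory, which is the price of answering each mismatch with a single index operation.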

Python implementation:

from typing import List

# This version is sensitive to the English alphabet in ASCII for case-insensitive matching.
# To remove this feature, define alphabet_index as ord(c), and set ALPHABET_SIZE to 256
# or any maximum code point you want. For Unicode you may want to match in UTF-8
# bytes instead of creating a 0x10FFFF-sized table.

ALPHABET_SIZE = 26

def alphabet_index(c: str) -> int:
    """Return the index of the given character in the English alphabet, counting from 0."""
    val = ord(c.lower()) - ord("a")
    assert val >= 0 and val < ALPHABET_SIZE
    return val

def match_length(S: str, idx1: int, idx2: int) -> int:
    """Return the length of the match of the substrings of S beginning at idx1 and idx2."""
    if idx1 == idx2:
        return len(S) - idx1
    match_count = 0
    while idx1 < len(S) and idx2 < len(S) and S[idx1] == S[idx2]:
        match_count += 1
        idx1 += 1
        idx2 += 1
    return match_count

def fundamental_preprocess(S: str) -> List[int]:
    """Return Z, the Fundamental Preprocessing of S.

    Z[i] is the length of the substring beginning at i which is also a prefix of S.
    This pre-processing is done in O(n) time, where n is the length of S.
    """
    if len(S) == 0:  # Handles case of empty string
        return []
    if len(S) == 1:  # Handles case of single-character string
        return [1]
    z = [0 for x in S]
    z[0] = len(S)
    z[1] = match_length(S, 0, 1)
    for i in range(2, 1 + z[1]):  # Optimization from exercise 1-5
        z[i] = z[1] - i + 1
    # Defines lower and upper limits of z-box
    l = 0
    r = 0
    for i in range(2 + z[1], len(S)):
        if i <= r:  # i falls within existing z-box
            k = i - l
            b = z[k]
            a = r - i + 1
            if b < a:  # b ends within existing z-box
                z[i] = b
            else:  # b ends at or after the end of the z-box, we need to do an explicit match to the right of the z-box
                z[i] = a + match_length(S, a, r + 1)
                l = i
                r = i + z[i] - 1
        else:  # i does not reside within existing z-box
            z[i] = match_length(S, 0, i)
            if z[i] > 0:
                l = i
                r = i + z[i] - 1
    return z

def bad_character_table(S: str) -> List[List[int]]:
    """
    Generates R for S, which is an array indexed by the position of some character c in the
    English alphabet. At that index in R is an array of length |S|+1, specifying for each
    index i in S (plus the index after S) the next location of character c encountered when
    traversing S from right to left starting at i. This is used for a constant-time lookup
    for the bad character rule in the Boyer-Moore string search algorithm, although it has
    a much larger size than non-constant-time solutions.
    """
    if len(S) == 0:
        return [[] for a in range(ALPHABET_SIZE)]
    R = [[-1] for a in range(ALPHABET_SIZE)]
    alpha = [-1 for a in range(ALPHABET_SIZE)]
    for i, c in enumerate(S):
        alpha[alphabet_index(c)] = i
        for j, a in enumerate(alpha):
            R[j].append(a)
    return R

def good_suffix_table(S: str) -> List[int]:
    """
    Generates L for S, an array used in the implementation of the strong good suffix rule.
    L[i] = k, the largest position in S such that S[i:] (the suffix of S starting at i) matches
    a suffix of S[:k] (a substring in S ending at k). Used in Boyer-Moore, L gives an amount to
    shift P relative to T such that no instances of P in T are skipped and a suffix of P[:L[i]]
    matches the substring of T matched by a suffix of P in the previous match attempt.
    Specifically, if the mismatch took place at position i-1 in P, the shift magnitude is given
    by the equation len(P) - L[i]. In the case that L[i] = -1, the full shift table is used.
    Since only proper suffixes matter, L[0] = -1.
    """
    L = [-1 for c in S]
    N = fundamental_preprocess(S[::-1])  # S[::-1] reverses S
    N.reverse()
    for j in range(0, len(S) - 1):
        i = len(S) - N[j]
        if i != len(S):
            L[i] = j
    return L

def full_shift_table(S: str) -> List[int]:
    """
    Generates F for S, an array used in a special case of the good suffix rule in the Boyer-Moore
    string search algorithm. F[i] is the length of the longest suffix of S[i:] that is also a
    prefix of S. In the cases it is used, the shift magnitude of the pattern P relative to the
    text T is len(P) - F[i] for a mismatch occurring at i-1.
    """
    F = [0 for c in S]
    Z = fundamental_preprocess(S)
    longest = 0
    for i, zv in enumerate(reversed(Z)):
        longest = max(zv, longest) if zv == i + 1 else longest
        F[-i - 1] = longest
    return F

def string_search(P, T) -> List[int]:
    """
    Implementation of the Boyer-Moore string search algorithm. This finds all occurrences of P
    in T, and incorporates numerous ways of pre-processing the pattern to determine the optimal
    amount to shift the string and skip comparisons. In practice it runs in O(m) (and even
    sublinear) time, where m is the length of T. Both P and T must consist of ASCII letters
    only (alphabet_index asserts on anything else, including spaces). The shift tables fold
    case, but characters are compared with ==, so the match itself is case-sensitive.
    """
    if len(P) == 0 or len(T) == 0 or len(T) < len(P):
        return []

    matches = []

    # Preprocessing
    R = bad_character_table(P)
    L = good_suffix_table(P)
    F = full_shift_table(P)

    k = len(P) - 1      # Represents alignment of end of P relative to T
    previous_k = -1     # Represents alignment in previous phase (Galil's rule)
    while k < len(T):
        i = len(P) - 1  # Character to compare in P
        h = k           # Character to compare in T
        while i >= 0 and h > previous_k and P[i] == T[h]:  # Matches starting from end of P
            i -= 1
            h -= 1
        if i == -1 or h == previous_k:  # Match has been found (Galil's rule)
            matches.append(k - len(P) + 1)
            k += len(P) - F[1] if len(P) > 1 else 1
        else:  # No match, shift by max of bad character and good suffix rules
            char_shift = i - R[alphabet_index(T[h])][i]
            if i + 1 == len(P):  # Mismatch happened on first attempt
                suffix_shift = 1
            elif L[i + 1] == -1:  # Matched suffix does not appear anywhere in P
                suffix_shift = len(P) - F[i + 1]
            else:               # Matched suffix appears in P
                suffix_shift = len(P) - 1 - L[i + 1]
            shift = max(char_shift, suffix_shift)
            previous_k = k if shift >= i + 1 else previous_k  # Galil's rule
            k += shift
    return matches
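The answer does not show how this search plugs back into the question's `is_substring_of_any`. A minimal sketch, assuming any function with the same `(P, T) -> List[int]` contract can be passed in: the `search` parameter and the stand-in `find_all_occurrences` (built on `str.find`) are hypothetical and exist only so the snippet runs on its own. To use the implementation above, pass `search=string_search`, keeping in mind that its `alphabet_index` asserts on anything other than ASCII letters.

```python
from typing import Callable, Iterable, List


def find_all_occurrences(P: str, T: str) -> List[int]:
    """Stand-in with the same contract as string_search: all start indices of P in T."""
    if not P:
        return []
    hits = []
    start = T.find(P)
    while start != -1:
        hits.append(start)
        start = T.find(P, start + 1)  # allow overlapping matches
    return hits


def is_substring_of_any(
    phrase: str,
    known_phrases: Iterable[str],
    search: Callable[[str, str], List[int]] = find_all_occurrences,
) -> bool:
    """Return True as soon as `search` reports a hit in one of the known phrases."""
    return any(search(phrase, known) for known in known_phrases)


words = ["bottleneck", "cartoon", "xyz"]
print(is_substring_of_any("bottle", words))  # True
print(is_substring_of_any("glass", words))   # False
```

This still makes one pass over `known_phrases` per query, so it changes only the per-string search, not the number of strings scanned.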

【Discussion】: