Find maximum length of all n-word-length substrings shared by two strings: answers

【Question title】: Find maximum length of all n-word-length substrings shared by two strings
【Posted】: 2013-12-14 13:49:44
【Question】:

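I'm working on a Python script that finds the (longest possible) length of every n-word-length substring shared by two strings, disregarding trailing punctuation. Given two strings:

"this is a sample string"

"this is also a sample string"

I want the script to identify that these strings share a 2-word sequence ("this is") followed by a 3-word sequence ("a sample string"). Here is my current approach: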

a = "this is a sample string"
b = "this is also a sample string"

aWords = a.split()
bWords = b.split()

#create counters to keep track of position in string
currentA = 0
currentB = 0

#create counter to keep track of longest sequence of matching words
matchStreak = 0

#create a list that contains all of the matchstreaks found
matchStreakList = []

#create binary switch to control the use of while loop
continueWhileLoop = 1

for word in aWords:
    currentA += 1

    if word == bWords[currentB]:
        matchStreak += 1

        #to avoid index errors, check to make sure we can move forward one unit in the b string before doing so
        if currentB + 1 < len(bWords):
            currentB += 1

        #in case we have two identical strings, check to see if we're at the end of string a. If we are, append value of match streak to list of match streaks
        if currentA == len(aWords):
            matchStreakList.append(matchStreak)

    elif word != bWords[currentB]:

        #because the streak is broken, check to see if the streak is >= 1. If it is, append the streak counter to out list of streaks and then reset the counter
        if matchStreak >= 1:
            matchStreakList.append(matchStreak)
        matchStreak = 0

        while word != bWords[currentB]:

            #the two words don't match. If you can move b forward one word, do so, then check for another match
            if currentB + 1 < len(bWords):
                currentB += 1

            #if you have advanced b all the way to the end of string b, then rewind to the beginning of string b and advance a, looking for more matches
            elif currentB + 1 == len(bWords):
                currentB = 0
                break

        if word == bWords[currentB]:
            matchStreak += 1

            #now that you have a match, check to see if you can advance b. If you can, do so. Else, rewind b to the beginning
            if currentB + 1 < len(bWords):
                currentB += 1
            elif currentB + 1 == len(bWords):

                #we're at the end of string b. If we are also at the end of string a, check to see if the value of matchStreak >= 1. If so, add matchStreak to matchStreakList
                if currentA == len(aWords):
                    matchStreakList.append(matchStreak)
                currentB = 0
                break

print(matchStreakList)

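This script correctly outputs the (maximum) lengths of the common word-length substrings (2, 3), and it has done so for every test so far. My question is: is there a pair of strings for which the approach above fails? More to the point: is there an existing Python library or well-known approach for finding the maximum length of all n-word-length substrings that two strings share?

[This question differs from the longest common substring problem, which is only a special case of what I'm looking for (I want to find all common substrings, not just the longest one). This SO post suggests that methods such as 1) cluster analysis, 2) edit-distance routines, and 3) longest common sequence algorithms might be suitable, but I didn't find any working solutions, and my problem is perhaps a little easier than the one mentioned in the link because I'm dealing with words bounded by whitespace.]

EDIT:

I'm starting a bounty on this question. In case it helps others, I want to clarify a few points. First, the helpful answer suggested below by @DhruvPathak does not find all maximally long n-word-length substrings shared by two strings. For example, suppose the two strings we are analyzing are:

"They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill"

and

"You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill"

In this case, the list of maximally long n-word-length substrings (disregarding trailing punctuation) is: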

all
are
white a sheet of
spotless paper when
first are born but
are to be scrawled
and blotted by every

Using the following routine:

#import required packages
import difflib

#define function we'll use to identify matches
def matches(first_string,second_string):
    s = difflib.SequenceMatcher(None, first_string,second_string)
    match = [first_string[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
    return match

a = "They all are white a sheet of spotless paper when they first are born but they are to be scrawled upon and blotted by every goose quill"
b = "You are all white, a sheet of lovely, spotless paper, when you first are born; but you are to be scrawled and blotted by every goose's quill"

a = a.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()
b = b.replace(",", "").replace(":","").replace("!","").replace("'","").replace(";","").lower()

print(matches(a, b))

one gets the output:

['e', ' all', ' white a sheet of', ' spotless paper when ', 'y', ' first are born but ', 'y', ' are to be scrawled', ' and blotted by every goose', ' quill']

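First, I'm not sure how to select from this list only the substrings that consist of whole words. Second, the list does not include "are", which is one of the desired maximally long common n-word-length substrings. Is there a method that finds all of the maximally long n-word-length substrings shared by these two strings ("You are all..." and "They all are...")?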

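【Question comments】:

Do you want the output to be a list of all the common substrings? Can they overlap?
Try diff-match-patch, a set of Google code for fuzzy string matching; there may be something in there you can use.
But "one two three" and "one two two three" have two maximal-length common substrings that overlap.
So are you mainly interested in finding the longest substring, or something more complicated? I was just thinking that if they can't overlap, the problem seems to get much more complicated, because making one substring longer may make another shorter, and you would need a scoring system to decide whether it is better to return two substrings of length 3, or one of length 2 and another of length 4.
I'm looking for all of the longest common substrings, whether or not they overlap. Sorry for misspeaking above. (I thought overlap would prevent substrings from being as long as possible, but @RemcoGerlich helped me see that this isn't the case.)

Tags: python string algorithm pattern-matching substring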

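【Solution 1】:

There are still ambiguities here, and I don't want to spend time arguing about them. But I think I can add something helpful anyway ;-)

I wrote Python's difflib.SequenceMatcher, and spent a lot of time finding expected-case fast ways to find longest common substrings. In theory, that should be done with "suffix trees", or the related "suffix arrays" augmented with "longest common prefix arrays" (the phrases in quotes are search terms if you want to Google for more). Those can solve the problem in worst-case linear time. But, as is sometimes the case, the worst-case linear-time algorithms are excruciatingly complex and delicate, and suffer from large constant factors. They can still pay off hugely if a given corpus is going to be searched many times, but that's not the typical case for Python's difflib, and it doesn't look like your case either.

Anyway, my contribution here is to rewrite SequenceMatcher's find_longest_match() method so that it returns all the (locally) maximal matches it finds along the way. Notes:

I'm going to use the to_words() function Raymond Hettinger provided, but without the conversion to lower case. Converting to lower case leads to output that isn't exactly what you said you wanted.
Nevertheless, as I already noted in a comment, this does output "quill", which isn't in your list of desired outputs. I have no idea why it isn't, since "quill" does appear in both inputs.

Here's the code: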

import re
def to_words(text):
    'Break text into a list of words without punctuation'
    return re.findall(r"[a-zA-Z']+", text)

def match(a, b):
    # Make b the longer list.
    if len(a) > len(b):
        a, b = b, a
    # Map each word of b to a list of indices it occupies.
    b2j = {}
    for j, word in enumerate(b):
        b2j.setdefault(word, []).append(j)
    j2len = {}
    nothing = []
    unique = set() # set of all results
    def local_max_at_j(j):
        # maximum match ends with b[j], with length j2len[j]
        length = j2len[j]
        unique.add(" ".join(b[j-length+1: j+1]))
    # during an iteration of the loop, j2len[j] = length of longest
    # match ending with b[j] and the previous word in a
    for word in a:
        # look at all instances of word in b
        j2lenget = j2len.get
        newj2len = {}
        for j in b2j.get(word, nothing):
            newj2len[j] = j2lenget(j-1, 0) + 1
        # which indices have not been extended?  those are
        # (local) maximums
        for j in j2len:
            if j+1 not in newj2len:
                local_max_at_j(j)
        j2len = newj2len
    # and we may also have local maximums ending at the last word
    for j in j2len:
        local_max_at_j(j)
    return unique

Then:

a = "They all are white a sheet of spotless paper " \
    "when they first are born but they are to be " \
    "scrawled upon and blotted by every goose quill"
b = "You are all white, a sheet of lovely, spotless " \
    "paper, when you first are born; but you are to " \
    "be scrawled and blotted by every goose's quill"

print(match(to_words(a), to_words(b)))

displays:

set(['all',
     'and blotted by every',
     'first are born but',
     'are to be scrawled',
     'are',
     'spotless paper when',
     'white a sheet of',
     'quill'])

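EDIT - how it works

A great many sequence matching and alignment algorithms are best understood as working on a 2-dimensional matrix, with rules for computing the matrix entries and later interpreting what the entries mean.

For input sequences a and b, picture a matrix M with len(a) rows and len(b) columns. In this application, we want M[i, j] to contain the length of the longest common contiguous subsequence ending with a[i] and b[j], and the computational rules are very easy:

M[i, j] = 0 if a[i] != b[j].
M[i, j] = M[i-1, j-1] + 1 if a[i] == b[j] (where we assume an out-of-bounds matrix reference silently returns 0).

Interpretation is also very easy in this case: there is a locally maximal non-empty match ending at a[i] and b[j], of length M[i, j], if and only if M[i, j] is non-zero but M[i+1, j+1] is either 0 or out of bounds.

You can use those rules to write very simple and compact code, with two loops, that computes M correctly for this problem. The downside is that the code will take (best, average and worst case) O(len(a) * len(b)) time and space.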
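To make the rules concrete, here is a minimal sketch that fills the dense matrix for two tiny word lists and reads off the locally maximal matches. It assumes Python 3; the helper names build_matrix and local_maxima are made up for this illustration, and the inputs are the "one two three" pair from the question comments.

```python
def build_matrix(a, b):
    """Return M as a list of rows; M[i][j] is the length of the longest
    common run of words ending at a[i] and b[j]."""
    M = [[0] * len(b) for _ in a]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                # An out-of-bounds reference counts as 0.
                prev = M[i - 1][j - 1] if i > 0 and j > 0 else 0
                M[i][j] = prev + 1
    return M

def local_maxima(a, b, M):
    """Collect runs that cannot be extended one step to the south-east."""
    found = set()
    for i in range(len(a)):
        for j in range(len(b)):
            in_bounds = i + 1 < len(a) and j + 1 < len(b)
            nxt = M[i + 1][j + 1] if in_bounds else 0
            if M[i][j] and not nxt:
                found.add(" ".join(a[i + 1 - M[i][j]: i + 1]))
    return found

a = "one two three".split()
b = "one two two three".split()
M = build_matrix(a, b)
for row in M:
    print(row)
# [1, 0, 0, 0]
# [0, 2, 1, 0]
# [0, 0, 0, 2]
print(sorted(local_maxima(a, b, M)))
# ['one two', 'two three']
```

The two overlapping matches both show up because each one ends a diagonal run that is not continued in the next row.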

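While it may be baffling at first, the code I posted is doing exactly the above. The connection is obscured because the code is heavily optimized, in several ways, for expected cases:

Instead of doing one pass to compute M and then another pass to interpret the results, computation and interpretation are interleaved in a single pass over a.
Because of that, the whole matrix doesn't need to be stored. Instead only the current row (newj2len) and the previous row (j2len) exist at the same time.
And because the matrix in this problem is usually mostly zeroes, a row here is represented sparsely, via a dict mapping column indices to non-zero values. Zero entries are "free", in that they're never stored explicitly.
When processing a row, there's no need to iterate over each column: the precomputed b2j dict tells us exactly the interesting column indices in the current row (those columns that match the current word from a).
Finally, and partly by accident, all the preceding optimizations conspire in such a way that there's never a need to know the current row's index, so we don't have to bother computing that either.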
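The sparse rows described above can be watched directly on a small input. This is only a sketch, assuming Python 3: trace_rows is a hypothetical helper that reuses the b2j, j2len and newj2len names from the optimized match(), but it records each row instead of collecting matches.

```python
def trace_rows(a, b):
    """Yield (word, sparse_row) for each word of a, where sparse_row maps
    a column index j to the non-zero matrix entry in that row."""
    # Map each word of b to the list of column indices where it occurs.
    b2j = {}
    for j, word in enumerate(b):
        b2j.setdefault(word, []).append(j)
    j2len = {}  # previous row, stored sparsely
    for word in a:
        newj2len = {}
        # Only columns whose word equals the current word can be non-zero.
        for j in b2j.get(word, []):
            newj2len[j] = j2len.get(j - 1, 0) + 1
        j2len = newj2len
        yield word, dict(j2len)

a = "one two three".split()
b = "one two two three".split()
for word, row in trace_rows(a, b):
    print(word, row)
# one {0: 1}
# two {1: 2, 2: 1}
# three {3: 2}
```

Each printed dict is one matrix row with its zero entries left out, and the row index itself is never needed.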

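EDIT - the dirt simple version

Here's code that implements the 2D matrix directly, with no attempt at optimization (other than that a Counter can often avoid explicitly storing 0 entries). It's extremely simple, short and easy: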

def match(a, b):
    from collections import Counter
    M = Counter()
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] == b[j]:
                M[i, j] = M[i-1, j-1] + 1
    unique = set()
    for i in range(len(a)):
        for j in range(len(b)):
            if M[i, j] and not M[i+1, j+1]:
                length = M[i, j]
                unique.add(" ".join(a[i+1-length: i+1]))
    return unique

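Of course ;-) that returns the same results as the optimized match() I posted at first.

EDIT - and another without a dict

Just for fun :-) If you have the matrix model down pat, this code will be easy to follow. A remarkable thing about this particular problem is that a matrix cell's value depends only on the values along the diagonal to the cell's northwest. So it's "good enough" just to traverse all the main diagonals, proceeding southeast from all the cells on the west and north borders. This way only a small, constant amount of working space is needed, regardless of the inputs' lengths: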

def match(a, b):
    from itertools import chain
    m, n = len(a), len(b)
    unique = set()
    for i, j in chain(((i, 0) for i in range(m)),
                      ((0, j) for j in range(1, n))):
        k = 0
        while i < m and j < n:
            if a[i] == b[j]:
                k += 1
            elif k:
                unique.add(" ".join(a[i-k: i]))
                k = 0
            i += 1
            j += 1
        if k:
            unique.add(" ".join(a[i-k: i]))
    return unique

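【Comments】:

This is fantastic. You just rocked my world. It took me a minute to figure out what was happening in M[i, j] = M[i-1, j-1] + 1 (mostly because I wasn't yet familiar with Counter()), but I just worked it out, and the whole procedure became so clear. Thank you for the wonderful answer and the excellent explanation.
:-) Glad you found it helpful! Now that you understand the matrix framework, you can go on to understand programs that compute Levenshtein edit distance (and many more). Note that there's nothing special about Counter here; for example, M = collections.defaultdict(int) would work as well. Both Counter() and defaultdict(int) return 0 when passed a key that doesn't exist, and that's the only special property the dirt-simple match() needs. For peak speed, though, you want to stick to standard dicts, as the optimized match() does.
@TimPeters Do you know whether pypy is compatible with your adapted match() (and perhaps difflib)? By the way, the first match() above should probably end with return unique ;)
@drevicko, no idea about pypy compatibility here. As for the first match(), return unique is its last statement - maybe you need to scroll down to see it?
Ah, indeed - the Mac's scrollbars don't appear until you scroll, so I assumed there weren't any. I'll give it a go with pypy and see what happens (: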

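【Solution 2】:

There are really four questions embedded in your post.

1) How do you split text into words?

There are many ways to do this, depending on what you count as a word, whether you care about case, whether contractions are allowed, and so on. A regular expression lets you implement your choice of word-splitting rules. The one I typically use is r"[a-z'\-]+". It catches contractions like don't and allows hyphenated words like mother-in-law.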
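As a quick illustration of that pattern, here is a minimal sketch. The function name tokenize is made up for this example, and it lowercases the text first because the character class only covers a-z.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters, apostrophes and hyphens."""
    return re.findall(r"[a-z'\-]+", text.lower())

words = tokenize("Don't tell my mother-in-law; she worries!")
print(words)
# ["don't", 'tell', 'my', 'mother-in-law', 'she', 'worries']
```

Punctuation such as ; and ! is simply skipped, while the apostrophe and the hyphens stay inside their words.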

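2) What data structure can speed up the search for common subsequences?

Create a location map showing where each word occurs. For example, in the sentence you should do what you like, the mapping for you is {"you": [0, 4]} because it appears twice, once at position zero and once at position four.

With a location map in hand, it is a simple matter to loop over the starting points and compare n-length subsequences.

3) How do I find common n-length subsequences?

Loop over all the words in one of the sentences. For each such word, find the places where it occurs in the other sequence (using the location map) and test whether the two n-length slices are equal.

4) How do I find the longest common subsequence?

The max() function finds a maximum value. It takes a key function such as len() to determine the basis for comparison.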
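A small sketch of max() with a key function, using three of the shared phrases from the question as sample data. One point worth keeping in mind (my reading, not something stated in the answer): on space-joined strings, key=len counts characters, so ranking by number of words needs a different key.

```python
candidates = ["are", "white a sheet of", "spotless paper when"]

# key=len compares the strings by character count.
by_chars = max(candidates, key=len)
print(by_chars)
# spotless paper when

# Counting words instead requires splitting each candidate first.
by_words = max(candidates, key=lambda s: len(s.split()))
print(by_words)
# white a sheet of
```

When every candidate already has the same number of words, as in a search for one fixed n, the two keys differ only in how ties are ranked.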

Here is some working code that you can customize to your own interpretation of the problem:

import re

def to_words(text):
    'Break text into a list of lowercase words without punctuation'
    return re.findall(r"[a-z']+", text.lower())

def starting_points(wordlist):
    'Map each word to a list of indices where the word appears'
    d = {}
    for i, word in enumerate(wordlist):
        d.setdefault(word, []).append(i)
    return d

def sequences_in_common(wordlist1, wordlist2, n=1):
    'Generate all n-length word groups shared by two word lists'
    starts = starting_points(wordlist2)
    for i, word in enumerate(wordlist1):
        seq1 = wordlist1[i: i+n]
        for j in starts.get(word, []):
            seq2 = wordlist2[j: j+n]
            if seq1 == seq2 and len(seq1) == n:
                yield ' '.join(seq1)

if __name__ == '__main__':

    t1 = "They all are white a sheet of spotless paper when they first are " \
         "born but they are to be scrawled upon and blotted by every goose quill"

    t2 = "You are all white, a sheet of lovely, spotless paper, when you first " \
         "are born; but you are to be scrawled and blotted by every goose's quill"

    w1 = to_words(t1)
    w2 = to_words(t2)

    for n in range(1,10):
        matches = list(sequences_in_common(w1, w2, n))
        if matches:
            print(n, '-->', max(matches, key=len))

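【Comments】:

This is extremely helpful! Your suggestion gave me the chance to learn about dictionaries, enumerate(), and get(), all of which I'd come across before but hadn't had the chance to sit down and study until now. Thank you for this most helpful answer!

【Solution 3】:

The difflib module is a good fit for this case; see get_matching_blocks: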

import difflib

def matches(first_string,second_string):
    s = difflib.SequenceMatcher(None, first_string,second_string)
    match = [first_string[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
    return match

first_string = "this is a sample string"
second_string = "this is also a sample string"
print(matches(second_string, first_string))

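Demo: http://ideone.com/Ca3h8Z

【Comments】:

This is helpful, but it doesn't identify all of the maximally long common n-word substrings, which is what I'm after. See above for details...

【Solution 4】: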

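A slight modification, assuming you want to match words rather than characters:

import difflib

# Word-level variant of matches(): compares lists of words, not characters.
def matches(first_string, second_string):
    l1 = first_string.split()
    l2 = second_string.split()
    s = difflib.SequenceMatcher(None, l1, l2)
    match = [l1[i:i+n] for i, j, n in s.get_matching_blocks() if n > 0]
    return match

Demo:

>>> print('\n'.join(map(' '.join, matches(a, b))))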
all
white a sheet of
spotless paper when
first are born but
are to be scrawled
and blotted by every
quill

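【Comments】:

The first "are" in each string should also match, but it doesn't with this script.
@duhaime Not so sure, since all matches too, and are comes after all in the first string but before it in the second.
@duhaime This is an asymmetric matcher. You can use matches(b, a), where are is present but all is not.
I think all should match for the same reason are should match: both are maximally long n-word units that appear in both strings. Don't you agree?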
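The asymmetry mentioned in these comments can be reproduced on a shortened pair of inputs. This is a minimal sketch, assuming Python 3; word_blocks is a hypothetical helper that mirrors the word-level matcher from this answer but joins each block back into a string.

```python
import difflib

def word_blocks(first, second):
    """Return the matching word runs reported by SequenceMatcher."""
    l1, l2 = first.split(), second.split()
    s = difflib.SequenceMatcher(None, l1, l2)
    return [" ".join(l1[i:i + n])
            for i, j, n in s.get_matching_blocks() if n > 0]

x = "they all are white"
y = "you are all white"
print(word_blocks(x, y))
# ['all', 'white']
print(word_blocks(y, x))
# ['are', 'white']
```

get_matching_blocks() reports blocks that appear in the same order in both sequences, so "all" and "are", which swap places between the two inputs, cannot both be reported in a single call. Which one survives depends on the argument order.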