Faster way to print all starting indices of a substring in a string, including overlapping occurrences

【Question title】: Faster way to print all starting indices of a substring in a string, including overlapping occurrences
【Posted】: 2017-04-14 15:14:48
【Question】:

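I am trying to answer this homework question: find all occurrences of a pattern in a string. Different occurrences of the substring can overlap each other.

Example 1.

Input:

TACG

GT

Output:

Explanation: The pattern is longer than the text, so it has no occurrences in the text.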

Example 2.

Input:

ATA

ATATA

Output:

0 2

Explanation: The pattern appears at positions 0 and 2 (and these two occurrences overlap each other).

Example 3.

Input:

ATAT

GATATATGCATATACTT

Output:

1 3 9

Explanation: The pattern appears at positions 1, 3 and 9 in the text.

The answer I submitted is this:

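def all_indices(text, pattern):
    # Print every start index of pattern in text, one print call per match.
    i = text.find(pattern)
    while i >= 0:
        print(i, end=' ')
        # Resume one character later so overlapping matches are found.
        i = text.find(pattern, i + 1)


if __name__ == '__main__':
    # The examples give the pattern on the first line and the text on the second.
    pattern = input()
    text = input()
    all_indices(text, pattern)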

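However, this code fails the final test case:

Failed case #63/64: time limit exceeded (Time used: 7.98/4.00, memory used: 77647872/536870912.)

The online judge knows that I am submitting my answer in Python, and it applies different time limits for different languages.

I have searched quite a bit for other answers and approaches: regexes, suffix trees, Aho-Corasick... but so far none of them has beaten this simple solution (maybe because find is implemented in C?).

So my question is: is there a way to do this task faster?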
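
For reference, the regex approach mentioned above is usually written with a zero-width lookahead, so that overlapping occurrences are not skipped. The following is only a minimal sketch of that idea; the function name `all_indices_regex` is chosen here for illustration and is not code from the question, and nothing is claimed about whether it meets the judge's time limit.

```python
import re


def all_indices_regex(text, pattern):
    """Return every start index of pattern in text, overlaps included."""
    # re.escape guards against regex metacharacters in the pattern.
    # The lookahead (?=...) matches without consuming characters, so the
    # scan moves on one position at a time and overlapping hits are kept.
    lookahead = re.compile('(?=' + re.escape(pattern) + ')')
    return [m.start() for m in lookahead.finditer(text)]


if __name__ == '__main__':
    # Example 3 from the question: pattern ATAT in GATATATGCATATACTT.
    print(*all_indices_regex('GATATATGCATATACTT', 'ATAT'))
```

A plain `re.finditer(pattern, text)` without the lookahead would consume each match and therefore miss the second occurrence in Example 2.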

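【Question comments】:

I'm not sure. The error doesn't seem obvious to me. If you run it yourself, does it eventually finish? How long does it take?
Any algorithm that does this will take time at least proportional to the length of the text. So they can always make the text long enough to exceed a given time limit.
Did they tell you the maximum possible length of the strings?
The worst case is pattern = 'A', text = 'A' * 10**7
One optimization is to stop when i > len(text) - len(pattern)

Tags: python python-3.x string-matching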

【Solution 1】:

If print is what slows your program down the most, you should call it as rarely as possible. A quick and dirty solution:

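def all_indices(string, pattern):
    # Collect the indices as strings so they can be joined and printed once.
    result = []
    idx = string.find(pattern)
    while idx >= 0:
        result.append(str(idx))
        idx = string.find(pattern, idx + 1)  # idx + 1 keeps overlapping matches
    return result

if __name__ == '__main__':
    # The examples give the pattern on the first line and the text on the second.
    pattern = input()
    string = input()
    # A single print call instead of one per match.
    print(' '.join(all_indices(string, pattern)))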

In the future, to identify which part of your code is slowing the whole thing down, you can use the Python profilers.
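
As a rough illustration of that advice, the sketch below runs the question's per-match print loop under `cProfile` and returns the report sorted by cumulative time. The helper name `profile_report` and the all-'A' input are assumptions made for this example. The printed indices are redirected into an in-memory buffer, so the reported cost of print is only indicative and will differ from what a terminal or the judge's pipe would show.

```python
import cProfile
import io
import pstats
from contextlib import redirect_stdout


def all_indices(text, pattern):
    """The question's approach: one print call per match."""
    i = text.find(pattern)
    while i >= 0:
        print(i, end=' ')
        i = text.find(pattern, i + 1)


def profile_report(text, pattern, top=10):
    """Profile all_indices and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    sink = io.StringIO()  # swallow the printed indices while profiling
    with redirect_stdout(sink):
        profiler.enable()
        all_indices(text, pattern)
        profiler.disable()
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats('cumulative').print_stats(top)
    return report.getvalue()


if __name__ == '__main__':
    # Hypothetical input where every position is a match.
    print(profile_report('A' * 100_000, 'A'))
```

The report lists the built-in print and the str.find method as separate rows, which makes it possible to compare how much time each one accounts for.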

【Comments】:

【Solution 2】:

I believe the test cases are more lenient toward the Knuth-Morris-Pratt algorithm. This code, copied from https://en.wikibooks.org/wiki/Algorithm_Implementation/String_searching/Knuth-Morris-Pratt_pattern_matcher#Python, passes all cases:

# Knuth-Morris-Pratt string matching
# David Eppstein, UC Irvine, 1 Mar 2002

#from http://code.activestate.com/recipes/117214/
def KnuthMorrisPratt(text, pattern):

    '''Yields all starting positions of copies of the pattern in the text.
    Calling conventions are similar to string.find, but its arguments can be
    lists or iterators, not just strings, it returns all matches, not just
    the first one, and it does not need the whole text in memory at once.
    Whenever it yields, it will have read the text exactly up to and including
    the match that caused the yield.'''

    # allow indexing into pattern and protect against change during yield
    pattern = list(pattern)

    # build table of shift amounts
    shifts = [1] * (len(pattern) + 1)
    shift = 1
    for pos in range(len(pattern)):
        while shift <= pos and pattern[pos] != pattern[pos-shift]:
            shift += shifts[pos-shift]
        shifts[pos+1] = shift

    # do the actual search
    startPos = 0
    matchLen = 0
    for c in text:
        while matchLen == len(pattern) or \
              matchLen >= 0 and pattern[matchLen] != c:
            startPos += shifts[matchLen]
            matchLen -= shifts[matchLen]
        matchLen += 1
        if matchLen == len(pattern):
            yield startPos
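
The recipe above is a generator, so it still needs a driver that collects the positions and writes them out. The sketch below shows one way this could look, combined with the single-print idea from Solution 1. To stay self-contained it uses a compact prefix-function formulation of Knuth-Morris-Pratt rather than the recipe above; the names `kmp_indices` and `solve` are chosen here for illustration, and this is not claimed to be the exact code that was submitted.

```python
def kmp_indices(text, pattern):
    """Yield every start index of pattern in text (overlaps included)."""
    if not pattern:
        return
    # failure[i] = length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k

    k = 0  # number of pattern characters currently matched
    for pos, c in enumerate(text):
        while k and c != pattern[k]:
            k = failure[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            yield pos - k + 1
            k = failure[k - 1]  # fall back so overlapping matches are found


def solve(pattern, text):
    """Format all match positions as one space-separated line."""
    return ' '.join(map(str, kmp_indices(text, pattern)))


if __name__ == '__main__':
    # Example 3 from the question; prints: 1 3 9
    print(solve('ATAT', 'GATATATGCATATACTT'))
```

On the judge, the two arguments would come from `input()` (pattern first, then text, as in the examples) instead of being hard-coded.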

【Comments】: