Fast algorithm to find the longest repeating substring [duplicate]: answers

【Question title】: Fast algorithm to find longest repeating substring [duplicate]
【Posted】: 2021-08-29 20:19:54
【Question】:

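I am looking for a fast algorithm that finds the longest repeating substring (repeated at least once) in a given string, with time complexity and, if possible, memory (RAM) usage kept as low as possible.

I have seen some implementations, but most of them are not designed for large inputs (say 4k, 400k, 4m... characters). One example is this one: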

from collections import deque

def largest_substring_algo1(string):
    l = list(string)
    d = deque(string[1:])
    match = []
    longest_match = []
    while d:
        for i, item in enumerate(d):
            if l[i]==item:
                match.append(item)
            else:
                if len(longest_match) < len(match):
                    longest_match = match
                match = []
        d.popleft()
    return ''.join(longest_match)

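I have been testing it with a string made of 103440816326530612244897959183673469387755102040816326530612244897959183673469387755 repeated 100 times.

It works for small strings (

Edit: Is there a way to do this without loading the file (say, 20 GB) into memory?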

【Comments】:

Does this answer your question? Find longest repetitive sequence in a string

Tags: python algorithm substring

【Solution 1】:

def main():
    from time import time
    data = '103440816326530612244897959183673469387755102040816326530612244897959183673469387755'*100

    start_time = time()
    ans1 = largest_substring_algo1(data)
    print(f'{time()-start_time}')
    # 3.889688014984131

    start_time = time()
    ans2 = longestDupSubstring(data)
    print(f'{time()-start_time}')
    # 0.014296770095825195

    print(ans1 == ans2)
    # True


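def longestDupSubstring(S):
    '''
    Binary search on the answer length, with a Rabin-Karp rolling hash per length.
    Ported from Python 2 to Python 3; original:
    https://leetcode.com/problems/longest-duplicate-substring/discuss/290871/Python-Binary-Search
    Note: matches are accepted on hash equality alone, so a collision could
    in principle produce a wrong result.
    '''
    from functools import reduce
    # Character codes relative to 'a'; digits map to negative values, which
    # still works because every hash is reduced modulo `mod`.
    A = [ord(c) - ord('a') for c in S]
    mod = 2**63 - 1

    def test(L):
        # Return the start of a length-L substring whose hash was already seen, else None.
        p = pow(26, L, mod)
        cur = reduce(lambda x, y: (x * 26 + y) % mod, A[:L], 0)
        seen = {cur}
        for i in range(L, len(S)):
            # Slide the window: append A[i], drop A[i - L].
            cur = (cur * 26 + A[i] - A[i - L] * p) % mod
            if cur in seen:
                return i - L + 1
            seen.add(cur)

    res, lo, hi = 0, 0, len(S)
    while lo < hi:
        mi = (lo + hi + 1) // 2
        pos = test(mi)
        if pos:  # pos is never 0: a repeat is always reported at its later start
            lo = mi
            res = pos
        else:
            hi = mi - 1
    return S[res:res + lo]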


def largest_substring_algo1(string):

    from collections import deque
    l = list(string)
    d = deque(string[1:])
    match = []
    longest_match = []
    while d:
        for i, item in enumerate(d):
            if l[i] == item:
                match.append(item)
            else:
                if len(longest_match) < len(match):
                    longest_match = match
                match = []
        d.popleft()
    return ''.join(longest_match)


if __name__ == '__main__':
    main()
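
The `test` helper above treats two substrings as equal as soon as their hashes agree. The following is a minimal sketch, not the linked LeetCode solution, of the same binary search with rolling hash that additionally compares the actual substrings before accepting a match. The function name `longest_dup_substring_checked` and the choices of `base` and `mod` are assumptions made for illustration. Because every candidate is verified, the result does not depend on the hash being collision-free.

```python
from collections import defaultdict


def longest_dup_substring_checked(s):
    """Longest substring occurring at least twice (occurrences may overlap)."""
    codes = [ord(c) for c in s]
    n = len(s)
    base, mod = 256, (1 << 61) - 1  # assumed values; any base works since matches are verified

    def find(length):
        # Return the start index of a length-`length` substring that occurs again later, or -1.
        high = pow(base, length - 1, mod)  # weight of the character leaving the window
        h = 0
        for c in codes[:length]:
            h = (h * base + c) % mod
        seen = defaultdict(list)  # hash -> start indices with that hash
        seen[h].append(0)
        for start in range(1, n - length + 1):
            # Drop codes[start - 1], shift, and append the new last character.
            h = ((h - codes[start - 1] * high) * base + codes[start + length - 1]) % mod
            for prev in seen[h]:
                # Confirm with a real comparison to rule out hash collisions.
                if s[prev:prev + length] == s[start:start + length]:
                    return prev
            seen[h].append(start)
        return -1

    # If a repeat of length L exists, one of length L - 1 exists too,
    # so the largest feasible length can be found by binary search.
    lo, hi, best = 0, n - 1, 0
    while lo < hi:
        mid = (lo + hi + 1) // 2
        pos = find(mid)
        if pos >= 0:
            lo, best = mid, pos
        else:
            hi = mid - 1
    return s[best:best + lo]
```

For example, `longest_dup_substring_checked('banana')` returns `'ana'`, whose two occurrences overlap. The verification costs one extra substring comparison per hash hit, which is rare unless the substrings really are equal.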

【Comments】:

This is fast, but why does it return almost the entire string instead of the repeated part?
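
The thread does not answer this, so the following is an interpretation rather than the answerer's explanation: the test string is one block repeated 100 times, and the search allows the two occurrences to overlap. A copy of the string shifted by one block then matches everything except that one block, so the longest repeat is nearly the whole input. The brute-force sketch below (the helper name `longest_repeat_bruteforce` and its `allow_overlap` flag are hypothetical, and the cubic running time makes it suitable for tiny inputs only) shows the difference on a small periodic string.

```python
def longest_repeat_bruteforce(s, allow_overlap=True):
    """Longest substring occurring at least twice; for illustration only."""
    n = len(s)
    for length in range(n - 1, 0, -1):  # try the longest candidates first
        for i in range(n - length + 1):
            # The second occurrence may start right after i (overlap allowed)
            # or only after the first occurrence has ended (disjoint).
            first_j = i + 1 if allow_overlap else i + length
            if s.find(s[i:i + length], first_j) != -1:
                return s[i:i + length]
    return ''


s = 'abc' * 4  # 'abcabcabcabc', a block repeated 4 times
print(longest_repeat_bruteforce(s))                       # abcabcabc
print(longest_repeat_bruteforce(s, allow_overlap=False))  # abcabc
```

With overlap allowed, the result is the input minus one block. Requiring disjoint occurrences limits it to at most half the input.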