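【Question Title】: Fast algorithm to find longest repeating substring [duplicate]
【Posted】: 2021-08-29 20:19:54
【Question Description】:

I am looking for a fast algorithm that finds the longest repeating substring (repeated at least once) in a given string, with the lowest possible time complexity and, if possible, low memory (RAM) usage.

I have seen some implementations, but most of them are not designed for large inputs (say 4k, 400k, 4M... characters). One example is this one: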

from collections import deque

def largest_substring_algo1(string):
    l = list(string)
    d = deque(string[1:])
    match = []
    longest_match = []
    while d:
        for i, item in enumerate(d):
            if l[i]==item:
                match.append(item)
            else:
                if len(longest_match) < len(match):
                    longest_match = match
                match = []
        d.popleft()
    return ''.join(longest_match)

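I have been testing it with a string made of 103440816326530612244897959183673469387755102040816326530612244897959183673469387755 repeated 100 times.

It works for small strings (

Edit: Is there a way to do this without loading the file (say, 20 GB) into memory?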

【Question Comments】:

Tags: python algorithm substring


【Solution 1】:
def main():
    from time import time
    data = '103440816326530612244897959183673469387755102040816326530612244897959183673469387755'*100

    start_time = time()
    ans1 = largest_substring_algo1(data)
    print(f'{time()-start_time}')
    # 3.889688014984131

    start_time = time()
    ans2 = longestDupSubstring(data)
    print(f'{time()-start_time}')
    # 0.014296770095825195

    print(ans1 == ans2)
    # True


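def longestDupSubstring(S):
    '''
    Binary search on the answer length, combined with a Rabin-Karp rolling hash.
    Ported from Python 2 to Python 3; original: https://leetcode.com/problems/longest-duplicate-substring/discuss/290871/Python-Binary-Search
    '''
    from functools import reduce

    # Map each character to an integer. Digits map to negative values here,
    # which is fine because every hash is reduced modulo `mod`.
    A = [ord(c) - ord('a') for c in S]
    mod = 2**63 - 1

    def test(L):
        # Return the start index of a length-L window whose hash was already seen, else None.
        # Hash matches are not verified, so a collision could give a false positive.
        p = pow(26, L, mod)
        cur = reduce(lambda x, y: (x * 26 + y) % mod, A[:L], 0)
        seen = {cur}
        for i in range(L, len(S)):
            # Slide the window one step: append A[i], drop A[i - L].
            cur = (cur * 26 + A[i] - A[i - L] * p) % mod
            if cur in seen:
                return i - L + 1
            seen.add(cur)

    # Binary search for the largest length that still has a repeated window.
    res, lo, hi = 0, 0, len(S)
    while lo < hi:
        mi = (lo + hi + 1) // 2
        pos = test(mi)
        if pos:
            lo = mi
            res = pos
        else:
            hi = mi - 1
    return S[res:res + lo]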


def largest_substring_algo1(string):

    from collections import deque
    l = list(string)
    d = deque(string[1:])
    match = []
    longest_match = []
    while d:
        for i, item in enumerate(d):
            if l[i] == item:
                match.append(item)
            else:
                if len(longest_match) < len(match):
                    longest_match = match
                match = []
        d.popleft()
    return ''.join(longest_match)


if __name__ == '__main__':
    main()
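
The rolling-hash search above treats two windows with the same hash as the same substring. A minimal sketch of a variant that confirms each hash match by comparing the actual text is shown here. The name `longest_dup_substring_checked` and the base and modulus values are illustrative assumptions, not part of the original answer, and the extra dictionary of start positions costs more memory than a plain set of hashes.

```python
def longest_dup_substring_checked(s):
    """Longest substring occurring at least twice (overlaps allowed).

    Hypothetical variant of the answer's approach: same binary search and
    rolling hash, but every hash match is verified against the text.
    """
    n = len(s)
    codes = [ord(c) for c in s]
    base, mod = 1_000_003, (1 << 61) - 1  # assumed constants, chosen for illustration

    def find(length):
        # Return the start of a verified repeated window of this length, else -1.
        high = pow(base, length - 1, mod)  # weight of the character leaving the window
        h = 0
        for c in codes[:length]:
            h = (h * base + c) % mod
        starts = {h: [0]}  # hash -> start indices of windows with that hash
        for i in range(1, n - length + 1):
            h = ((h - codes[i - 1] * high) * base + codes[i + length - 1]) % mod
            candidates = starts.get(h)
            if candidates:
                window = s[i:i + length]
                if any(s[j:j + length] == window for j in candidates):
                    return i
                candidates.append(i)  # genuine collision: remember this window too
            else:
                starts[h] = [i]
        return -1

    # A repeat must be shorter than the whole string, so search lengths 1..n-1.
    lo, hi, best = 0, n - 1, 0
    while lo < hi:
        mid = (lo + hi + 1) // 2
        pos = find(mid)
        if pos >= 0:
            lo, best = mid, pos
        else:
            hi = mid - 1
    return s[best:best + lo]


print(longest_dup_substring_checked("banana"))  # ana
```

The text comparison only runs when a hash has been seen before, so on inputs without collisions the running time should stay close to that of the unverified version.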

【Comments】:

  • This is fast, but why does it return almost the entire string instead of the repeated part?
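
One likely explanation for the comment above: the search counts overlapping occurrences. When a string is a block repeated k times, the first k-1 blocks and the last k-1 blocks are the same text, so a repeat covering at least (k-1)/k of the string always exists. The brute-force sketch below illustrates the difference on a tiny input; `longest_repeat` and its `allow_overlap` flag are hypothetical names introduced for this example, and the function is far too slow for the large inputs in the question.

```python
def longest_repeat(s, allow_overlap=True):
    """Brute-force reference: longest substring that occurs at least twice.

    With allow_overlap=False the two occurrences must not share characters.
    """
    n = len(s)
    for length in range(n - 1, 0, -1):  # try the longest candidates first
        first_seen = {}  # substring -> earliest start index
        for i in range(n - length + 1):
            sub = s[i:i + length]
            j = first_seen.setdefault(sub, i)
            # j != i means the substring appeared earlier, at index j.
            if j != i and (allow_overlap or i - j >= length):
                return sub
    return ""


print(longest_repeat("abcabcabc"))                       # abcabc
print(longest_repeat("abcabcabc", allow_overlap=False))  # abc
```

If only the repeated block is wanted, requiring non-overlapping occurrences (as in the second call) is one way to get it.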