How to find the max number of times a sequence of characters repeats consecutively in a string? [duplicate]

【Question title】: How to find the max number of times a sequence of characters repeats consecutively in a string? [duplicate]
【Posted】: 2020-07-15 13:49:55
【Question description】:

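I'm working on the cs50/pset6/dna project. I'm struggling to find a way to analyze a string and get the maximum number of times a given sequence of characters repeats consecutively. Here is an example:

String: JOKHCNHBVDBVDBVDJHGSBVDBVD

Sequence of characters I should look for: BVD

Result: my function should return 3, because at one point BVD repeats 3 times in a row. It also repeats 2 times in a row later on, but I need the longest run.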

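【Comments】:

Should we consider overlaps? For example: if the given sequence is "ABA" and the search space is "ABABA", what is the answer?
Hey! The answer should still be 1. Great observation!
@Axe319, I don't think so, because the substrings have to be consecutive...
Oh, never mind. I missed that requirement.
You haven't posted any code for us to help with.

Tags: python python-3.x string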

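【Solution 1】:

This is a bit lame, but one "brute force" approach is to check for the longest possible run of the pattern first and work downwards. As soon as a run is found in the string, stop searching:

EDIT - using a function might be more straightforward: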

def get_longest_repeating_pattern(string, pattern):
    if not pattern:
        return ""
    for i in range(len(string)//len(pattern), 0, -1):
        current_pattern = pattern * i
        if current_pattern in string:
            return current_pattern
    return ""

string = "JOKHCNHBVDBVDBVDJHGSBVDBVD"
pattern = "BVD"


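longest_repeating_pattern = get_longest_repeating_pattern(string, pattern)
# Divide by the pattern length to get the number of repetitions rather than the number of characters
print(len(longest_repeating_pattern) // len(pattern))  # 3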

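EDIT - explanation:

First, there is just a simple for loop that starts at a larger number and counts down to a smaller one. For example, we start at 5 and go down to 0 (not including 0), with a step of -1: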

>>> for i in range(5, 0, -1):
    print(i)

    
5
4
3
2
1
>>>

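If string = "JOKHCNHBVDBVDBVDJHGSBVDBVD", then len(string) is 26, and if pattern = "BVD", then len(pattern) is 3.

Back to my original code:

for i in range(len(string)//len(pattern), 0, -1):

Plugging in the numbers:

for i in range(26//3, 0, -1):

26//3 is integer division, which yields 8, so this becomes:

for i in range(8, 0, -1):

So this is a for loop that goes from 8 down to 1 (remember, it does not go down to 0). i takes a new value on each iteration: first 8, then 7, and so on.

In Python, you can "multiply" strings, like this: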

>>> pattern = "BVD"
>>> pattern * 1
'BVD'
>>> pattern * 2
'BVDBVD'
>>> pattern * 3
'BVDBVDBVD'
>>>
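
Putting the countdown loop and string multiplication together, here is a minimal sketch that returns the repeat count directly instead of the matched text. The function name max_repeats is hypothetical and not part of the original answer:

```python
def max_repeats(string, pattern):
    """Return the largest i such that pattern * i occurs somewhere in string."""
    if not pattern:
        return 0
    # Try the largest possible number of repetitions first and count down.
    for i in range(len(string) // len(pattern), 0, -1):
        if pattern * i in string:
            return i
    return 0  # the pattern does not occur at all


print(max_repeats("JOKHCNHBVDBVDBVDJHGSBVDBVD", "BVD"))  # 3
```

Each `in` check rescans the string, so this is fine for short inputs but does more work than a single pass would.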

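【Comments】:

Hey @Paul M. Could you explain what your code does? I don't understand this syntax: for i in range(len(string)//len(pattern), 1, -1): if pattern * i in string:
Thanks for the edit, @PaulM. I'll take a look!
@NicolasF I've edited my post and fixed two issues.
Hey @PaulM. I think there's just one problem. The else statement is incorrect...
@NicolasF The else statement only executes when your for loop completes without hitting a break.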
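
The for/else behaviour mentioned in the last comment can be sketched as follows. This is an illustrative variant rather than the code from an earlier revision of the answer, and the function name max_repeats_with_else is hypothetical:

```python
def max_repeats_with_else(string, pattern):
    if not pattern:
        return 0
    for i in range(len(string) // len(pattern), 0, -1):
        if pattern * i in string:
            break  # longest run found: the else branch is skipped
    else:
        # Reached only when the loop finishes (or never runs) without break,
        # which means the pattern was not found at all.
        i = 0
    return i


print(max_repeats_with_else("JOKHCNHBVDBVDBVDJHGSBVDBVD", "BVD"))  # 3
print(max_repeats_with_else("JOKHCNH", "BVD"))                     # 0
```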

【Solution 2】:

A slightly less brute-force solution:

string = 'JOKHCNHBVDBVDBVDJHGSBVDBVD'
key = 'BVD'

len_k = len(key)
max_l = 0      # longest run of consecutive matches seen so far
passes = 0     # positions still to skip after a match
curr_len = 0   # length of the current run

for i in range(len(string) - len_k + 1): # slide a window of the same length as key over the string
    if passes > 0: # a match was just found, so skip the positions inside it (if key is 'BVD', we ignore 'VD.' and 'D..')
        passes -= 1
        continue
    s = string[i:i+len_k]
    if s == key:
        curr_len += 1
        if curr_len > max_l:
            max_l = curr_len
        passes = len_k - 1
    else:
        curr_len = 0
    # the original prev_s check was redundant and raised a NameError when the string started with key

print(max_l)
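
The loop above locks onto the first match it sees and skips the positions inside it, so a longer run that starts at a different offset can be missed when the key overlaps with itself. As a hedged sketch of one alternative (not the answer author's method; the name max_run is hypothetical), the run length can be computed for every starting position by scanning from the right:

```python
def max_run(string, key):
    """Longest run of back-to-back copies of key, checking every start offset."""
    if not key:
        return 0
    n, k = len(string), len(key)
    # run[i] = number of consecutive copies of key that start exactly at index i
    run = [0] * (n + 1)
    best = 0
    for i in range(n - k, -1, -1):
        if string.startswith(key, i):
            run[i] = 1 + run[i + k]
            best = max(best, run[i])
    return best


print(max_run("JOKHCNHBVDBVDBVDJHGSBVDBVD", "BVD"))  # 3
print(max_run("ABABAABA", "ABA"))                    # 2 (the run starts at index 2)
```

For DNA-style keys such as BVD, which cannot overlap with themselves, both approaches give the same result.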

【Comments】:

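【Solution 3】:

You can do this easily, elegantly and efficiently with a regular expression.

We look for all the sequences in which your search string is repeated at least once. Then we just take the maximum length of these sequences and divide it by the length of the search string.

The regular expression we use is '(?:<your_sequence>)+': the group (<your_sequence>) repeated at least once (+). The ?: just makes the group non-capturing, so that findall returns the whole match rather than just the group.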
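
To see why the non-capturing group matters, here is a small illustration using the example string from the question. The variable names are only for demonstration:

```python
import re

data = "JOKHCNHBVDBVDBVDJHGSBVDBVD"

# Capturing group: findall returns only the group's last repetition per match.
capturing = re.findall(r'(BVD)+', data)

# Non-capturing group: findall returns each whole matched run.
non_capturing = re.findall(r'(?:BVD)+', data)

print(capturing)      # ['BVD', 'BVD']
print(non_capturing)  # ['BVDBVDBVD', 'BVDBVD']
```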

If there is no match, we use the default parameter of the max function to return 0.

The code is very short, then:

import re

def max_consecutive_repetitions(search, data):
    # re.escape keeps any regex metacharacters in the search string literal
    search_re = re.compile('(?:' + re.escape(search) + ')+')
    return max((len(seq) for seq in search_re.findall(data)), default=0) // len(search)

Example run:

print(max_consecutive_repetitions("BVD", "JOKHCNHBVDBVDBVDJHGSBVDBVD"))
# 3

【Comments】:

【Solution 4】:

Here's my contribution. I'm not a professional, but it works for me (sorry for my bad English).

results = {}
# Loop through all the STRs (the field name at index 0 is skipped)
for i in range(1, len(reader.fieldnames)):
    STR = reader.fieldnames[i]
    j = 0
    s = 0
    pre_s = 0
    # Loop through all the characters in sequence.txt
    # (<= so that a match ending exactly at the end of the sequence is also checked)
    while j <= (len(sequence) - len(STR)):
        # check whether the current character is the same as the first character of the STR
        if STR[0] == sequence[j]:
            # while the substring from j to j + len(STR) equals STR, we are in what I call a streak
            while sequence[j:(j + len(STR))] == STR:
                # j skips to the end of the substring
                j += len(STR)
                # streak counter
                s += 1
            # s > 0 means that the whole STR matched the sequence at least once
            if s > 0:
                # save the largest streak as pre_s
                if s > pre_s:
                    pre_s = s
                # reset the streak counter to continue exploring the sequence
                s = 0
        j += 1
    # store pre_s in a dictionary with the current STR as key
    results[STR] = pre_s
print(results)
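
The snippet above relies on reader and sequence being defined elsewhere in the program. As a self-contained sketch of the same streak-counting idea, the version below uses made-up data; the sequence, the STR names and the function name longest_streak are all hypothetical stand-ins for the values read from the pset's files:

```python
def longest_streak(sequence, str_name):
    """Longest run of consecutive copies of str_name found by a left-to-right scan."""
    if not str_name:
        return 0
    best = 0
    j = 0
    while j <= len(sequence) - len(str_name):
        streak = 0
        # Extend the streak while another copy of the STR starts at j.
        while sequence.startswith(str_name, j):
            j += len(str_name)
            streak += 1
        best = max(best, streak)
        j += 1  # position j does not start a match here, so move on
    return best


# Hypothetical data standing in for sequence.txt and the CSV field names.
sequence = "AGATCAGATCAGATCTTTTAATGAATG"
str_names = ["AGATC", "AATG", "TATC"]

results = {name: longest_streak(sequence, name) for name in str_names}
print(results)  # {'AGATC': 3, 'AATG': 2, 'TATC': 0}
```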

【Comments】: