Start and end indices of the matched strings after comparison: answers

【Question title】: Start and End indices of the matched strings after comparison
【Posted】: 2022-01-25 15:20:05
【Question description】:

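I am trying to create two lists containing the "start" and "end" indices of the stretches where two strings match. In this case the two strings have equal length. For example: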

str1='ATGGATCGATCG'
str2='CGGGCGCGCGCG'

Here, the matched stretches are: GG, CG, CG
I want output of the following kind:

list = [2,3,6,7,10,11] #list of the matched indices
start = [2,6,10] #start indices of the matched lengths
end = [3,7,11] #end indices of the matched lengths

Right now my code looks like the following, but I would like to get the indices that locate the matched sequences.

str1='ATGGATCGATCG'
str2='CGGGCGCGCGCG'

result1 = ''
result2 = ''

#handle the case where one string is longer than the other
maxlen=len(str2) if len(str1)<len(str2) else len(str1)

#loop through the characters
for i in range(maxlen): 
    letter1=str1[i:i+1]
    letter2=str2[i:i+1]
    if ((letter1 == letter2) and letter1 in ['A','T','C','G'] and letter2 in ['A','T','C','G']):
        result1+=letter1
        result2+=letter2

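【Question discussion】:

Do you want matches of any length, or are you only looking for matches of length 2?
Do your matches always come in pairs? For example, if your strings are 'AAAB' and 'AAAC', which list of indices do you want: [1, 2, 3] or [1, 3]?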
@MoinuddinQuadri [1, 3]

Tags: python string list

【Solution 1】:

This is really crying out for zip:

str1='ATGGATCGATCG'
str2='CGGGCGCGCGCG'

matches = []
for i,(a,b) in enumerate(zip(str1,str2)):
    if a == b:
        if not matches or matches[-1][1] != i-1:
            matches.append([i,i])
        else:
            matches[-1][1] += 1

print(matches)
starts = [k[0] for k in matches]
ends   = [k[1] for k in matches]

Output:

[[2, 3], [6, 7], [10, 11]]

This will also capture single-character matches. If you don't want those, you can filter them out afterwards with a quick loop.
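
As a rough sketch of that filtering step: the function names `match_runs` and `drop_short_runs` and the `min_len` parameter are made up for this illustration and are not part of the answer. The first function repeats the zip loop above; the second keeps only runs of at least `min_len` characters.

```python
def match_runs(s, t):
    """Return [start, end] pairs (inclusive) for each run of equal characters."""
    runs = []
    for i, (a, b) in enumerate(zip(s, t)):
        if a == b:
            if runs and runs[-1][1] == i - 1:
                runs[-1][1] = i          # extend the current run
            else:
                runs.append([i, i])      # start a new run
    return runs


def drop_short_runs(runs, min_len=2):
    """Keep only runs spanning at least min_len characters (hypothetical helper)."""
    return [r for r in runs if r[1] - r[0] + 1 >= min_len]


runs = match_runs('AAXA', 'AAYA')   # one two-character run and one single match
kept = drop_short_runs(runs)        # the single-character run is removed
print(runs, kept)
```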

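【Discussion】:

【Solution 2】:

Let's start with a helper function that computes the length of the common prefix of the two strings starting at a given index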

def helper(index, str1, str2):
    length = 0
    try:
        while str1[index] == str2[index]: #and other needed conditions
            length += 1
            index += 1
    except IndexError:
        pass
    return length

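Now we want to use it while iterating

index = 0
result = []
while index < min(len(str1), len(str2)):
    length = helper(index, str1, str2)
    if length > 0:
        # (start, end) with an exclusive end; use index + length - 1 for an inclusive end
        result.append((index, index + length))
        index += length + 1 # We can omit one character as it was checked in helper
    else:
        index += 1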
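
Put together, the idea might look like the sketch below. It is only an illustration: `common_run_length` and `find_runs` are names chosen here, the bounds are checked explicitly instead of catching `IndexError`, and the end indices are made inclusive to match the output format requested in the question.

```python
def common_run_length(index, s, t):
    """Length of the stretch where s and t agree, starting at index."""
    length = 0
    while index < len(s) and index < len(t) and s[index] == t[index]:
        length += 1
        index += 1
    return length


def find_runs(s, t):
    """Return (starts, ends) with inclusive end indices (assumed output format)."""
    starts, ends = [], []
    index = 0
    limit = min(len(s), len(t))
    while index < limit:
        length = common_run_length(index, s, t)
        if length > 0:
            starts.append(index)
            ends.append(index + length - 1)
            # The character right after the run is known to differ, so skip it too.
            index += length + 1
        else:
            index += 1
    return starts, ends


starts, ends = find_runs('ATGGATCGATCG', 'CGGGCGCGCGCG')
print(starts, ends)
```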

【Discussion】:

【Solution 3】:

You can also do something similar with a regular expression.

import re
str1='ATGGATCGATCG'
str2='CGGGCGCGCGCG'

pat = 'GG|CG|CG'

matches = [[(m.span()[0],m.span()[1]-1) for m in re.finditer(pat,x)] for x in [str1,str2]]

m = set(matches[0]) & set(matches[1])
starts= [x[0] for x in m]
ends= [x[1] for x in m]

print(m,starts,ends, sep='\n')

Output

{(2, 3), (6, 7), (10, 11)}
[2, 6, 10]
[3, 7, 11]
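
One caveat with the set-based version is that a set does not guarantee iteration order, so the start and end lists are not guaranteed to come out sorted. The variation below is a sketch under that concern: it sorts the intersection before splitting it. The helper name `spans` is made up here, and the pattern is shortened to 'GG|CG' because the repeated alternative adds nothing. Like the original, it only works when the patterns of interest are known in advance.

```python
import re

str1 = 'ATGGATCGATCG'
str2 = 'CGGGCGCGCGCG'
pat = 'GG|CG'


def spans(pattern, text):
    """Set of (start, inclusive end) pairs for every non-overlapping match."""
    return {(m.start(), m.end() - 1) for m in re.finditer(pattern, text)}


# Spans found at the same position in both strings, in ascending order.
common = sorted(spans(pat, str1) & spans(pat, str2))
starts = [s for s, _ in common]
ends = [e for _, e in common]
print(common, starts, ends, sep='\n')
```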

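【Discussion】:

【Solution 4】:

You can also use numpy.split to split the index list wherever it stops being consecutive, and get the desired result in two lines: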

import numpy as np

lst = [i for i, (s1,s2) in enumerate(zip(str1, str2)) if s1==s2]
start, end = zip(*[(arr[0], arr[-1]) for arr in np.split(lst, np.where(np.diff(lst) != 1)[0] + 1)])

Output:

((2, 6, 10), (3, 7, 11))
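
To make the splitting step easier to follow, here is a plain-Python stand-in for the same idea. It is not the NumPy code from the answer: `split_consecutive` is a made-up helper that starts a new group whenever the gap to the previous index is not exactly 1, which is the condition `np.diff(lst) != 1` tests.

```python
def split_consecutive(indices):
    """Split a sorted list of ints into runs of consecutive values."""
    runs = []
    for idx in indices:
        if runs and idx - runs[-1][-1] == 1:
            runs[-1].append(idx)     # still consecutive: extend the current run
        else:
            runs.append([idx])       # gap found: start a new run
    return runs


str1 = 'ATGGATCGATCG'
str2 = 'CGGGCGCGCGCG'

# Positions where both strings hold the same character.
lst = [i for i, (s1, s2) in enumerate(zip(str1, str2)) if s1 == s2]
runs = split_consecutive(lst)
start = [r[0] for r in runs]
end = [r[-1] for r in runs]
print(lst, start, end)
```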

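【Discussion】:

【Solution 5】:

Two corrections to your code: 1) max() is a built-in, so there is no need for the if expression; 2) strings are already sequences, so "a" in "bbbbabb" already returns True, and there is no need to put each letter into a list.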
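
A small sketch of both points, using the strings from the question. One thing to keep in mind is that `in` on a string is a substring test, so the empty string also counts as contained; the variable names here are only illustrative.

```python
str1 = 'ATGGATCGATCG'
str2 = 'CGGGCGCGCGCG'

# 1) max() replaces the conditional expression from the question.
maxlen = max(len(str1), len(str2))

# 2) Membership can be tested against a string directly.
is_base = 'G' in 'ATCG'
not_base = 'X' in 'ATCG'

# Caveat: this is a substring test, so the empty string is "in" any string.
empty_is_member = '' in 'ATCG'

print(maxlen, is_base, not_base, empty_is_member)
```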

It looks like you need a function that determines how far two strings agree at their start.

import itertools as it
def f(s,t): 
    return sum(it.takewhile(bool,map(lambda z:z[0]==z[1],zip(s,t))))

With such a function, we can now find all simultaneous matches of any length between the strings, as you described:

str1='ATGGATCGATCG'
str2='CGGGCGCGCGCG'

matches = [(i,i+l-1) for i,(a,b) in enumerate(zip(str1,str2)) if (l:=f(str1[i:],str2[i:]))>=2]
print(matches)
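
Note that the comprehension checks every position, so a run longer than two characters is also reported from its later positions: for 'AAAB' and 'AAAC' it yields both (0, 2) and (1, 2). If only maximal runs are wanted, one possible variation (a sketch, with `maximal_matches` and `min_len` as made-up names) is to jump past each run once it has been measured:

```python
import itertools as it


def f(s, t):
    """Number of leading positions at which s and t agree."""
    return sum(it.takewhile(bool, map(lambda z: z[0] == z[1], zip(s, t))))


def maximal_matches(s, t, min_len=2):
    """(start, inclusive end) of each maximal run of at least min_len characters."""
    matches = []
    i = 0
    n = min(len(s), len(t))
    while i < n:
        l = f(s[i:], t[i:])
        if l >= min_len:
            matches.append((i, i + l - 1))
        i += max(l, 1)   # skip the whole run, or one character if there is no match
    return matches


print(maximal_matches('ATGGATCGATCG', 'CGGGCGCGCGCG'))
print(maximal_matches('AAAB', 'AAAC'))
```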

【Discussion】: