【Question title】: How to count the biggest consecutive occurrences of a substring in a string?
【Posted】: 2021-08-04 12:07:52
【Question description】:

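I'm working on an exercise (cs50 - DNA) where I have to count specific consecutive substrings (STRs) in a simulated DNA sequence. I've found myself overcomplicating the code, and I'm having a hard time figuring out how to proceed.

I have a list of substrings: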

strs = ['AGATC', 'AATG', 'TATC']

And a string with a random sequence of letters:

AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG

For each of the strs, I want to count the longest run of consecutive matching substrings.

So:

  • 'AGATC' - AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAAGGAGGGATAGAAGG

  • 'AATG' - AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAAGGAGGGATAGAAGG

  • 'TATC' - AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAAGGAGGGATAGAAGG

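Resulting in [4, 1, 5]

(Note that this isn't the best example, since there are no stray repeated patterns lying around, but I think it illustrates what I'm looking for.)

I know I should use something like re.match(rf"({strs}){2,}", string), because str.count(strs) gives me all occurrences, consecutive and non-consecutive alike.

My code so far: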

#!/usr/bin/env python3
import csv
import sys
from cs50 import get_string

# sys.exit to terminate the program
# sys.exit(2) UNIX default for wrong args
if len(sys.argv) != 3:
    print("Usage: python dna.py data.csv sequence.txt")
    sys.exit(2)

# open file, make it into a list, get STRS, remove header
with open(sys.argv[1], "r") as database:
    data = list(csv.reader(database))
    STRS = data[0]
    data.pop(0)

# remove "name" so only thing remaining are STRs
STRS.pop(0)

# open file to compare against db
with open(sys.argv[2], "r") as seq:
    sequence = seq.read()

sequenceCount = []

# for each STR count the occurrences
# sequence.count(s) returns all occurrences, not only consecutive ones
for s in STRS:
    sequenceCount.append(sequence.count(s))

print(STRS)
print(sequenceCount)

"""
sequenceCount = {}

# for each STR count the occurrences
for s in STRS:
    sequenceCount[s] = sequence.count(s)

for line in data:
    print(line)
    for item in line[1:]:
        continue


# rf"({STRS}){2,}"
"""

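【Comments】:

  • @MiguelP So am I right that you only want to find "consecutive" (back2back) matches of a sequence, and not the one or two that exist "in isolation" in the dna strand?
  • @OmarAlSuwaidi Yes!

Tags: python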


【Solution 1】:

A regular expression that finds a repeated string looks like r"(AGATC)+".

For example,

import re

sequence = "AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG"
pattern = "AGATC"

r = re.search(r"({})+".format(pattern), sequence)

if r:
    print("start at", r.start())
    print("end at", r.end())

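If a match is found, the start and end positions are available through the .start() and .end() methods. You can use them to calculate the number of repeats.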
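As a rough sketch of that calculation: each repeat is len(pattern) characters long, so dividing the matched length by it gives the repeat count. The short `sequence` below is made up for illustration, and `re.escape` is added here only as a precaution; neither is part of the original answer.

```python
import re

# Made-up short sequence for illustration: "AGATC" repeated 3 times in the middle.
sequence = "GG" + "AGATC" * 3 + "TT"
pattern = "AGATC"

# re.escape is a precaution in case the pattern ever contains regex metacharacters.
r = re.search(r"({})+".format(re.escape(pattern)), sequence)

repeats = 0
if r:
    # Each repeat is len(pattern) characters long.
    repeats = (r.end() - r.start()) // len(pattern)

print(repeats)
```

Keep in mind that re.search only reports the first run it encounters, which is not necessarily the longest one in the sequence.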

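If you need to find all the matches in the sequence, you can use re.finditer, which yields the match objects one at a time.

You can then loop over the target patterns and, for each one, find the longest match.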
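Putting those two steps together might look roughly like the sketch below. The function name `longest_runs` and the test sequence are hypothetical, chosen only for illustration; the non-capturing group `(?:...)` and `re.escape` are small additions that are not in the answer's code.

```python
import re


def longest_runs(sequence, patterns):
    """Return, for each pattern, the largest number of back-to-back repeats."""
    result = []
    for pattern in patterns:
        # (?:...)+ matches one or more consecutive copies of the pattern.
        regex = re.compile(r"(?:{})+".format(re.escape(pattern)))
        # Length of each run divided by the pattern length gives its repeat count.
        runs = [len(m.group(0)) // len(pattern) for m in regex.finditer(sequence)]
        # default=0 covers patterns that never occur.
        result.append(max(runs, default=0))
    return result


# Made-up sequence: 4 x AGATC, 5 x TATC, then a single AATG and a lone AGATC.
sequence = "GG" + "AGATC" * 4 + "TATC" * 5 + "AG" + "AATG" + "CC" + "AGATC"
print(longest_runs(sequence, ["AGATC", "AATG", "TATC"]))  # [4, 1, 5]
```

Two caveats: re.finditer returns non-overlapping matches, so a pattern that can overlap with itself may need a position-by-position scan instead. And if you build the pattern with an f-string, a literal quantifier such as {2,} has to be written as {{2,}}, because f-strings treat single braces as replacement fields.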

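【Comments】:

    【Solution 2】:

    Two for loops are used here: one takes each string (sequence) from strs, and the other iterates over our dna strand to match each string from strs. If a match is found, a while loop keeps looking for consecutive (back2back) matches. (Inline comments are added to briefly explain each step.)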

    dna = 'AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATAGATCTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG'
    strs = ['AGATC', 'AATG', 'TATC']
    
    
    def seq_finder(sequence, dna):
        counter = [0] * len(sequence)  # Create a list of zeros to store the longest run per sequence
        for idx, seq in enumerate(sequence):  # Iterate over every entry in our sequence "strs"
            k = len(seq)
            for i in range(len(dna) - k + 1):  # For each sequence, try every start position in "dna"
                holder = 0  # Temporary holder for the #occurrences of *consecutive* sequences from i
                j = i  # Separate index, so the while loop does not disturb the for loop
                while dna[j:j+k] == seq:  # While the next k letters match (back2back):
                    holder += 1  # Increment our holder by 1
                    j += k  # Jump to where the next consecutive match would start
                if holder > counter[idx]:
                    counter[idx] = holder  # Only replace counter if new holder > old holder
        return counter
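    For the dictionary form sketched in the question's commented-out code, the same scanning idea can be written per STR. This is only a minimal sketch: `longest_run` and `sequence_count` are hypothetical names, the strand is made up for illustration, and every STR is assumed to be non-empty.

```python
def longest_run(dna, seq):
    """Longest number of back-to-back copies of seq found anywhere in dna.

    Assumes seq is non-empty (an empty seq would never leave the while loop).
    """
    k = len(seq)
    best = 0
    for i in range(len(dna) - k + 1):
        count = 0
        j = i
        # Count copies of seq starting at i, stepping one full copy at a time.
        while dna[j:j + k] == seq:
            count += 1
            j += k
        best = max(best, count)
    return best


# Made-up strand for illustration.
dna = "GG" + "AGATC" * 4 + "TATC" * 5 + "AG" + "AATG" + "CC"
strs = ["AGATC", "AATG", "TATC"]

# Map each STR to its longest consecutive run.
sequence_count = {s: longest_run(dna, s) for s in strs}
print(sequence_count)  # {'AGATC': 4, 'AATG': 1, 'TATC': 5}
```

    Trying every start position is slower than skipping past a run that has already been counted, but it also gives the right answer for an STR that overlaps with itself.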
    

    【Comments】:
