CS50 PSET6 DNA: no match when using regex to count STR

[Question title]: CS50 PSET6 DNA no match using regex to count STR
[Posted]: 2020-06-23 16:18:23
[Question description]:

I have been stuck on this for a long time and would appreciate some hints.

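The problem can be reduced to finding the maximum number of consecutive occurrences of a pattern in a string. With the pattern AATG and a string like ATAATGAATGAATGGAATG, the correct result should be 3. I tried using re.compile() to count occurrences of the pattern. From the documentation I found that to match consecutive occurrences I probably need the special character +. For a pattern like AATG, that means using re.compile(r'(AATG)+') rather than re.compile(r'AATG'); otherwise the occurrences are over-counted. In this program, however, the pattern is not a fixed string; I hold it in a variable. I tried many ways of getting it into re.compile(), but none of them worked. Can anyone tell me the correct way to format it (in the function def countSTR below)?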

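After that, I believe finditer(the_string_to_be_analysis) should return an iterator over all the matches found. I then use match.end() - match.start() to get the length of each match and compare the lengths to find the longest run of consecutive repeats. Maybe something goes wrong there?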
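
To make the intended behaviour concrete for the fixed-pattern case, here is a minimal sketch of the two approaches described above. The variable names (sequence, motif, run_lengths, and so on) are chosen only for illustration and are not part of the program below.

```python
import re

sequence = "ATAATGAATGAATGGAATG"
motif = "AATG"  # fixed string here; in the real program this is a variable

# Plain pattern: counts every occurrence, whether consecutive or not.
total = len(re.findall(r"AATG", sequence))

# Grouped pattern with +: each match covers one whole run of repeats.
run_pattern = re.compile(r"(AATG)+")
run_lengths = [m.end() - m.start() for m in run_pattern.finditer(sequence)]

# Convert the longest run from characters to number of repeats.
longest_run = max(run_lengths, default=0) // len(motif)

print(total, run_lengths, longest_run)
```

The plain pattern also counts the isolated AATG at the end of the string, while the grouped pattern reports the run of three and the single repeat as two separate matches.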

Code attached. Any input is appreciated!

from sys import argv, exit
import csv
import re

def main():
    if len(argv) != 3:
        print("Usage: python dna.py data.csv sequence.txt")
        exit(1)

    # read DNA sequence
    with open(argv[2], "r") as file:
        if file.mode != 'r':
            print(f"database {argv[2]} can not be read")
            exit(1)
        sequence = file.read()

    # read database.csv
    with open(argv[1], newline='') as file:
        if file.mode != 'r':
            print(f"database {argv[1]} can not be read")
            exit(1)
        # get the heading of the csv file in order to obtain STRs
        csv_reader = csv.reader(file)
        headings = next(csv_reader)
        # dictionary to store STRs match result of DNA-sequence
        STR_counter = {}
        for STR in headings[1::]:
            # entry result accounting to the STR keys
            STR_counter[STR] = countSTR(STR, sequence)
    # read csv file as a dictionary
    with open(argv[1], newline='') as file:
        database = csv.DictReader(file)
        for row in database:
            count = 0
            for STR in STR_counter:
                # print("row in database ", row[STR], "STR in STR_counter", STR_counter[STR])
                if int(row[STR]) == int(STR_counter[STR]):
                    count += 1
            if count == len(STR_counter):
                print(row['name'])
                exit(0)
        else:
            print("No match")

# find non-overlapping occurrences of STR in DNA-sequence
def countSTR(STR, sequence):
    count = 0
    maxcount = 0
    # in order to match repeat STR. for example: "('AATG')+" as pattern
    # into re.compile() to match repeat STR
    # rewrite STR to "(STR)+"
    STR = "(" + STR + ")+"
    pattern = re.compile(r'STR')
    # matches should be an iterator object
    matches = pattern.finditer(sequence)
    # go through every repeat and find the longest one
    # by match.end() - match.start()
    for match in matches:
        count = match.end() - match.start()
        if count > maxcount:
            maxcount = count
    # return repeat times of the longest repeat
    return maxcount/len(STR)

main()

[Question discussion]:

Tags: python cs50

[Solution 1]:

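I just found a way to get the desired result. Posting it here in case anyone else is confused by the same thing. As far as I understand, to match the contents of a variable named var_pattern you can use re.compile(rf'{var_pattern}'). To search for consecutive occurrences of var_pattern, you can then use re.compile(rf'({var_pattern})+'); the braces are required, because without them the regex matches the literal text var_pattern. There may be smarter ways to do this, but this got it working for me.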
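
A minimal sketch of how the function from the question might look with that change. The names count_str and motif are illustrative, and the use of re.escape and a non-capturing group are my own additions rather than something from the original code. Two things differ from the question's version: the variable is interpolated into the pattern instead of compiling the literal r'STR', and the result is divided by the length of the motif itself. In the question, STR had already been rewritten to "(STR)+" before len(STR) was taken, which makes the divisor three characters too long.

```python
import re

def count_str(motif, sequence):
    """Return the longest run of consecutive repeats of motif in sequence."""
    # re.escape is a precaution (an assumption beyond the original post);
    # for plain DNA letters it leaves the motif unchanged.
    pattern = re.compile(rf"(?:{re.escape(motif)})+")

    longest = 0  # length in characters of the longest run seen so far
    for match in pattern.finditer(sequence):
        longest = max(longest, match.end() - match.start())

    # Divide by the length of the motif, not of the pattern string.
    return longest // len(motif)

print(count_str("AATG", "ATAATGAATGAATGGAATG"))
```

One caveat: finditer returns non-overlapping matches scanned from left to right, so a motif that can overlap with itself (for example ATA in ATATAATA) could in principle be under-counted by this approach.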

[Discussion]: