Python code takes too long to run in DNA problem from CS50x

[Question title]: Python code takes too long to run in DNA problem from CS50x
[Posted]: 2021-01-20 06:01:12
[Question]:

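I wrote code to solve the DNA problem from Week 6 of CS50. However, when I run it on the large.csv database and a sequence, it takes at least a minute to produce output. On small.csv it produces output immediately. Because of this I can't pass check50. I suspect the problem is in the function that computes the maximum number of consecutive repeats of each STR, but I don't know how to write it more efficiently. The full description of the problem is here: https://cs50.harvard.edu/x/2021/psets/6/dna/#:~:text=check50%20cs50/problems/2021/x/dna

Here are the source files for the databases and sequences: https://cdn.cs50.net/2019/fall/psets/6/dna/

Here is my code: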

import csv
import sys


def main():
    
    # check a proper input
    if len(sys.argv) != 3:
        sys.exit("Usage: python dna.py data.csv sequence.txt")
    
    # create a list for all data
    data_all = []
    
    # create a list for all STRs
    STR_all = []
    
    # write data to list
    with(open(sys.argv[1])) as data:
        reader = csv.DictReader(data)
        for row in reader:
            row["name"]
            data_all.append(row)
            
    # write header to a list 
    with(open(sys.argv[1])) as data:      
        reader = csv.reader(data)
        headings = next(reader)
        STR_all.append(headings)
    
    # delete "name" from header, it is on the first position    
    STR_all = STR_all[0]
    STR_all.pop(0)
            
    # create a string with DNA sequence
    with(open(sys.argv[2])) as seq:
        line = seq.read()
    
    # create a list with max number of repeating STR from a line(DNA)
    max_seq = []
    
    # enter data with string of STR and it's max repeating time    
    for i in range(len(STR_all)):
        result = f"{compare(STR_all[i], line)}"
        max_seq.append(result)
        
    # create a dictionary with a list of all STRs and according number of repeating sequences
    STR_with_max_seq = dict(zip(STR_all, max_seq))
    
    # compare values from data_all and STR_with_max_seq
    for i in range(len(data_all)):
        # delete name key and store key in variable "name"
        name = data_all[i].pop('name')
        if data_all[i] == STR_with_max_seq:
            print(name)
            sys.exit()
            break
        else:
            continue
        
    # Print if no match found
    print("No match")
        
    # variables that I used to check on different stages of writing a program
            
    # print(data_all)
    # print(line)
    # print(STR_all)
    # print(max_seq)
    # print(STR_with_max_seq)
    
    # print(len(data_all))
    # print(name)

    
def compare(STR, DNA):

    for key in DNA:
        l = len(STR)
        tmp_max = 0
        tmp = 0
        
        # iteration through the whole length of DNA
        for i in range(len(DNA)):
            if tmp > 0:
                tmp = 0
            
            # enters if sequences are equal
            if DNA[i: i + l] == STR:
                tmp += 1
                # increments tmp if its sequence repeats
                while DNA[i - l: i] == DNA[i: i + l]:
                    tmp += 1
                    i += l
                # update the max found number of repeating sequences    
                if tmp > tmp_max:
                    tmp_max = tmp
    
    return tmp_max

    
main()

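Update: I used time.monotonic() to measure the total execution time of main(). Here is the result for small.csv:

Value of the monotonic clock (in seconds): 661689.405232647
Elapsed time during the process: 0.02439890895038843

And this is for large.csv:

Value of the monotonic clock (in seconds): 661943.13288005
Elapsed time during the process: 108.33000503003132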

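[Comments]:

Well, you could put a timer on each function and print the results to find out what is taking the time. It is mainly down to reading the files, but check everything anyway.
@TấnNguyên Thanks for the reply! I checked the total time for small.csv and large.csv with time.monotonic(). If I understand correctly, there is no significant difference in terms of the total run of the program.
Yes, so now you can use a timer to check the algorithm as well. You may discover something.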
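The per-function timing suggested in these comments could look roughly like the sketch below. The helper name timed is hypothetical and not part of the original code; it uses time.perf_counter(), a standard-library clock intended for measuring short durations.

```python
import time


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Example with a stand-in workload; in dna.py this could wrap compare(STR, DNA)
total, seconds = timed(sum, range(1_000_000))
print(f"sum took {seconds:.4f} s")
```

Timing each stage separately (reading the CSV, reading the sequence, each compare call) shows which one dominates, rather than only the total for main().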

Tags: python string split cs50

[Solution 1]:

I know this problem. A few parts of your code are slowing it down.

First, let's try to read each file only once. For example:

with open(sys.argv[1]) as data:
    reader = csv.DictReader(data)
    # the header row is available directly from the reader
    STR_all = reader.fieldnames
    for row in reader:
        data_all.append(row)

reader.fieldnames is already a flat list of column names (with "name" still in first position), so you can delete this line:

STR_all = STR_all[0]
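As a self-contained illustration of the single-pass read, here is a minimal sketch. The function name load_database is hypothetical, and it assumes, as in the CS50 files, that "name" is the first column of the CSV.

```python
import csv


def load_database(path):
    """Read the CSV once and return (list of STR names, list of row dicts)."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        # Slice off the leading "name" column. Slicing makes a copy, so
        # reader.fieldnames itself is left untouched while rows are read.
        strs = list(reader.fieldnames[1:])
        rows = list(reader)
    return strs, rows
```

Calling strs, rows = load_database(sys.argv[1]) would then replace both with blocks that open the database file in the original main().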

You can build the dictionary of counts while comparing, which avoids iterating twice.

For example, by doing this:

    # map each STR to its longest run, stored as a string to match the CSV values
    STR_with_max_seq = {}
    for key in STR_all:
        STR_with_max_seq[key] = str(compare(key, line))

you can delete this:

    # create a dictionary with a list of all STRs and according number of repeating sequences
    STR_with_max_seq = dict(zip(STR_all, max_seq))
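The matching step can also be written without removing keys from the rows. The sketch below is one possible shape, not the original author's code: find_match is a hypothetical helper, and it assumes the counts are kept as integers and the CSV values are converted at comparison time.

```python
def find_match(rows, counts):
    """Return the name of the first row whose STR counts all equal `counts`, or None."""
    for row in rows:
        # csv.DictReader yields strings, so convert before comparing with the int counts
        if all(int(row[key]) == n for key, n in counts.items()):
            return row["name"]
    return None


# Hypothetical data in the shape csv.DictReader would produce
rows = [
    {"name": "Alice", "AGATC": "2", "AATG": "8"},
    {"name": "Bob", "AGATC": "4", "AATG": "1"},
]
print(find_match(rows, {"AGATC": 4, "AATG": 1}))  # Bob
print(find_match(rows, {"AGATC": 9, "AATG": 9}))  # None
```

In main(), counts could be built with a dict comprehension such as {key: compare(key, line) for key in STR_all}, and a result of None would print "No match".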

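Finally, to improve the compare function, you can drop the outer loop. The line "for key in DNA" repeats the entire scan once per character of the sequence without ever using key, so the work grows quadratically with the length of the DNA. What you want is the maximum number of times the STR occurs consecutively in the DNA, so you only need to walk through the DNA once with a window the length of the STR and compare. For example: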

def compare(STR, DNA):
    l = len(STR)
    tmp_max = 0
    tmp = 0
    i = 0

    # slide a window of length l across the DNA
    while i <= len(DNA) - l:  # <= so that the final window is checked too
        if DNA[i:i + l] == STR:
            # match: extend the current run and jump past this repeat
            tmp += 1
            i += l
            # update here so a run that reaches the end of the DNA is counted
            if tmp > tmp_max:
                tmp_max = tmp
        else:
            # mismatch: the current run (if any) is over
            tmp = 0
            i += 1

    return tmp_max
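One caveat with a jump-ahead scan like the one above: if an STR can overlap itself (for example "ABA"), jumping past a match can skip the start of a longer run. The sketch below is an alternative, not the answer's method; it records, for every position, how many consecutive repeats start there, working from right to left. The name longest_run is hypothetical.

```python
def longest_run(dna, unit):
    """Return the largest number of back-to-back copies of `unit` found in `dna`."""
    n, l = len(dna), len(unit)
    if l == 0:
        return 0
    # runs[i] = number of consecutive copies of unit that start at index i
    runs = [0] * (n + 1)
    best = 0
    for i in range(n - l, -1, -1):
        if dna.startswith(unit, i):
            # a copy here extends whatever run begins right after it
            runs[i] = runs[i + l] + 1
            best = max(best, runs[i])
    return best


print(longest_run("TTAGATCAGATCAGATCGG", "AGATC"))  # 3
print(longest_run("ABABAABAABA", "ABA"))            # 3
```

Each position is examined once, so the cost is roughly the length of the DNA times the length of the STR, at the price of one extra list of integers.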

[Comments]: