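[Posted]: 2021-01-20 06:01:12
[Question]:
I am writing code to solve the DNA problem from week 6 of CS50. When I run it on the large.csv database and its sequences, it takes at least a minute to produce output; on small.csv the output is instant. Because of this I cannot pass check50. I suspect the bottleneck is the function that computes the longest run of consecutive repeats of each STR, but I don't know how to write it more efficiently. The full problem description is here: https://cs50.harvard.edu/x/2021/psets/6/dna/#:~:text=check50%20cs50/problems/2021/x/dna
Here are the source files for the databases and sequences: https://cdn.cs50.net/2019/fall/psets/6/dna/
Here is my code: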
import csv
import sys


def main():
    # check a proper input
    if len(sys.argv) != 3:
        sys.exit("Usage: python dna.py data.csv sequence.txt")
    # create a list for all data
    data_all = []
    # create a list for all STRs
    STR_all = []
    # write data to list
    with open(sys.argv[1]) as data:
        reader = csv.DictReader(data)
        for row in reader:
            row["name"]
            data_all.append(row)
    # write header to a list
    with open(sys.argv[1]) as data:
        reader = csv.reader(data)
        headings = next(reader)
        STR_all.append(headings)
    # delete "name" from header, it is on the first position
    STR_all = STR_all[0]
    STR_all.pop(0)
    # create a string with DNA sequence
    with open(sys.argv[2]) as seq:
        line = seq.read()
    # create a list with max number of repeating STR from a line(DNA)
    max_seq = []
    # enter data with string of STR and its max repeating time
    for i in range(len(STR_all)):
        result = f"{compare(STR_all[i], line)}"
        max_seq.append(result)
    # create a dictionary with a list of all STRs and according number of repeating sequences
    STR_with_max_seq = dict(zip(STR_all, max_seq))
    # compare values from data_all and STR_with_max_seq
    for i in range(len(data_all)):
        # delete name key and store key in variable "name"
        name = data_all[i].pop('name')
        if data_all[i] == STR_with_max_seq:
            print(name)
            sys.exit()
            break
        else:
            continue
    # Print if no match found
    print("No match")
    # variables that I used to check on different stages of writing a program
    # print(data_all)
    # print(line)
    # print(STR_all)
    # print(max_seq)
    # print(STR_with_max_seq)
    # print(len(data_all))
    # print(name)


def compare(STR, DNA):
    for key in DNA:
        l = len(STR)
        tmp_max = 0
        tmp = 0
        # iteration through the whole length of DNA
        for i in range(len(DNA)):
            if tmp > 0:
                tmp = 0
            # enters if sequences are equal
            if DNA[i: i + l] == STR:
                tmp += 1
                # increments tmp if its sequence repeats
                while DNA[i - l: i] == DNA[i: i + l]:
                    tmp += 1
                    i += l
            # update the max found number of repeating sequences
            if tmp > tmp_max:
                tmp_max = tmp
    return tmp_max


main()
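Update: I used time.monotonic() to measure the total execution time of the code in main(). Here is the timing for small.csv:
- Value of the monotonic clock (in seconds): 661689.405232647
- Elapsed time of the run (in seconds): 0.02439890895038843
And this is for large.csv:
- Value of the monotonic clock (in seconds): 661943.13288005
- Elapsed time of the run (in seconds): 108.33000503003132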
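One likely culprit, judging only from the code as posted, is the outer `for key in DNA:` loop in compare(): `key` is never used, so the full scan of the sequence is repeated once for every character of the DNA string. A minimal sketch of a single-pass alternative is shown here. The name `longest_run` is hypothetical (it is not part of the CS50 distribution code), and the sketch assumes the STR is a plain substring to be matched exactly.

```python
def longest_run(STR, DNA):
    """Return the largest number of back-to-back copies of STR found in DNA."""
    l = len(STR)
    if l == 0:
        # assumption: an empty STR has no meaningful run, so report 0
        return 0
    best = 0
    # visit every start position exactly once (no outer loop over DNA)
    for i in range(len(DNA)):
        count = 0
        # count consecutive copies of STR beginning at position i
        while DNA[i + count * l: i + (count + 1) * l] == STR:
            count += 1
        if count > best:
            best = count
    return best


if __name__ == "__main__":
    # "AGATC" appears twice in a row at the start, then once more later
    print(longest_run("AGATC", "AGATCAGATCTTAGATC"))
```

Because main() compares the counts against the strings read by csv.DictReader, the result would still need to be converted to a string there, as the posted code already does with an f-string.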
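[Comments]:
-
Well, you can put a timer on each function and print the result to find out what is taking the time. It is mostly the file reading, but check everything anyway.
-
@TấnNguyên Thanks for your reply! I measured the total time for small.csv and large.csv with time.monotonic(). If I understand correctly, it does not differ significantly in terms of the program's total run.
-
Yes, so now you can use the timer to check the algorithm as well. You may discover something.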
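The per-function timing suggested in the comments could be sketched as follows. The helper name `timed` and the stand-in function `count_char` are hypothetical, introduced only for illustration; `time.monotonic()` is the same clock the question already uses.

```python
import time


def timed(label, func, *args):
    """Call func(*args), print the elapsed time, and return func's result."""
    start = time.monotonic()
    result = func(*args)
    elapsed = time.monotonic() - start
    print(f"{label}: {elapsed:.6f} s")
    return result


def count_char(text, ch):
    # hypothetical stand-in for a function worth measuring
    return sum(1 for c in text if c == ch)


if __name__ == "__main__":
    # time one call and keep its return value
    n = timed("count_char", count_char, "AGATC" * 1000, "A")
    print(n)
```

In the posted program, the call `compare(STR_all[i], line)` could be wrapped as `timed(STR_all[i], compare, STR_all[i], line)` to see how long each STR scan takes compared with reading the files.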