【Posted】:2020-06-23 16:18:23
【Question】:
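I have been stuck on this for a long time and would appreciate some hints.
The problem boils down to finding the maximum number of consecutive occurrences of a pattern in a string. For the pattern AATG and a string like ATAATGAATGAATGGAATG, the correct result should be 3. I tried using re.compile() to count occurrences of the pattern. From the documentation I learned that to find consecutive occurrences I probably have to use the special character +. For example, for a pattern like AATG I have to use re.compile(r'(AATG)+') rather than re.compile(r'AATG'); otherwise the occurrences are overcounted. In this program, however, the pattern is not a fixed string; I treat it as a variable. I tried many ways of putting it into re.compile() but never got the right result. Can anyone tell me the correct way to format it (in the function def countSTR below)?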
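For reference, one possible way to build the repeated pattern from a variable is sketched below. The helper name build_repeat_pattern is made up for illustration; re.escape and the non-capturing group (?:...) are standard parts of Python's re module. Note that re.compile(r'STR') compiles the three literal characters S, T, R rather than the contents of a variable named STR.

```python
import re


def build_repeat_pattern(unit):
    # Hypothetical helper: wrap the unit in a non-capturing group and repeat it.
    # re.escape protects against regex metacharacters appearing in the unit.
    return re.compile(f"(?:{re.escape(unit)})+")


pattern = build_repeat_pattern("AATG")

# Each match is one maximal run of back-to-back copies of the unit.
runs = [m.group(0) for m in pattern.finditer("ATAATGAATGAATGGAATG")]
print(pattern.pattern)  # the regex source that was actually compiled
print(runs)
```

With the example string from the question, the first run holds three copies of AATG and the second holds one.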
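After that, I believe finditer(the_string_to_be_analysis) should return an iterator over all the matches found. I then use match.end() - match.start() to get the length of each match and compare them to find the longest consecutive run of the pattern. Maybe something is wrong there?
The code is attached. Any input is greatly appreciated!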
from sys import argv, exit
import csv
import re

def main():
    if len(argv) != 3:
        print("Usage: python dna.py data.csv sequence.txt")
        exit(1)
    # read DNA sequence
    with open(argv[2], "r") as file:
        if file.mode != 'r':
            print(f"database {argv[2]} can not be read")
            exit(1)
        sequence = file.read()
    # read database.csv
    with open(argv[1], newline='') as file:
        if file.mode != 'r':
            print(f"database {argv[1]} can not be read")
            exit(1)
        # get the heading of the csv file in order to obtain STRs
        csv_reader = csv.reader(file)
        headings = next(csv_reader)
        # dictionary to store STRs match result of DNA-sequence
        STR_counter = {}
        for STR in headings[1::]:
            # entry result accounting to the STR keys
            STR_counter[STR] = countSTR(STR, sequence)
    # read csv file as a dictionary
    with open(argv[1], newline='') as file:
        database = csv.DictReader(file)
        for row in database:
            count = 0
            for STR in STR_counter:
                # print("row in database ", row[STR], "STR in STR_counter", STR_counter[STR])
                if int(row[STR]) == int(STR_counter[STR]):
                    count += 1
            if count == len(STR_counter):
                print(row['name'])
                exit(0)
        else:
            print("No match")

# find non-overlapping occurrences of STR in DNA-sequence
def countSTR(STR, sequence):
    count = 0
    maxcount = 0
    # in order to match a repeated STR, for example "(AATG)+" as the pattern
    # passed into re.compile()
    # rewrite STR to "(STR)+"
    STR = "(" + STR + ")+"
    pattern = re.compile(r'STR')
    # matches should be an iterator object
    matches = pattern.finditer(sequence)
    # go through every repeat and find the longest one
    # by match.end() - match.start()
    for match in matches:
        count = match.end() - match.start()
        if count > maxcount:
            maxcount = count
    # return repeat times of the longest repeat
    return maxcount/len(STR)

main()
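Two things in countSTR look likely to cause the wrong result: re.compile(r'STR') compiles the literal text STR instead of the variable, and len(STR) is taken after STR has been rewritten to "(AATG)+", so the division uses 7 rather than 4. A minimal sketch of a corrected version follows; the name longest_run is hypothetical, and it assumes the repeat unit is a plain string to be matched literally.

```python
import re


def longest_run(unit, sequence):
    # Hypothetical replacement for countSTR: returns how many times `unit`
    # repeats back to back in the longest such run inside `sequence`.
    if not unit:
        return 0
    # Build the regex from the variable itself; keep `unit` unchanged so that
    # len(unit) below is still the length of a single repeat.
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    # Length in characters of the longest run, or 0 if nothing matched.
    longest = max(
        (m.end() - m.start() for m in pattern.finditer(sequence)),
        default=0,
    )
    # Integer division gives a whole repeat count instead of a float.
    return longest // len(unit)


print(longest_run("AATG", "ATAATGAATGAATGGAATG"))
```

Because the function returns an int, the comparison in main() could use the value directly without the extra int() conversion.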
【Comments】: