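[Posted]: 2021-08-04 12:07:52
[Question]:
I'm working on an exercise (cs50 - DNA) in which I have to count specific consecutive substrings (STRs) in a simulated DNA sequence. I've overcomplicated my code and I'm having trouble figuring out how to proceed.
I have a list of substrings: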
strs = ['AGATC', 'AATG', 'TATC']
And a string containing a random sequence of letters:
AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
For each of the strs, I want to count its longest run of consecutive (back-to-back) repeats in the sequence.
So:
- 'AGATC' - AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAAGGAGGGATAGAAGG
- 'AATG' - AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAAGGAGGGATAGAAGG
- 'TATC' - AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAAGGAGGGATAGAAGG
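This should result in [4, 1, 5]
(Note that this isn't the best example, since there are no stray repeats of the patterns elsewhere in the sequence, but I think it illustrates what I'm looking for.)
I know I should be using something like re.match(rf"({strs}){2,}", string), because str.count(strs) counts every occurrence, consecutive or not.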
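For reference, here is a minimal sketch of the regex route, assuming the goal is the longest back-to-back run of each STR. Two things to watch for in the pattern quoted above: inside an f-string, `{2,}` is evaluated as a Python expression instead of being left as a regex quantifier (it would have to be written `{{2,}}`), and `re.match` only matches at the very start of the string. The helper name `longest_run_regex` is made up for this illustration.

```python
import re


def longest_run_regex(sequence: str, unit: str) -> int:
    """Return the largest number of back-to-back copies of `unit` in `sequence`."""
    if not unit:
        return 0
    # re.escape guards against regex metacharacters in `unit`.
    # The lookahead (?=...) lets a match start at every position, so a run
    # that overlaps an earlier, shorter match is not skipped.
    pattern = re.compile(rf"(?=((?:{re.escape(unit)})+))")
    return max(
        (len(m.group(1)) // len(unit) for m in pattern.finditer(sequence)),
        default=0,  # no match at all means a run length of 0
    )


strs = ["AGATC", "AATG", "TATC"]
sequence = "AAGGTAAGTTTAGAATATAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG"

print([longest_run_regex(sequence, s) for s in strs])  # [4, 1, 5]
```

The division by `len(unit)` converts the length of each matched run into a repeat count, and `default=0` covers STRs that never appear.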
My code so far:
#!/usr/bin/env python3
import csv
import sys
from cs50 import get_string
# sys.exit to terminate the program
# sys.exit(2) UNIX default for wrong args
if len(sys.argv) != 3:
    print("Usage: python dna.py data.csv sequence.txt")
    sys.exit(2)
# open file, make it into a list, get STRS, remove header
with open(sys.argv[1], "r") as database:
    data = list(csv.reader(database))
    STRS = data[0]
    data.pop(0)
    # remove "name" so only thing remaining are STRs
    STRS.pop(0)
# open file to compare against db
with open(sys.argv[2], "r") as seq:
    sequence = seq.read()
sequenceCount = []
# for each STR count the occurrences
# sequence.count(s) returns all occurrences, not just consecutive ones
for s in STRS:
    sequenceCount.append(sequence.count(s))
print(STRS)
print(sequenceCount)
"""
sequenceCount = {}
# for each STR count the occurrences
for s in STRS:
    sequenceCount[s] = sequence.count(s)
for line in data:
    print(line)
    for item in line[1:]:
        continue
# rf"({STRS}){2,}"
"""
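As an alternative to a regex, the `sequence.count(s)` call in the loop above could be swapped for a plain scan that only counts back-to-back copies. This is just a sketch under that assumption, not the official cs50 solution; `longest_run` is a hypothetical helper name and the short sequence below is made up for the demonstration.

```python
def longest_run(sequence: str, unit: str) -> int:
    """Count the longest run of back-to-back copies of `unit`, without regex."""
    if not unit:
        return 0
    best = 0
    step = len(unit)
    for start in range(len(sequence)):
        run = 0
        pos = start
        # Walk forward one unit at a time while the copies keep matching.
        while sequence.startswith(unit, pos):
            run += 1
            pos += step
        best = max(best, run)
    return best


# Made-up data for illustration only (not from the cs50 files).
STRS = ["AGATC", "AATG", "TATC"]
sequence = "AGATCAGATCTTAATGTATCTATCTATCAGATC"

sequenceCount = {s: longest_run(sequence, s) for s in STRS}
print(sequenceCount)            # {'AGATC': 2, 'AATG': 1, 'TATC': 3}
print(sequence.count("AGATC"))  # 3, because the isolated copy at the end is included
```

In this made-up sequence, `str.count` reports 3 for 'AGATC' while the longest consecutive run is only 2, which is exactly the difference described in the question.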
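[Comments]:
- @MiguelP So am I right that you only want to find "consecutive" (back-to-back) matches of the sequence, not one or two copies that occur "in isolation" within the DNA strand?
- @OmarAlSuwaidi Yes!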
Tags: python