【Question title】: How to execute a function once something happens until something else happens
【Posted】: 2021-11-08 00:58:13
【Question description】:

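I need to create a program that takes a file containing DNA and translates its open reading frames into protein data. Once "ATG" appears, I need to run the function until one of the stop codons "TAG", "TAA" or "TGA" appears.

I am new to programming, and this is what I have so far: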

map = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',                
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
    }
DNA = 'AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG'
DNAlist = []
DNAlist1 = []
DNAlist2 = []
protein = []

for i in range(0, len(DNA), 3):
    DNAlist.append(DNA[i:i+3])
for i in range(1, len(DNA), 3):
    DNAlist1.append(DNA[i:i+3])
for i in range(2, len(DNA), 3):
    DNAlist2.append(DNA[i:i+3])

while True:
        if elements in DNAlist2 == 'TAG' or 'TAA' or 'TGA':
            False
        else:
            protein = ''.join([map[elements] for elements in DNAlist2])

A sample output would be 
MLLGSFRLIPKETLIQVAGSSPCNLS
M
MGMTPRLGLESLLE
MTPRLGLESLLE

【Question comments】:

Tags: python loops bioinformatics biopython


【Solution 1】:

Nice attempt without using Biopython. I worked only on the forward strand (no reverse strand) and used the translated sequence.
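One detail in the question's loop is worth spelling out before the longer scripts. A condition of the form `x == 'TAG' or 'TAA' or 'TGA'` is parsed as `(x == 'TAG') or 'TAA' or 'TGA'`, and a non-empty string is always truthy, so the test never fails. A membership test against a set of stop codons is the usual way around this. The names `STOP_CODONS` and `codon` below are just placeholders for this small sketch:

```python
# Hypothetical names, used only to illustrate the stop-codon test.
STOP_CODONS = {"TAA", "TAG", "TGA"}

codon = "GGA"  # glycine, not a stop codon

# Chained `or`: the comparison is False, so the expression falls through
# to the first non-empty string, which counts as true in an `if`.
always_truthy = codon == "TAG" or "TAA" or "TGA"

# Membership test: true only for a real stop codon.
is_stop = codon in STOP_CODONS

print(repr(always_truthy), is_stop)
```
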

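I found two ways. I am sure they are not the best ones, and I am waiting here for someone with a better answer. I believe How to find a open reading frame in Python is the fastest approach.

The first one reports an ORF even when there is no stop codon (the sequence never terminates, so there is no '_' marking a stop codon):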

mappy = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',                
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
    }



# for i in mappy:
#     print(mappy[i])

DNA = 'AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG'
DNAlist1 = []
DNAlist2 = []
DNAlist3 = []
# protein = []


def revProt(dna_list):
    proteinz = []
    for elements in dna_list:
        if len(elements) == 3:
            proteinz.append(mappy[elements])
    
            
    # proteinz = ''.join(proteinz)
    return ''.join([ i for i in reversed(proteinz)])

for i in range(0, len(DNA), 3):
    DNAlist1.append(DNA[i:i+3])
for i in range(1, len(DNA), 3):
    DNAlist2.append(DNA[i:i+3])
for i in range(2, len(DNA), 3):
    DNAlist3.append(DNA[i:i+3])
    
# for i in [DNAlist1] : #, DNAlist2, DNAlist3]:

for i in [DNAlist1, DNAlist2, DNAlist3]:
    
    protein = revProt(i)
    
    print(''.join(protein), type(''.join(protein)))
        
    seqs = []
    j = 0
    orf = []
    while True:
            if j <= len(protein)-1:
                if protein[j] == '_' :
                    if orf and orf[0] == 'M':  # guard: orf is empty if a stop is seen before anything else
                        orf.append('_')
                        seqs.append(''.join([i for i in reversed(orf)]))
                        orf = []
                    else :
                        orf = []
                        orf.append('_')
                if protein[j] not in [ '_' , 'M'] :
                    orf.append(protein[j])
                if protein[j] == 'M':
                    orf.append(protein[j])
                    seqs.append(''.join([i for i in reversed(orf)]))
                   
            else : 
                break
            j += 1
            
    
    print(seqs, '\n')
    

Output:

QSAVRIM_A_ELLSELGLRPTMGMYGSNAVHS <class 'str'>
['MIRVASQ', 'MTPRLGLESLLE_', 'MGMTPRLGLESLLE_']   -----> the first sequence runs to the end of the DNA, so it has no stop codon

LH_ES_EPKNWFLS_DLDRP_GWTVQTL_MA <class 'str'>
['M_'] 

SISSPDNLSIGFSVRIWTAPDDGHLRL_SCP <class 'str'>
[] 
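For comparison, the "start at ATG, run until a stop codon" idea from the question can also be sketched as a small scan over the codons of one reading frame, without reversing the translated protein. This is only a sketch under my own assumptions: `scan_frame` and `STOP_CODONS` are names I made up, it covers the forward strand only, it reports just the outermost ATG for each stop (so the nested `MTPRLGLESLLE` is not listed separately), and it drops an ORF that reaches the end of the DNA without a stop.

```python
DNA = 'AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG'

# Hypothetical helper names, not taken from any library.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def scan_frame(dna, frame):
    """Yield (start, end) nucleotide spans of ORFs in one forward frame.

    `start` is the index of the ATG, `end` is the index of the stop codon,
    so dna[start:end] is the coding part without the stop.
    """
    start = None  # None means "not inside an ORF right now"
    for pos in range(frame, len(dna) - 2, 3):
        codon = dna[pos:pos + 3]
        if start is None and codon == "ATG":
            start = pos            # something happened: begin collecting
        elif start is not None and codon in STOP_CODONS:
            yield start, pos       # something else happened: emit and reset
            start = None


spans = {frame: list(scan_frame(DNA, frame)) for frame in range(3)}
for frame, found in spans.items():
    print(frame, [DNA[s:e] for s, e in found])
```

Each DNA slice can then be split into codons and passed through `mappy` exactly as in the scripts above.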

The second way is more cumbersome:

import itertools

mappy = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',                
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
    }

DNA = 'AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG'


DNAlist1 = []
DNAlist2 = []
DNAlist3 = []

def Prot(dna_list):
    proteinz = []
    for elements in dna_list:
        if len(elements) == 3:
            proteinz.append(mappy[elements])
      
    # proteinz = ''.join(proteinz)
    return proteinz


def Met(protein):
    
    met = [i for i, x in enumerate(protein) if x == "M"] 
    return met
        
def Stop(protein):
    
    stop = [i for i, x in enumerate(protein) if x == "_"]   
    return stop
         
    


for i in range(0, len(DNA), 3):
    DNAlist1.append(DNA[i:i+3])
for i in range(1, len(DNA), 3):
    DNAlist2.append(DNA[i:i+3])
for i in range(2, len(DNA), 3):
    DNAlist3.append(DNA[i:i+3])
    
for i in [DNAlist1, DNAlist2, DNAlist3]:

    
    protein = Prot(i)
    
    print(''.join(protein), type(''.join(protein)))
    
    
    met = Met(protein)
    
    # print('met : ', met)
    
    stop = Stop(protein)
    
    # print('stop : ' , stop)
    
    # print('------------------')
    
    orf = [i for i in list(itertools.product(met, stop)) if i[0] < i[1]]
    
    print(orf)
    
    
    orf_p = [''.join(protein[j[0]:j[1]]) for j in orf]
    
    orf_pp = [i for i in orf_p]
    
    
    for y in orf_p:
        
        # print(y, type(y))
        if '_' in y:
            # print('ok')
            orf_pp.remove(y)
    
    print('orf_pp : ',orf_pp)
    
 
    print('______________')

Output:

SHVANSGYMGMTPRLGLESLLE_A_MIRVASQ <class 'str'>
[(8, 22), (8, 24), (10, 22), (10, 24)]
orf_pp :  ['MGMTPRLGLESLLE', 'MTPRLGLESLLE']  -----> these are the ORF sequences
______________
AM_LTQVTWG_PRDLD_SLFWNKPE_SE_HL <class 'str'>
[(1, 2), (1, 10), (1, 16), (1, 25), (1, 28)]
orf_pp :  ['M']
______________
PCS_LRLHGDDPATWIRVSFGISLNDPSSIS <class 'str'>
[]
orf_pp :  []
______________
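The product-then-filter step above builds every (M, stop) pair and then throws away the ones that contain a stop in the middle. A possibly simpler variant, sketched here under my own naming (`orfs_from_protein` is hypothetical), pairs each M directly with the first `_` that follows it using `str.find`. The three protein strings are copied from the output above.

```python
def orfs_from_protein(protein):
    """Return every M-to-first-stop stretch of a translated frame (stop excluded)."""
    found = []
    for start, aa in enumerate(protein):
        if aa != "M":
            continue
        stop = protein.find("_", start)  # first stop at or after this M, -1 if none
        if stop != -1:
            found.append(protein[start:stop])
    return found


# Translated frames as printed by the script above.
frames = [
    "SHVANSGYMGMTPRLGLESLLE_A_MIRVASQ",
    "AM_LTQVTWG_PRDLD_SLFWNKPE_SE_HL",
    "PCS_LRLHGDDPATWIRVSFGISLNDPSSIS",
]
results = [orfs_from_protein(p) for p in frames]
for r in results:
    print(r)
```

Like the script above, this skips an M that has no stop after it, so the trailing `MIRVASQ` is not reported.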

A shorter version (copied from How to find a open reading frame in Python) that is probably faster:

import re



mappy = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',                
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
    }



# for i in mappy:
#     print(mappy[i])

DNA = 'AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG'


def Prot(dna_list):
    proteinz = []
    for elements in dna_list:
        if len(elements) == 3:
            proteinz.append(mappy[elements])

    return proteinz
    


pattern = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))')

def revcomp(dna_seq):
    return dna_seq[::-1].translate(str.maketrans("ATGC","TACG"))

def orfs(dna):
    return set(pattern.findall(dna) + pattern.findall(revcomp(dna)))



    
for j in orfs(DNA):
        # print(j, type(j))
        DNAlistz = []
        for i in range(0, len(j), 3):
            DNAlistz.append(j[i:i+3])
        print(''.join(Prot(DNAlistz)))
    
    
print('+++++++++++++')

Output, this time including the reverse-strand translations:

MGMTPRLGLESLLE
MTPRLGLESLLE
M
MLLGSFRLIPKETLIQVAGSSPCNLS
+++++++++++++
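Because the whole pattern sits inside a lookahead, each match is zero-width and the regex engine retries at every position, which is how overlapping ORFs such as `MGMTPRLGLESLLE` and `MTPRLGLESLLE` are both found. If the start position and reading frame are also wanted, `finditer` exposes them through the captured group. The sketch below uses the same pattern on the forward strand only; `forward_orfs` and `ORF_PATTERN` are names I chose for illustration.

```python
import re

DNA = 'AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG'

# Same pattern as above: ATG, then as few whole codons as possible,
# up to (but not including) the first in-frame stop codon.
ORF_PATTERN = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))')


def forward_orfs(dna):
    """Return (start, frame, orf_dna) for every ORF on the forward strand."""
    return [
        (m.start(1), m.start(1) % 3, m.group(1))
        for m in ORF_PATTERN.finditer(dna)
    ]


hits = forward_orfs(DNA)
for start, frame, orf_dna in hits:
    print(start, frame, orf_dna)
```

The ATG at index 75 is not reported because no in-frame stop codon follows it before the end of the sequence.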

【Comments】:
