【问题标题】:python - Selectively choosing nucleotide sequences from a fasta file?python - 从fasta文件中选择性地选择核苷酸序列?
【发布时间】:2014-12-27 23:14:57
【问题描述】:

如果基因名称存储在文本文件中,我如何使用 biopython 从 fasta 文件中截取我感兴趣的基因?

#extract genes                
f1 = open('ortholog1.txt','r')
f2 = open('all.fasta','r')
f3 = open('ortholog1.fasta','w')

genes = [line.rstrip('\n') for line in f1.readlines()]


i=0
for seq_record in SeqIO.parse(f2, "fasta"):
    if genes[i] == seq_record.id:              
        print genes[i]
        f3.write('>'+genes[i])
        i=i+1
        if i==18:
            break
        f3.write('\n')
        f3.write(str(seq_record.seq))
        f3.write('\n')


f2.close()
f3.close()

我正在尝试上面的代码。但是它有一些错误并且不是通用的,因为像ortholog1.txt(包含基因名称)还有5个类似的文件。每个文件中的基因数量也各不相同(不是像这里那样总是 18 个)。这里all.fasta 是包含所有基因的文件。 ortholog1.fasta 必须包含剪断的核苷酸序列。

【问题讨论】:

    标签: python bioinformatics biopython


    【解决方案1】:

    基本上,您可以让 Biopython 完成所有工作。

    我猜测“ortholog1.txt”中的基因名称与fasta文件中的基因名称完全相同,并且每一行都有一个基因名称。如果没有,您需要根据需要调整它们以使它们对齐。

    from Bio import SeqIO
    
    with open('ortholog1.txt','r') as f:
        orthologs_txt = f.read()
    orthologs = orthologs_txt.splitlines()
    
    genes_to_keep = []
    for record in SeqIO.parse(open('all.fasta','r'), 'fasta'):
        if record.description in orthologs:
            genes_to_keep.append(record)
    
    with open('ortholog1.fasta','w') as f:
        SeqIO.write(genes_to_keep, f, 'fasta')
    

    编辑:这是使输出基因保持与直向同源文件中相同顺序的一种方法:

    from Bio import SeqIO
    
    with open('all.fasta','r') as fasta_file:
        record_dict = SeqIO.to_dict(open(SeqIO.parse(fasta_file, 'fasta')
    
    with open('ortholog1.txt','r') as text_file:
        orthologs_txt = text_file.read()
    
    genes_to_keep = []     
    for ortholog in orthologs_txt.splitlines():
        try:
            genes_to_keep.append( record_dict[ortholog] )
        except KeyError:
            pass
    
    with open('ortholog1.fasta','w') as output_file:
        SeqIO.write(genes_to_keep, output_file, 'fasta')
    

    【讨论】:

    • 有没有办法按照ortholog1.txt中基因的相同顺序对输出fasta文件中基因的顺序进行排序?
    • 保持它们有序的一种方法是将记录读入字典,然后逐步遍历正交日志。查看更新的答案
    【解决方案2】:

    我不是 biopython 专家,所以我会把输出的细节留给你,但数据流可以很简单(代码已注释)

    # Initially we have a dictionary where the keys are gene names and the
    # values are all empty lists
    genes = {gene.strip():[] for gene in open('ortholog1.txt','r')}
    
    # parse the data
    for record in SeqIO.parse(open('all.fasta','r'), "fasta"):
        # append the current record if its id is in the keys of genes
        if record.id in genes:
           genes[record.id].append(record)
    
    with open('ortholog1.fasta','w') as fout:
        # we read again, line by line, the file of genes
        for gene in open('ortholog1.txt','r'):
            # if the list associated with the current gene is not empty
            if genes[gene.strip()]:
                # output the list of records for the current gene using
                # biopython facilities
                ...
    

    【讨论】:

      猜你喜欢
      • 1970-01-01
      • 2020-09-15
      • 2017-07-07
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 2023-02-19
      相关资源
      最近更新 更多