SeqIO.parse Biopython - 我应该指定哪种文件格式？答案

【问题标题】：SeqIO.parse Biopython - which file format should I specify?SeqIO.parse Biopython - 我应该指定哪种文件格式？
【发布时间】：2020-12-31 23:00:58
【问题描述】：

我正在尝试使用 biopython 从多 fasta 文件（例如 C/G/A/T 计数、CG%）中提取信息。当我尝试迭代每个 fasta 序列的文件时，我总是遇到麻烦 - 我只能打印出第一个。

我怀疑这可能与我的文件格式有关，因为它不是实际的 fasta 文件，但我不知道如何更改它。

input_file = open("inputfile.fa", 'r')
output_file = open('nucleotide_counts.txt','w')
output_file.write('Gene\tA\tC\tG\tT\tLength\tCG%\n')

#count nucleotides in this record..gene_name = cur_record.name
from Bio import SeqIO
for cur_record in SeqIO.parse(input_file, "fasta"):
    gene_name = cur_record.name 
    A_count = cur_record.seq.count('A')
    C_count = cur_record.seq.count('C')
    G_count = cur_record.seq.count('G')
    T_count = cur_record.seq.count('T')
    length = len(cur_record.seq)
    cg_percentage = (float(C_count + G_count) / length)*100
    output_line = '%s\t%i\t%i\t%i\t%i\t%i\t%f\n' % \
    (gene_name, A_count, C_count, G_count, T_count, length, cg_percentage)
    output_file.write(output_line)
output_file.close()
input_file.close()

这就是我的 multifasta 的样子（指定了开始和结束）

>1:start-end
CGCCCCAGTGATGTAGCCGAA
>1:start-end
CGGCCACCCCGAAGCGTGGGG

而我的输出文件只包含一行：

Gene    A       C       G       T       Length  CG%
1:start-end 85      115     180     59      439     67.198178

【问题讨论】：

乍一看，我看不出您的代码无法运行的任何明显原因。但我可以看到你的输入和输出不匹配。在这两个 fasta 记录中，没有 85 个 As。您能否发布该输入的示例输入和预期输出？
我读到你的文件有不同的格式，所以我实际上会删除我的答案。在我看来，这也应该有效
虽然我没有在此处发布正确的输入和输出，但我检查了它并为第一个序列提供了正确的值，所以我想我的循环方式一定有问题？
无法重现 - 您当前的代码不是最优雅的，但它确实工作，因为它会根据需要遍历所有记录。在您编辑之前，缩进是错误的，所以请在您的 actual 代码中检查它
您可能希望在计算氨基酸的步骤中包含多处理。如果它变成需要遍历的大型 fasta 文件，这是一项劳动强度大的工作。看看我的回答here你怎么能做这样的事情。

标签： python biopython fasta

【解决方案1】：


#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Dec 31 15:11:53 2020

@author: Pietro
"""
input_file = 'fasta'


output_file_name = input_file+'out'


#count nucleotides in this record..gene_name = cur_record.name
from Bio import SeqIO

output_file = open(output_file_name, 'w+')
output_file.write(('Gene\tA\tC\tG\tT\tLength\tCG%\n'))

for cur_record in SeqIO.parse(input_file, "fasta"):
    gene_name = cur_record.name 
    A_count = cur_record.seq.count('A')
    C_count = cur_record.seq.count('C')
    G_count = cur_record.seq.count('G')
    T_count = cur_record.seq.count('T')
    length = len(cur_record.seq)
    cg_percentage = (float(C_count + G_count) / length)*100
    output_line = str('%s\t%i\t%i\t%i\t%i\t%i\t%f\n' % (gene_name, A_count, C_count, G_count, T_count, length, cg_percentage))

    output_file.write(output_line)
output_file.close()

【讨论】：