解析多 fasta 文件以提取序列答案

【问题标题】：Parse multi-fasta file to extract out sequences解析多 fasta 文件以提取序列
【发布时间】：2018-03-30 16:14:01
【问题描述】：

我正在尝试用 python 编写一个脚本来解析一个大的 fasta 文件，我不想使用 biopython，因为我正在学习脚本。脚本需要将入藏号、序列长度、序列gc内容打印到控制台。我已经能够提取入藏号，但无法提取序列，因为它们被读取为行，这使我无法计算序列长度和 gc 内容。

谁能帮帮我？我尝试将列表中的行分组，但随后在列表中创建了多个列表，我也不知道如何加入它们。

seq=""
seqcount=0
seqlen=0
gc=0

#prompt user for file name
infile=input("Enter the name of your designated .fasta file: ")

with open(infile, "r") as fasta:
    print("\n")
    print ("Accession Number \t Sequence Length \t GC content (%)")
    for line in fasta:
       line.strip()
       if line[0]==">":
           seqcount+=1 #counts number sequences in file
           accession=line.split("|")[3] #extract accession
           seq=""
       else: 
           seq+=line[:-1]
           seqlen=len(seq)
           print(accession, "\t \t", seqlen)

print("\n")           
print("There are a total of", seqcount, "sequences in this file.")

【问题讨论】：

什么是fasta文件？它有什么格式？我想没有多少人知道。
我认为您处于一个小众领域，正如@EugeneLisitsky 所说，没有多少人会知道 fasta 文件的外观并且无法为您提供帮助，如果您是最好的选择想自己学习如何做是通过 biopython 方法阅读。 github.com/biopython/biopython/blob/master/Bio/SeqIO/FastaIO.py
.fasta 文件是一种基于文本的格式，通常用于表示核苷酸或肽。标题或基因名称通常看起来像这样：>gene|accession ACTGACTAGGGACTGADEA
这种情况的幽默是 - 我熟悉 python 和遗传学（密码子三联体等），但我从未使用过 fasta 或 biopython。所以好的定义是非常有用的。谢谢 ;)

标签： python fasta

【解决方案1】：

你离正确的代码不远了：

seq=""
seqcount=0

#prompt user for file name
infile=input("Enter the name of your designated .fasta file: ")

def pct_gc(s):
    gc = s.count('G') + s.count('C') + s.count('g') + s.count('c')
    total = len(s)

    return gc*100.0/total

with open(infile, "r") as fasta:
    print("\n")
    print ("Accession Number\tSequence Length\tGC content (%)")
    for line in fasta:
        line = line.strip()
        if line[0]==">":
            if seq != "":
                print("{}\t{}\t{}".format(accession, pct_gc(seq), len(seq)))

            seqcount+=1 #counts number sequences in file
            accession=line.split("|")[3] #extract accession
            seq=""
        else: 
            seq+=line[:-1]

print("{}\t{}\t{}".format(accession, pct_gc(seq), len(seq)))
print("\n")           
print("There are a total of " + str(seqcount) + " sequences in this file.")

要寻找的东西：

您不需要在每次迭代中更新长度。最后只计算它。
str.strip() 不修改对象，而是返回一个剥离的对象
当您找到下一个并且该序列不为空时，您必须使用您知道已阅读完整序列的事实。到那时，您必须编写输出。
最后一个序列没有被新的加入完成，因此您必须在循环结束后独立处理它。
使用字符串格式或连接字符串。如果你只是把字符串和变量用逗号分隔，你会得到一个元组表示输出。

【讨论】：