Using a .fasta file to compute relative content of sequences: answers

【Question title】: Using a .fasta file to compute relative content of sequences
【Posted】: 2025-12-30 16:30:16
【Question description】:

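I'm a newbie who was only recently introduced to programming, through Perl, and I'm still getting used to all of it. I have a .fasta file that I need to work with, but I'm not sure whether I can open it or whether I have to work with it "blindly", so to speak.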

Anyway, the file I have contains the DNA sequences of three genes, written in .fasta format.

Apparently it looks like this:

>label
sequence
>label
sequence
>label
sequence

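My goal is to write a script that opens and reads the file, which I've now got the hang of. However, I also have to read each sequence, compute the relative amounts of "G" and "C" in each one, and then write the gene names and their respective "G" and "C" content to a tab-delimited file.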

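Can anyone offer some guidance? I'm not sure what a tab-delimited file is, and I'm still trying to figure out how to open a .fasta file to actually see its contents. The .txt files I've worked with so far open easily, but the .fasta file does not.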

I apologize for sounding completely lost, and I'd appreciate your patience. I'm not a professional like the rest of you!

【Question comments】:

Tags: perl sequence frequency fasta

【Solution 1】:

I know this is confusing, but you really should try to limit your question to one specific problem; see https://*.com/faq#questions

I don't know what a ".fasta" file is, or what "G" and "C" are... but that probably doesn't matter.

In general:

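Open the input file.
Read and parse the data. If it is in some odd format that you can't parse yourself, look on http://metacpan.org for a module that reads it. If you're lucky, someone has already done the hard part for you.
Compute whatever it is you want to compute.
Print to the screen (standard output) or to another file.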
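
The four steps above can be sketched in Perl, the asker's language, roughly as follows. This is only a minimal sketch under stated assumptions: the helper names read_fasta and base_fraction, the sample records, and the file name genes.fasta mentioned in a comment are made up for illustration, and "relative amount" is assumed to mean the count of a base divided by the sequence length. A FASTA file is plain text, so it is read line by line like any .txt file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read FASTA records from a filehandle; return a list of [name, sequence] pairs.
sub read_fasta {
    my ($fh) = @_;
    my (@records, $name, $seq);
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;            # strip Unix or Windows line endings
        next if $line eq '';
        if ($line =~ /^>(.*)/) {         # header line: start a new record
            push @records, [$name, $seq] if defined $name;
            ($name, $seq) = ($1, '');
        }
        elsif (defined $name) {          # sequence line (may span several lines)
            $seq .= $line;
        }
    }
    push @records, [$name, $seq] if defined $name;
    return @records;
}

# Fraction of the characters in $seq that equal $base, ignoring case.
sub base_fraction {
    my ($seq, $base) = @_;
    return 0 unless length($seq);
    my $count = () = $seq =~ /\Q$base\E/gi;
    return $count / length($seq);
}

# Hypothetical sample data read through an in-memory filehandle.
# With a real file: open my $in, '<', 'genes.fasta' or die "Cannot open: $!";
my $sample = ">geneA\nGGCC\nAATT\n>geneB\nacgt\n";
open my $in, '<', \$sample or die "Cannot open sample: $!";
my @records = read_fasta($in);
close $in;

# One tab-separated line per gene: name, G fraction, C fraction.
for my $rec (@records) {
    my ($name, $seq) = @$rec;
    printf "%s\t%.3f\t%.3f\n", $name,
        base_fraction($seq, 'G'), base_fraction($seq, 'C');
}
```

Keeping the parsing and the counting in separate subroutines means either one can be replaced later, for example by a CPAN module, without touching the other.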

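A "tab-delimited" file is a file with columns (think Excel) in which each column is separated by a tab ("\t") character, as a quick Google or * search would have told you.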
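
To make the idea concrete, here is a small Perl sketch that writes rows as tab-separated lines. The subroutine name write_tsv, the column names, the numbers, and the file name gc_content.tsv in the comment are all invented for illustration; none of them come from the question.

```perl
use strict;
use warnings;

# Write each row (an array reference) to a filehandle as one tab-separated line.
sub write_tsv {
    my ($fh, @rows) = @_;
    print {$fh} join("\t", @$_), "\n" for @rows;
}

# Made-up values, purely for illustration.
my @rows = (
    ['gene',  'G_fraction', 'C_fraction'],
    ['geneA', 0.25,         0.25],
    ['geneB', 0.31,         0.22],
);

# To write to a file instead of the screen:
#   open my $out, '>', 'gc_content.tsv' or die "Cannot write: $!";
#   write_tsv($out, @rows);
#   close $out;
write_tsv(\*STDOUT, @rows);
```

A file written this way opens directly in a spreadsheet program, with one column per tab-separated field.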

【Comments】:

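A FASTA file contains sequences, usually DNA. DNA sequences are encoded with four letters: A, C, T, and G. The preferred library for handling this kind of biological data in Perl is BioPerl: bioperl.org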

【Solution 2】:

Here is an approach that uses the "awk" utility, which is available from the command line. Save the following program to a file and run it with awk -f <path> <sequence file>

# Print one tab-separated line for the current gene: name, G percentage, C percentage.
# Nothing is printed before the first header or for a gene with no bases.
function report() {
    if (name != "" && total > 0)
        printf "%s\t%.2f%%\t%.2f%%\n", name, 100 * g / total, 100 * c / total
}
# A line starting with ">" is a header: report the previous gene, then reset the counters.
/^>/ {
    report()
    name = substr($0, 2)
    total = g = c = 0
    next
}
# Any other line is sequence data. Convert it to uppercase first, because some
# FASTA files contain lowercase bases.
{
    seq = toupper($0)
    total += length(seq)
# gsub returns the number of replacements it made, i.e. the number of matching bases
    g += gsub(/G/, "", seq)
    c += gsub(/C/, "", seq)
}
# The END block reports the last gene in the file.
END { report() }

【Comments】: