Perl script to subset multiple DNA sequences: answers

【Question title】: Perl script to subset multiple DNA sequences
【Posted】: 2017-12-20 21:11:37
【Question】:

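I have a FASTA file containing about 500 DNA sequences, each of which has a single-nucleotide polymorphism (SNP) at a target position that I already know.

For the entries in that file I have a separate tab-delimited text file, each line of which holds the

FASTA sequence name
start position
end position
SNP position

The sequences and positions are in the same order in both files.

The dummy FASTA file is: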

>AOS-94_S25_L002_R1_001_trimmed_contig_767
GACACACACTGATTGTTAGTGGTGTACAGACATTGCTTCAAACTGCA
>AOS-94_S25_L002_R1_001_trimmed_contig_2199
TAGGTTTTCTTTCCCATGTCCCCTGAATAACATGGGATTCCCTGTGACTGTGGGGACCCCTGAGAGCCTGGT
>AOS-94_S25_L002_R1_001_trimmed_contig_2585
GATAAGGAGCTCACAGCAACCCACATGAGTTGTCC

The dummy position file is:

AOS-94_S25_L002_R1_001_trimmed_contig_767   5   15  10
AOS-94_S25_L002_R1_001_trimmed_contig_2199  8   19  11
AOS-94_S25_L002_R1_001_trimmed_contig_2585  4   20  18

This is the script I wrote and tried:

use warnings;
use strict; 

# Read in the complete FASTA file:
print "What is the name of the fasta contig file?\n";
my $fasta = <STDIN>;
chomp $fasta;

# Read in file of contig name, start pos, stop pos, SNP pos in tab delimited text:
print "Name of text file with contig name and SNP position info? \n";
my $text = <STDIN>;
chomp $text;

# Output file
print "What are we calling the output? \n";
my $out= <STDIN>;
chomp $out;

local $/ = "\n>"; #Read by fasta record

my $seq1 = (); 

open(FASTA,$fasta) || die "\n Unable to open the file!\n";
open(POS,$text) || die "\n Unable to open the file! \n";
my @fields = <POS>;
    while (my $seq = <FASTA>){
        chomp $seq;
        my @seq = split(/\n/,$seq);
            if($seq[0] =~ /^>/){
                $seq1 = $seq[0];
            }elsif($seq[0] =~ /[^>]/){ #matches any character except the >
                $seq1 = ">".$seq[0];
            }
    for my $pos (@fields){
        chomp $pos;
        my @field = split(/\t/,$pos);
    open(OUTFILE,">>$out");
    print OUTFILE "$seq1";
    my $subseq = substr $seq[1], $field[1] -1, $field[2] - $field[1]; 
    print OUTFILE "$subseq\n";
    }   
}
close FASTA;
close POS;
close OUTFILE;

This produces the output I expected, which is what I want:

>AOS-94_S25_L002_R1_001_trimmed_contig_767
CACACTGATT
>AOS-94_S25_L002_R1_001_trimmed_contig_2199
TTTTCTTTCC
>AOS-94_S25_L002_R1_001_trimmed_contig_2585
AGGAGCTCAC

However, I also need to print the SNP position (column 4) after the sequence name, e.g.,

>AOS-94_S25_L002_R1_001_trimmed_contig_767
pos=10
CACACTGATT
>AOS-94_S25_L002_R1_001_trimmed_contig_2199
pos=11
TTTTCTTTCC
>AOS-94_S25_L002_R1_001_trimmed_contig_2585
pos=18
AGGAGCTCAC

I tried inserting print OUTFILE "pos= $field[3]\n"; after print OUTFILE "$seq1"; and got the following:

>AOS-94_S25_L002_R1_001_trimmed_contig_767
10
AOS-94_S25_L002_R1_001_trimmed_contig_2199
CACACTGATT
>AOS-94_S25_L002_R1_001_trimmed_contig_2199
10
AOS-94_S25_L002_R1_001_trimmed_contig_2199
TTTTCTTTCC
>AOS-94_S25_L002_R1_001_trimmed_contig_2585
10
AOS-94_S25_L002_R1_001_trimmed_contig_2199
AGGAGCTCAC

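Obviously I have messed up my loops, and probably some of the chomp commands as well.

For example, when I print "$seq1" to a file, why does it not need a "\n" in the printed string? Must there already be a hard return in the string?

I know I am missing some basics about how this is structured, but so far I can't work out how to fix my mistakes. Can anyone offer some advice?

Update

Perl code reformatted for readability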

use warnings;
use strict;

# Read in the complete FASTA file:
print "What is the name of the fasta contig file?\n";
my $fasta = <STDIN>;
chomp $fasta;

# Read in file of contig name, start pos, stop pos, SNP pos in tab delimited text:
print "Name of text file with contig name and SNP position info? \n";
my $text = <STDIN>;
chomp $text;

#Output file
print "What are we calling the output? \n";
my $out = <STDIN>;
chomp $out;

local $/ = "\n>";    # Read by FASTA record

my $seq1 = ();

open( FASTA, $fasta ) || die "\n Unable to open the file!\n";
open( POS,   $text )  || die "\n Unable to open the file! \n";

my @fields = <POS>;

while ( my $seq = <FASTA> ) {

    chomp $seq;
    my @seq = split( /\n/, $seq );

    if ( $seq[0] =~ /^>/ ) {
        $seq1 = $seq[0];
    }
    elsif ( $seq[0] =~ /[^>]/ ) {    # matches any character except the >
        $seq1 = ">" . $seq[0];
    }

    for my $pos ( @fields ) {
        chomp $pos;
        my @field = split( /\t/, $pos );
        
        open( OUTFILE, ">>$out" );
        print OUTFILE "$seq1";

        my $subseq = substr $seq[1], $field[1] - 1, $field[2] - $field[1];
        print OUTFILE "$subseq\n";
    }
}

close FASTA;
close POS;
close OUTFILE;

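【Comments】:

Tags: perl

【Solution 1】:

There are a number of problems with your code

Your comments don't correspond to the code. For instance, you have Read in the complete FASTA file when the code just accepts the file name from STDIN and trims it. It is usually best to write clean code with well-chosen identifiers; that way the program explains itself

You are using the two-parameter form of open and global file handles. You also don't include the reason for the failure in the die string, and you have a newline at the end, which prevents Perl from giving you the source file name and line number where the error occurred

Something like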

open( FASTA, $fasta ) || die "\n Unable to open the file!\n"

should be

open my $fasta_fh, '<', $fasta_file or die qq{Unable to open "$fasta_file" for input: $!}

and

open( OUTFILE, ">>$out" );

should be

open my $out_fh, '>>', $output_file or die qq{Unable to open "$output_file" for appending: $!}
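
The point about the trailing newline in the die string is easy to check. The following is a minimal sketch, not part of the original answer: it captures both messages with eval so that they can be compared side by side, and the variable names are purely illustrative.

```perl
use strict;
use warnings;

# Capture the message that die produces, with and without a trailing newline.
my ( $with_newline, $without_newline );

eval { die "Unable to open the file!\n" };
$with_newline = $@;      # exactly the text given, nothing appended

eval { die "Unable to open the file!" };
$without_newline = $@;   # Perl appends " at <script> line <N>.\n"

print "With newline:    $with_newline";
print "Without newline: $without_newline";
```

Only the second message tells you where in the program the failure happened, which is usually what you want while debugging.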

You should avoid putting quotes around variable names.
```
print OUTFILE "$seq1"
```
should be
```
print OUTFILE $seq1
```
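You are setting the input record separator to "\n>". That means every time you call <FASTA>, Perl will read up to the next occurrence of that string. It also means that chomp will remove exactly that string from the end of the record, if it is there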
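
As a rough illustration of that behaviour (a sketch only, using a made-up two-record FASTA string read through an in-memory file handle rather than a real file):

```perl
use strict;
use warnings;

# In-memory stand-in for a FASTA file (hypothetical data).
my $fasta_text = ">seq1\nACGT\n>seq2\nTTGA\n";
open my $fh, '<', \$fasta_text or die "Cannot open in-memory file: $!";

my @records;
{
    local $/ = "\n>";              # one FASTA record per read
    while ( my $record = <$fh> ) {
        chomp $record;             # removes a trailing "\n>", not a bare "\n"
        push @records, $record;
    }
}
close $fh;

# $records[0] is ">seq1\nACGT"   (leading ">" kept, trailing "\n>" chomped)
# $records[1] is "seq2\nTTGA\n"  (its ">" was consumed with the previous record;
#                                 the final "\n" stays because it is not "\n>")
printf "%d records\n", scalar @records;
```

This is why every record after the first arrives without its leading >, and why the last record can still carry a plain newline.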

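The biggest problem is that you never reset $/ before reading from POS. Remember that its setting affects every readline (or <>) and every chomp. And because your $text file probably doesn't contain a > character at the start of a line, you will read the entire file in one go

That is why you are seeing newlines in your output without asking for them. You have read the whole file, together with all of its embedded newlines, and chomp is useless here because you have changed the string that it removes

local is named that way for a reason. It changes the value temporarily, and the change is local to the current scope. But your "current scope" is the entire rest of the file, so you are reading both files with the modified terminator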
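
The effect of scope can be seen in a small sketch. The position data and the read_records helper below are hypothetical and exist only for this illustration; the helper reads from an in-memory string so that the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical two-line position file held in memory.
my $pos_text = "contig_1\t5\t15\t10\ncontig_2\t8\t19\t11\n";

# Illustrative helper: read every record from a string using the current $/.
sub read_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

my ( @inside, @outside );
{
    local $/ = "\n>";
    @inside = read_records($pos_text);    # $/ is still "\n>" here
}
@outside = read_records($pos_text);       # $/ is back to "\n"

printf "inside the block:  %d record(s)\n", scalar @inside;     # 1
printf "outside the block: %d record(s)\n", scalar @outside;    # 2
```

Inside the block the whole position file comes back as a single record, which is the situation in the original script; once the block ends, the same read yields one record per line again.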

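Use a pair of braces { ... } to limit the scope of the local modification. You may also see the IO::Handle method call

$fasta_fh->input_record_separator("\n>")

suggested as a way to avoid localizing $/. Be careful with it: the IO::Handle documentation lists input_record_separator among the methods that are not supported on a per-filehandle basis, so the call changes the global $/ rather than a setting private to that file handle. A block with local remains the reliable way to confine the change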

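Here is a modified version of your program, which also fixes some poor choices of identifiers and a few other things. Note that this code is untested. I am working on a train at the moment and can only check what I am writing in my head

Note that things like while ( <$fasta_fh> ) and for ( @pos_records ) use the default variable $_ when no loop variable is specified. Likewise, operators such as chomp and split apply to $_ when the corresponding parameter is missing. That way there is no need to mention a variable explicitly, and it leads to more concise and readable code. $_ is the equivalent of it in English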
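
A short sketch of $_ at work, using two made-up position records (the data and variable names are illustrative only):

```perl
use strict;
use warnings;

# Two hypothetical tab-separated position records.
my @lines = ( "contig_1\t5\t15\t10\n", "contig_2\t8\t19\t11\n" );
my @snp_positions;

for (@lines) {                  # each element is aliased to $_
    chomp;                      # same as chomp($_)
    my @fields = split /\t/;    # same as split /\t/, $_
    push @snp_positions, $fields[3];
}

print "@snp_positions\n";       # prints "10 11"
```

Because $_ is an alias for the array element, the chomp also strips the newline from the strings stored in @lines.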

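I encourage you to understand what the things you are writing actually do. It has become common practice to copy code from one part of the internet and offer it to some kind souls elsewhere to get it working for you. That isn't "learning to program", and you won't understand anything unless you study the language and put your mind to it

And please take more care when laying out your code. I hope you can see that the edit I made to your question, and the code in my solution, are much easier to read than the program you posted? While you are welcome to make your own work as awkward as you like, it is unfair and impolite to offer a mess like that to a world of total strangers from whom you are asking free programming help. A decent middle ground is to change your editor so that it indents by four spaces when the Tab key is pressed. Never use tab characters in source code!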

use strict;
use warnings 'all';

print "Name of the FASTA contig file: ";
chomp( my $fasta_file = <STDIN> );

print "Name file with SNP position info: ";
chomp( my $pos_file = <STDIN> );

print "Name of the output file: ";
chomp( my $out_file = <STDIN> );

open my $out_fh, '>', $out_file or die qq{Unable to open "$out_file" for output: $!};

my @pos_records = do {
    open my $pos_fh, '<', $pos_file or die qq{Unable to open "$pos_file" for input: $!};
    <$pos_fh>;
};
chomp @pos_records; # Remove all newlines

{
    open my $fasta_fh, '<', $fasta_file or die qq{Unable to open "$fasta_file" for input: $!};

    local $/ = "\n>"; # Reading FASTA format now

    while ( <$fasta_fh> ) {

        chomp;    # Remove "\n>" from the end

        my ( $header, $seq ) = split /\n/; # Separate the two lines

        $header =~ s/^>?/>/; # Replace any chomped >

        for ( @pos_records ) {

            my ( $name, $beg, $end, $pos ) = split /\t/;
            my $subseq = substr $seq, $beg-1, $end-$beg;

            print $out_fh "$header\n";
            print $out_fh "pos=$pos\n";
            print $out_fh "$subseq\n";
        }
    }
} # local $/ expires here

close $out_fh or die $!;

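【Comments】:

I apologize for the sloppy formatting, and I'm sorry you had to spend time fixing it. I sincerely appreciate the time you took to help me. I am trying to get your example to work, but something is going on in the loop: the output file contains 9 FASTA sequences instead of 3. Anyway, I am trying to figure it out now. Your example is much more concise and easier to follow.
@user: OK, the problem is that every record in the FASTA file is being paired with every line in the pos file. I wrote it that way because that is what your own code does, but it looks like you need to pair the lines up by sequence ID. I will fix it in the morning.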
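
One way the pairing by sequence ID mentioned in that comment might look is a hash keyed on the sequence name. This is only a sketch, not the answerer's promised fix. It assumes each FASTA header is exactly the name used in the position file; the names %pos_for, $fasta_text and $pos_text are hypothetical; and the inputs are shortened copies of the question's dummy data, read through in-memory file handles.

```perl
use strict;
use warnings;

# Hypothetical in-memory inputs, shortened from the question's dummy data.
my $pos_text   = "contig_767\t5\t15\t10\ncontig_2199\t8\t19\t11\n";
my $fasta_text = ">contig_767\nGACACACACTGATTGTTAGTGG\n"
               . ">contig_2199\nTAGGTTTTCTTTCCCATGTCCCC\n";

# Index the position records by sequence name.
my %pos_for;
open my $pos_fh, '<', \$pos_text or die "Cannot open positions: $!";
while ( my $line = <$pos_fh> ) {
    chomp $line;
    my ( $name, $beg, $end, $snp ) = split /\t/, $line;
    $pos_for{$name} = { beg => $beg, end => $end, snp => $snp };
}
close $pos_fh;

my $output = '';
{
    open my $fasta_fh, '<', \$fasta_text or die "Cannot open FASTA: $!";
    local $/ = "\n>";                           # FASTA records from here on
    while ( my $record = <$fasta_fh> ) {
        chomp $record;
        my ( $header, $seq ) = split /\n/, $record;
        ( my $name = $header ) =~ s/^>//;       # bare sequence name
        my $p = $pos_for{$name} or next;        # skip sequences with no position record
        # Same arithmetic as the code in this thread: start at $beg, $end - $beg bases
        my $subseq = substr $seq, $p->{beg} - 1, $p->{end} - $p->{beg};
        $output .= ">$name\npos=$p->{snp}\n$subseq\n";
    }
    close $fasta_fh;
}
print $output;
```

Each sequence is looked up once instead of being compared against every position line, so every FASTA record produces at most one output entry.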

【Solution 2】:

OK, after a few very minor modifications your code works well. Here is the solution that worked for me:

#!/usr/bin/perl
use strict;
use warnings;

print "Name of the FASTA contig file: ";
chomp( my $fasta_file = <STDIN> );

print "Name file with SNP position info: ";
chomp( my $pos_file = <STDIN> );

print "Name of the output file: ";
chomp( my $out_file = <STDIN> );

open my $out_fh, '>', $out_file or die qq{Unable to open "$out_file" for output: $!};


my @pos_records = do {
    open my $pos_, '<' , $pos_file or die qq{Unable to open "$pos_file" for input: $!};
    <$pos_>;
};
chomp @pos_records; #remove all newlines  

{
     open my $fasta_fh, '<', $fasta_file or die qq{Unable to open "$fasta_file" for input: $!};

     local $/ = "\n>"; #Reading FASTA format now

     for ( <$fasta_fh> ) {

         chomp; #Remove "\n>" from the end

         my ( $header, $seq) = split /\n/; #separate the two lines

         $header = ">$header" unless $header =~ /^>/; # Replace any chomped >


     for ( @pos_records ) {

             my ($name,$beg,$end,$pos) = split /\t/;
             my $subseq = substr $seq, $beg-1, $end-$beg;
             my $final_SNP = $end - $pos; 

             if($header =~ /$name/){

               print $out_fh "$header\n";
               print $out_fh "pos=$final_SNP\n";
               print $out_fh "$subseq\n";
     }
    } 
  }
} #local expires here

close $out_fh or die $!;

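The only substantive thing I changed was to add an if statement. Without it, each FASTA sequence was written three times, once with each of the three SNP positions. I also slightly changed how I label the SNP position: after the sequence has been trimmed it is actually $end - $pos rather than just $pos.

Thank you again; it is clear that you spent a lot of time helping me, and for what it's worth, I sincerely appreciate it. Your solution will serve as a template for my future work (probably similar operations on FASTA files), and your explanation helped me better understand what local does in a way that my pea brain can grasp.

【Comments】:

I'm glad you got it working, but you took no notice of what I wrote about your indentation. It is horrible and very inconsistent, and extremely hard to read. It is very important to get this right, because consistent indentation makes mistakes much easier to spot.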