Perl Text-Parsing; Which algorithm is correct? (Answers)

【Question Title】: Perl Text-Parsing; Which algorithm is correct?
【Posted】: 2014-04-18 16:14:32
【Question Description】:

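I am writing a Perl script that takes two files as input: one is a tab-separated table whose second column holds the identifiers of interest, and the other is a file listing identifiers that match the second column of the table. The goal is to print only those rows of the table whose second column contains an identifier from the list, and to print each row only once. I have written three versions of this program and found that each one prints a different number of lines.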

Version 1:

    # TAB-SEPARATED TABLE FILE
    open (FILE, $file); 
    while (<FILE>) {
            my $line = $_;
            chomp $line;
            # ARRAY CONTAINING EACH IDENTIFIER AS A SEPARATE ELEMENT
            foreach(@refs) {  
                    my $ref = $_;
                    chomp $ref;
                    if ( $line =~ $ref) { print "$line\n"; next; }
            }
    }

Version 2:

    # ARRAY CONTAINING EVERY LINE OF THE TAB-SEPARATED TABLE AS A SEPARATE LINE
    foreach(@doc) { 
            my $full = $_;
            # IF LOOP FOR PRINTING THE HEADER BUT NOT COMPARING IT TO ARRAY BELOW
            if ( $counter == 0 ) { 
                      print "$full\n"; 
                      $counter++; 
                      next; }
            # EXTRACT IDENTIFIER FROM LINE
            my @cells = split('\t', $full); 
            my $gene = $cells[1]; 
            foreach(@refs) {
                my $text = $_;
                if ( $gene =~ $text && $counter == 1 ) { # COMPARE IDENTIFIER 
                        print "$full\n";
                        next;
                }
        }
        $counter--;
    }

Version 3:

    # LIST OF IDENTIFIERS
    foreach(@refs) {
        my $ref = $_;
        # LIST OF EACH ROW OF THE TABLE
        foreach(@doc) {
                my $line = $_;
                my @cells = split('\t', $line);
                my $gene = $cells[1];
                if ( $gene =~ $ref ) { print "$line\n"; next; }
        }
    }

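Each of these approaches gives me different output, and I don't understand why. I also can't tell whether I can trust any of them to give the correct output. The correct output should not contain any duplicate lines, but more than one line may match any given identifier in the list.

Sample input file: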

    Position        Symbol  Name    REF     ALT
    chr1:887801     NOC2L   nucleolar complex associated 2 homolog (S. cerevisiae)  A       G
    chr1:888639     NOC2L   nucleolar complex associated 2 homolog (S. cerevisiae)  T       C
    chr1:888659     NOC2L   nucleolar complex associated 2 homolog (S. cerevisiae)  T       C
    chr1:897325     KLHL17  kelch-like 17 (Drosophila)      G       C
    chr1:909238     PLEKHN1 pleckstrin homology domain containing, family N member 1        G       C
    chr1:982994     AGRN    agrin   T       C
    chr1:1254841    CPSF3L  cleavage and polyadenylation specific factor 3-like     C       G
    chr1:3301721    PRDM16  PR domain containing 16 C       T
    chr1:3328358    PRDM16  PR domain containing 16 T       C

The list is extracted from a file that looks like this:

    A1BG
    A2M
    A2ML1
    AAK1
    ABCA12
    ABCA13
    ABCA2
    ABCA4
    ABCC2

It is read into an array with the following code:

    open (REF, $ref_file);
    while (<REF>) {
        my $line = $_;
        chomp $line;
        push(@refs, $line);
     }
     close REF;

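【Comments】:

How about some sample input and output for the three versions?
Better to edit the question to add the sample data. The layout can't be seen in comments.
I don't see anything in versions 2 and 3 that checks whether a line has already been printed. Version 1 can print a line from FILE only once, but it does not check whether the file contains multiple copies of the same line.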
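
One possible source of the differing counts can be sketched as follows. In all three versions the comparison is an unanchored regex match (`=~`), so an identifier also matches any longer identifier that contains it, and the unlabeled `next` sits inside the inner `foreach`, so it only moves on to the next identifier rather than the next row. The identifiers and the row below are hypothetical, chosen only to show the effect; whether this explains the asker's real data is an assumption.

```perl
use strict;
use warnings;

# Hypothetical identifiers and a single hypothetical table row.
my @refs = ('ABCA1', 'ABCA12');
my @rows = ("chr1:1\tABCA12\tdemo gene\tA\tG");

# Unanchored regex match; `next` only advances the inner loop,
# so one row can be counted once per matching identifier.
sub count_regex_prints {
    my ($rows, $refs) = @_;
    my $printed = 0;
    for my $row (@$rows) {
        my $gene = (split /\t/, $row)[1];
        for my $ref (@$refs) {
            if ($gene =~ $ref) { $printed++; next; }
        }
    }
    return $printed;
}

# Exact string comparison; `last` leaves the inner loop after the first hit.
sub count_exact_prints {
    my ($rows, $refs) = @_;
    my $printed = 0;
    for my $row (@$rows) {
        my $gene = (split /\t/, $row)[1];
        for my $ref (@$refs) {
            if ($gene eq $ref) { $printed++; last; }
        }
    }
    return $printed;
}

print "regex match: ", count_regex_prints(\@rows, \@refs), "\n";
print "exact match: ", count_exact_prints(\@rows, \@refs), "\n";
```

With these inputs the regex variant counts the single row twice, while the exact variant counts it once.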

Tags: perl bioinformatics text-parsing

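【Solution 1】:

Whenever you hear "I need to look something up", think hash.

What you can do is create a hash that contains the elements you want to pull out of file #1. Then use a second hash to keep track of whether you have already printed it: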

    #!/usr/bin/env perl

    use warnings;
    use strict;
    use feature qw(say);

    use autodie;   # This way, I don't have to check my open for failures

    use constant {
        TABLE_FILE          => "file1.txt",
        LOOKUP_FILE         => "file2.txt",
    };

    open my $lookup_fh, "<", LOOKUP_FILE;

    my %lookup_table;
    while ( my $symbol = <$lookup_fh> ) {
        chomp $symbol;   # Must end in a semicolon, or the key keeps its newline
        $lookup_table{$symbol} = 1;
    }

    close $lookup_fh;

    open my $table_file, "<", TABLE_FILE;

    my %is_printed;
    while ( my $line = <$table_file> ) {
        chomp $line;
        my @line_array = split /\s+/, $line;
        my $symbol = $line_array[1];
        if ( exists $lookup_table{$symbol} and not exists $is_printed{$symbol} ) {
            say $line;
            $is_printed{$symbol} = 1;
        }
    }

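Two loops, but far more efficient. With your approach, if there are 100 items in the first file and 1000 in the second, you have to loop 100 * 1000 times, or 100,000. With the hash, you only loop over the total number of lines in the two files (1,100 in this example).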
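
The arithmetic can be checked with a small counting sketch. The identifiers and second-column values are generated placeholders, not real gene symbols; only the step counts matter here.

```perl
use strict;
use warnings;

# Placeholder data: 100 identifiers and 1000 second-column values.
my @refs  = map { "GENE$_" } 1 .. 100;
my @genes = map { "GENE" . ( 1 + $_ % 200 ) } 1 .. 1000;

# Nested loops: every row is compared against every identifier.
my $nested_steps = 0;
for my $gene (@genes) {
    for my $ref (@refs) {
        $nested_steps++;
    }
}

# Hash lookup: one pass to build the table, one pass over the rows.
my $hash_steps = 0;
my %lookup;
for my $ref (@refs) {
    $lookup{$ref} = 1;
    $hash_steps++;
}
my $matches = 0;
for my $gene (@genes) {
    $hash_steps++;
    $matches++ if exists $lookup{$gene};
}

print "nested loops: $nested_steps steps\n";
print "hash lookup:  $hash_steps steps\n";
```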

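I use the three-argument form of open, which lets you handle files whose names start with characters such as | or <. I also use variables for my file handles, which makes it easier to pass a file handle to a subroutine if needed.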
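
As a minimal sketch of passing a lexical file handle to a subroutine: the helper name `read_symbols` is made up for illustration, and an in-memory string stands in for a real file so the example is self-contained.

```perl
use strict;
use warnings;

# Read every line from an already-open handle and return the chomped lines.
sub read_symbols {
    my ($fh) = @_;
    my @symbols;
    while ( my $line = <$fh> ) {
        chomp $line;
        push @symbols, $line;
    }
    return @symbols;
}

# Three-argument open on a reference to a string reads from memory.
my $data = "A1BG\nA2M\nNOC2L\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @symbols = read_symbols($fh);
close $fh;

print scalar(@symbols), " symbols read\n";
```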

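I use `use autodie;` to handle problems such as a file that fails to open. In your program, the script would just continue on its merry way. If you don't want to use autodie, you need to do this: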

    open my $fh, "<", $my_file or die qq(Couldn't open "$my_file" for reading: $!);
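
A short sketch of what the manual check does when the file is missing. The path is a made-up one that is assumed not to exist, and `try_open` is a hypothetical helper. Note the low-precedence `or`: with `||` the `die` would bind to the file name rather than to the result of `open`.

```perl
use strict;
use warnings;

# Try to open a file for reading; return 'opened' or the error message.
sub try_open {
    my ($path) = @_;
    my $ok = eval {
        open my $fh, '<', $path
            or die qq(Couldn't open "$path" for reading: $!\n);
        close $fh;
        1;
    };
    return $ok ? 'opened' : $@;
}

# Hypothetical path, assumed not to exist on this machine.
my $result = try_open('/no/such/dir/no_such_file.txt');
print $result;
```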

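I use two hashes. The first is %lookup_table, which stores the symbols you want to print. As I go through the first file, I can simply check whether $lookup_table{$symbol} exists. If it doesn't, I don't print the line; if it does, I print it.

The second hash, %is_printed, keeps track of the symbols I have already printed. If $is_printed{$symbol} exists, I know I have already printed that line.

Even though you said the table is tab-separated, I use /\s+/ as the split regex. This catches a single tab, but it also copes if someone used two tabs (to keep things looking nice) or accidentally typed a space before the tab.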
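
The difference between the two split patterns can be sketched like this. The double tab in the first row is hypothetical damage added for illustration; the second row is taken from the sample table in the question.

```perl
use strict;
use warnings;

# A row with an accidental double tab before the symbol (hypothetical).
my $line = "chr1:982994\t\tAGRN\tagrin\tT\tC";

my @by_tab        = split /\t/,  $line;   # keeps the empty field
my @by_whitespace = split /\s+/, $line;   # collapses the run of tabs

print "tab split:        '$by_tab[1]'\n";
print "whitespace split: '$by_whitespace[1]'\n";

# A Name field that contains spaces is cut into several pieces by /\s+/,
# but the symbol is still found at index 1.
my $row = "chr1:897325\tKLHL17\tkelch-like 17 (Drosophila)\tG\tC";
my @tab_fields = split /\t/,  $row;
my @ws_fields  = split /\s+/, $row;

printf "%d fields by tab, %d by whitespace\n",
    scalar @tab_fields, scalar @ws_fields;
```

The trade-off is that /\s+/ no longer keeps the later columns aligned, which is harmless as long as only the second field is used.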

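【Discussion】:

This solution works with just one small modification: the hash %is_printed needs to use the whole line as its key instead of $symbol. Thanks.
I misunderstood what you wanted. I thought you only wanted the first line for each $symbol printed. Yes, if you want lines that are distinct but share a symbol, you can use the whole line. You have to be careful if two duplicate lines differ only in whitespace. You may want to normalize the whitespace in a copy of the line, for example with (my $key = $line) =~ s/\s+/ /g;, and use $key as the key for %is_printed so that stray whitespace doesn't cause problems.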
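
A minimal sketch of the modification discussed above, assuming the rows are already in memory. The helper name `filter_rows` and the third row (a duplicate with a stray space) are invented for illustration; the other rows come from the sample table.

```perl
use strict;
use warnings;

# Return the rows whose second column is a key of %$lookup,
# keeping each distinct row (after whitespace normalization) only once.
sub filter_rows {
    my ($rows, $lookup) = @_;
    my ( %is_printed, @kept );
    for my $line (@$rows) {
        my $symbol = ( split /\t/, $line )[1];
        next unless defined $symbol && exists $lookup->{$symbol};

        # Normalize whitespace in a copy, so the original line is untouched.
        ( my $key = $line ) =~ s/\s+/ /g;
        next if $is_printed{$key}++;

        push @kept, $line;
    }
    return @kept;
}

my %lookup = map { $_ => 1 } qw(NOC2L A1BG);
my $name   = 'nucleolar complex associated 2 homolog (S. cerevisiae)';
my @rows   = (
    "chr1:887801\tNOC2L\t$name\tA\tG",
    "chr1:888639\tNOC2L\t$name\tT\tC",
    "chr1:888639\tNOC2L\t$name \tT\tC",    # duplicate with a stray space
    "chr1:897325\tKLHL17\tkelch-like 17 (Drosophila)\tG\tC",
);

my @kept = filter_rows( \@rows, \%lookup );
print "$_\n" for @kept;
```

Both distinct NOC2L rows are kept, the whitespace-only duplicate is dropped, and the KLHL17 row is skipped because it is not in the lookup table.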

【Solution 2】:

I'm pretty sure this should work:

    $ awk '
        NR == FNR {Identifiers[$1]; next}
        $2 in Identifiers {
            $1 = ""; $0 = $0; if(!Printed[$0]++) {print}
        }' identifiers_file.txt data_file.txt

Given an identifiers_file.txt like this (I added NOC2L to it, because none of the identifiers in your sample matched):

    A1BG
    A2M
    A2ML1
    AAK1
    ABCA12
    ABCA13
    ABCA2
    ABCA4
    ABCC2
    NOC2L

then your output would be:

    $ awk '
        NR == FNR {Identifiers[$1]; next}
        $2 in Identifiers {
            $1 = ""; $0 = $0; if(!Printed[$0]++) {print}
        }' idents.txt data.txt
     NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) A G
     NOC2L nucleolar complex associated 2 homolog (S. cerevisiae) T C

If that is correct and you want a Perl version, you can run:

    $ echo 'NR == FNR {Identifiers[$1]; next} $2 in Identifiers { $1 = ""; $0 = $0; if(!Printed[$0]++) {print} }' \
        | a2p
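
For comparison, here is a hand-written Perl sketch of roughly the same logic. It is not what a2p would generate. The helper name `awk_like_filter` is made up, and two details differ from the awk script on purpose: fields are split on tabs rather than on any whitespace, and the first column is dropped entirely instead of being blanked, so the output is tab-separated with no leading space.

```perl
use strict;
use warnings;

# Keep rows whose second field is a known identifier, drop the first
# field, and return each remaining record only once (tab-separated).
sub awk_like_filter {
    my ($identifiers, $rows) = @_;
    my %wanted = map { $_ => 1 } @$identifiers;
    my ( %printed, @out );
    for my $row (@$rows) {
        my @fields = split /\t/, $row;
        next unless defined $fields[1] && $wanted{ $fields[1] };
        shift @fields;                       # discard the Position column
        my $record = join "\t", @fields;
        push @out, $record unless $printed{$record}++;
    }
    return @out;
}

my $name = 'nucleolar complex associated 2 homolog (S. cerevisiae)';
my @rows = (
    "chr1:887801\tNOC2L\t$name\tA\tG",
    "chr1:888639\tNOC2L\t$name\tT\tC",
    "chr1:888659\tNOC2L\t$name\tT\tC",
    "chr1:897325\tKLHL17\tkelch-like 17 (Drosophila)\tG\tC",
);

my @out = awk_like_filter( [qw(A1BG A2M NOC2L)], \@rows );
print "$_\n" for @out;
```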

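【Discussion】:

I'm not familiar with awk, so I tried transcribing the command into my own command line, changing nothing except the file names at the end. I got this error message: awk: syntax error at source line 1 context is NR == FNR {Identifiers[$1]; next;} $2 in Identifiers {$1 = ""; $0 = $0 >>> if
Try copying and pasting. You have no semicolon (;) after $0 = $0.
Thanks, that worked. Is there a way to modify the awk script so that the output is tab-separated?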

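【Solution 3】:

I suggest you blend your first and second versions together and add a hash to them. I start from the first version because it parses your data file line by line in a nice, clear way.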

    #!/usr/bin/perl

    use strict;
    use warnings;
    use autodie;

    open (REF, $ARGV[0]);
    my %refs;
    while (<REF>) {
        my $line = $_;
        chomp $line;
        $refs{$line} = 0;
    }
    close REF;

    #for head printing
    $refs{'Symbol'} = 0;

    open (FILE, $ARGV[1]); 
    while (<FILE>) {
            my $line = $_;
            my @cells = split('\t', $line);
            my $gene = $cells[1];
            #print $line, "\n" if exists $refs{$gene};
            if(exists $refs{$gene} and $refs{$gene} == 0)
            {
                $refs{$gene}++;
                print $line;
            }
    }
    close FILE;
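
One thing worth noting about the counter stored in %refs: it is kept per symbol, so only the first row for each symbol gets printed. A minimal in-memory sketch of that behavior follows, using rows from the sample table. The helper name `first_row_per_symbol` is hypothetical.

```perl
use strict;
use warnings;

# Return at most one row per symbol: the first row whose second
# column is one of the given symbols.
sub first_row_per_symbol {
    my ($symbols, $rows) = @_;
    my %refs = map { $_ => 0 } @$symbols;
    my @out;
    for my $line (@$rows) {
        my $gene = ( split /\t/, $line )[1];
        next unless defined $gene && exists $refs{$gene};
        push @out, $line if $refs{$gene}++ == 0;   # first hit only
    }
    return @out;
}

my $name = 'nucleolar complex associated 2 homolog (S. cerevisiae)';
my @rows = (
    "chr1:887801\tNOC2L\t$name\tA\tG",
    "chr1:888639\tNOC2L\t$name\tT\tC",
    "chr1:3301721\tPRDM16\tPR domain containing 16\tC\tT",
);

my @out = first_row_per_symbol( ['NOC2L'], \@rows );
print "$_\n" for @out;
```

If every distinct matching row is wanted instead, the seen-check has to be keyed on the whole line rather than on the symbol.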

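【Discussion】:

What happens if the file can't be opened? Add use autodie; (as well as use warnings;).
@DavidW. I was trying to keep it as simple as the OP's code, but you're probably right. Added, tnx)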