如何加快两个文件之间的模式匹配答案

【问题标题】：how to speed up pattern match between two files如何加快两个文件之间的模式匹配
【发布时间】：2014-09-15 08:46:53
【问题描述】：

输入文件1：

col1    col2    col3    col4
ZGLP1   ICAM4   13.27   0.2425
ICAM4   ZGLP1   13.27   0.2425
RRP1B   CDH24   20.8    1
ZGLP1   OOEP    18.79   0.3060
ZGLP1   RRP1B   39.62   0.2972
ZGLP1   CDH24   51.21   0.2560
BBCDI   DND1    19.44   0.2833
BBCDI   SOHLH2  36.61   0.2909
DND1    SOHLH2  18      0.8

输入文件2：

chr8     18640000   18960000    ZGLP1   RRP1B   CDH24  #gene number here is not fixed can be #4 #5 or more
chr8     19000000   19080000    BBCDI   DND1    SOHLH2 #gene number here is not fixed can be #4 #5 or more

我编写了一个代码，将 file1 的 col1 和 col2 与 file2 的每一行进行比较，这样，如果这对中的任何一个落在 file2 行中的任何位置，则程序应打印“染色体 pos1 pos2 和 file1 的匹配内容有价值观

输出文件：

chr8     18640000   18960000    ZGLP1   RRP1B 39.62 0.2972
chr8     18640000   18960000    ZGLP1 CDH24 51.21   0.2560
chr8     18640000   18960000    RRP1B CDH24 20.8    1
chr8     19000000   19080000    BBCDI   DND1 19.44  0.2833
chr8     19000000   19080000    BBCDI SOHLH2 36.61  0.2909
chr8     19000000   19080000    DND1 SOHLH2 18 0.8

到目前为止，我已经尝试过了，但是由于我的输入文件很大 (2gb)，所以需要很长时间。

我的 perl 代码

open( AB, "file1" ) || die("cannot open");
open( BC, "file2" ) || die("cannot open");
open( OUT, ">output.txt" );

@file = <AB>;

chomp(@file);
@data = <BC>;

chomp(@data);

foreach $fl (@file) {
    if ( $fl =~ /(.*?)\s+(.*?)\s+(.*?)\s+(.*)/ ) {
        $one = $1;
        $two = $2;
        $thr = $3;
        $for = $4;
    }

    foreach $line (@data) {
        if ( $line =~ /(.*?)\s+(.*?)\s+(.*?)\s+(.*)+/ ) {
            $chr  = $1;
            $pos1 = $2;
            $pos2 = $3;
        }

        if ( $line =~ /$one/ ) {
            if ( $line =~ /$two/ ) {
                print OUT $chr, "\t", $pos1, "\t", $pos2, "\t", $fl, "\n";
            }
        }
    }
}

【问题讨论】：

您的空闲内存是否 > 2gb？
目前您的代码非常慢，因为对于file1 的每一行，它都会解析file2 的每一行以检查匹配项。您可以通过只解析每个文件一次来极大地加快速度。我建议您使用其中一个文件中的数据创建索引哈希，然后在解析第二个文件中的数据时参考该哈希值。为什么不先尝试自己解决这个问题，如果解决不了再寻求帮助？
是的，我正在开发具有 >100 GB 物理内存的 16 GB RAM 系统。
基准正则表达式与拆分，使用index() 匹配$one 和$two 字符串。如果您有足够的内存，请在 print OUT 之前进行第二次匹配/拆分并缓存结果。并且不需要将第一个文件读入数组。
@ialarmedalien 我是 perl 的初学者，请您详细说明在这种情况下如何使用索引哈希

标签： regex perl awk

【解决方案1】：

$ cat tst.awk               
NR==FNR {
    if (NR>1)
        file1[$1,$2] = $0
    next
}
{
    for (i=3; i<=NF; i++)
        for (j=3; j<=NF; j++)
            if ( ($i,$j) in file1 )
                print $1, $2, $3, file1[$i,$j]
}
$ 
$ awk -f tst.awk file1 file2
chr8 18640000 18960000 ZGLP1   RRP1B   39.62   0.2972
chr8 18640000 18960000 ZGLP1   CDH24   51.21   0.2560
chr8 18640000 18960000 RRP1B   CDH24   20.8    1
chr8 19000000 19080000 BBCDI   DND1    19.44   0.2833
chr8 19000000 19080000 BBCDI   SOHLH2  36.61   0.2909
chr8 19000000 19080000 DND1    SOHLH2  18      0.8

【讨论】：

非常感谢 awk 解决方案，但这不是我想要的解决方案。 ZGLP1 ICAM4 13.27 0.2425 不会来，因为input file 2 ZGLP1 RRP1B and CDH24 在一起。所以程序必须只检查输出文件中列出的那些对
抱歉不清楚。我的意思是说如果文件 1 的任何基因对出现在文件 2 的同一行中，那么只有我想打印。 chr8 18640000 18960000 ZGLP1 RRP1B CDH24 在同一行中没有 ICAM4。
ya 它的基因对。也很抱歉没有提供正确的输入文件。正如我现在编辑的那样，我的inputfile2 可以在同一行中包含 3 个以上的基因。我在你的代码中注意到了。我猜它被认为是基因数是固定的。

【解决方案2】：

加快代码速度的几种方法：

先读入解析文件1，创建索引：

my %ix;
while (<AB>) {
    # skip the first line (with the column headers)
    next if $. == 1;
    chomp;
    # assuming that the data is tab-separated; if not, you can run split /\s+/
    my @arr = split "\t";
    # create a hash with structure $ix{col1}{col2} = "col3  col4"
    $ix{ $arr[0] }{ $arr[1] } = $arr[2] . "\t" . $arr[3];
}

现在读取文件 2，一次一行，并查找匹配项：

while (<BC>) {
    chomp;
    # initialise a set of variables all at once
    # assumes the data is tab-delimited; if it isn't, use split /\s+/
    my ($chr, $pos1, $pos2, $g1, $g2, $g3) = split "\t";

    # $g1, $g2, and $g3 are the three IDs on the line. This code assumes they will
    # always be in the order that they appear in file 1.
    # look for $g1 in our index. if ( $ix{$g1} ) is shorthand for checking if a
    # variable is defined and is non-zero.
    if ( $ix{$g1} ) {
        # now, for each of $g2 and $g3
        foreach my $g ($g2, $g3) {
            # ... check whether we've got an index entry where it is the second key
            if ( $ix{$g1}{$g} ) {
                # print out the data joined by tabs
                print OUT join("\t", $chr, $pos1, $pos2, $g1, $g, $ix{$g1}{$g}) . "\n";
            }
        }
    }
    # do the same check for $g2 and $g3. We have to check whether $ix{$g2} exists
    # first as if we check $ix{$g2}{$g3} directly and $ix{$g2} DOESN'T exist,
    # Perl will create it. This is known as autovivification.
    if ($ix{$g2} && $ix{$g2}{$g3}) {
        print OUT join("\t", $chr, $pos1, $pos2, $g2, $g3, $ix{$g2}{$g3}) . "\n";
    }
}

【讨论】：