Perl: Iterating through a large hash runs out of memory

【Question title】: Perl: Iterating through large hash, runs out of memory
【Posted】: 2015-08-26 18:07:41
【Question】:

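I have been trying to find the values that match between two columns (columns a and b) of a large file, and to print each common value together with its corresponding column d. I have been doing this by iterating through hashes, but because the file is so large there is not enough memory to produce the output file. Is there another way to do the same thing that uses less memory?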

Any help is much appreciated.

The script I have written so far is below:

#!/usr/bin/perl
use warnings;
use strict;

open (FILE1, "<input.txt") || die "$!\n Couldn't open input.txt\n";
open (Output, ">output.txt")||die "Can't Open output.txt ";
my $hash1={};
my $hash2={};

while (<FILE1>) {
    chomp (my $line=$_);
    my ($a, $b, $c, $d) = split (/\t/, $line);

    if ($a) {
        $hash1->{$a}{info1} = "$d"; #original_ID-> YOB
    }
    if ($b) {
        $hash2->{$b}{info2} = "$a"; #original_ID-> sire
    }

    # Loops over every key collected so far for each input line; $key is never used.
    foreach my $key (keys %$hash2) {
        if (exists $hash1->{$a}) {
            my $info1 = $hash1->{$a}->{info1};
            print Output "$a\t$info1\n";
        }
    }
}

close FILE1;
close Output;
print "Done\n";

To clarify, the input file is a large pedigree file. An example is:

1    2   3   1977
2    4   5   1944
3    4   5   1950
4    5   6   1930
5    7   6   1928

An example of the output file would be:

2   1944
4   1950
5   1928

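【Comments】:

Can you provide a small snippet of the input file and the desired output?
perl.goeszen.com/working-with-very-large-hashes.html. As a side note, you can get rid of the extra hash level (info1, info2).
@GeorgiRangelov, I have added examples of the input and output to my original post.
For matching large amounts of data across multiple columns, I would suggest putting the data into an SQLite database and running SQL queries. It will be faster, more flexible and more efficient.
The program does not produce the output shown from the input shown; you should provide consistent information. It is also unclear what you mean by the corresponding column d: column d of the row matched on column a, or column d of the row matched on column b.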

Tags: perl iteration perl-hash

【Solution 1】:

Would the following work for you?

#!/usr/local/bin/perl

use strict;
use warnings;
use DBM::Deep;
use List::MoreUtils qw(uniq);

my @seen;

my $db = DBM::Deep->new(
    file => "foo.db",
    autoflush => 1
);

while (<>) {
    chomp;
    my @fields = split /\s+/;
    $$db{$fields[0]} = $fields[3];
    push @seen, $fields[1];
}

for (uniq @seen) {
    print $_ . " " . $$db{$_} . "\n" if exists $$db{$_};
}
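
If the column-a IDs still fit in memory once the nested info1/info2 level is dropped, as suggested in the comments on the question, a disk-backed hash may not be needed at all. The following is a minimal sketch under that assumption, using only core Perl. The sub name print_common_ids and the variable names are hypothetical, not from the original post. It assumes whitespace-separated columns and a seekable input file, and it takes "corresponding column d" to mean column d of the row whose column a matches.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): writes every column-b ID
# that also occurs in column a, together with that ID's column-d value.
sub print_common_ids {
    my ($in, $out) = @_;

    # Pass 1: build one flat hash, column a => column d (no nested level).
    my %year_of;
    while (my $line = <$in>) {
        chomp $line;
        my ($id, undef, undef, $year) = split ' ', $line;
        $year_of{$id} = $year if defined $id && defined $year;
    }

    # Pass 2: rewind the file and look up each column-b value.
    seek $in, 0, 0 or die "Cannot rewind input: $!";
    while (my $line = <$in>) {
        chomp $line;
        my (undef, $parent) = split ' ', $line;
        next unless defined $parent;

        # delete returns the stored value and removes the entry,
        # so each matching ID is written only once.
        my $year = delete $year_of{$parent};
        print {$out} "$parent\t$year\n" if defined $year;
    }
    return;
}

# Assumed command line: perl script.pl input.txt output.txt
if (@ARGV == 2) {
    open my $in,  '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    open my $out, '>', $ARGV[1] or die "Cannot open $ARGV[1]: $!";
    print_common_ids($in, $out);
    close $out or die "Cannot close $ARGV[1]: $!";
    close $in;
}
```

Reading the file twice avoids keeping a second structure for column b, and the loop does one hash lookup per line instead of rescanning all keys. On the five-row sample from the question, this writes the rows 2/1944, 4/1930 and 5/1928 under the interpretation stated above.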

【Comments】:

FYI, DBM::Deep has a memory leak somewhere. In my experience it does not really reduce memory usage for large files; it only slows things down.