【Question title】: Problems with comparing 2 strings in Perl
【Posted】: 2012-02-22 15:57:15
【Question】:

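I want to use Perl to count the number of lines that 2 files have in common.

I have 1 base file, which I use to check whether all of its lines (separated by newline, \n) are present in fileA. What I did was put all the lines from the base file into a base_config hash, and the lines from fileA into a config hash. I want to check, for every key in %config, whether it can also be found among the keys of %base_config. To compare the keys more efficiently, I sorted the keys of %base_config and put them into @sorted_base_config.

However, for some files that contain exactly the same lines in a different order, I cannot get the correct count. For example, the base file contains: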

hello
hi
tired
sleepy

while fileA contains:

hi
tired
sleepy
hello

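I am able to read the values from the files and put them into their respective hashes and arrays. Below is the part of the code that goes wrong: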

$count=0;
while(($key,$value)=each(%config))
{
    foreach (@sorted_base_config) 
    {
        print "config: $config{$key}\n";
                print "\$_: $_\n";
        if($config{$key} eq $_)
        {
            $count++;
        }
    }
}

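Could someone tell me if I have made a mistake? The count should be 4, but it keeps printing 2.

Edit: This is my original code, which does not work. It looks quite different because I tried a different approach to the problem. However, I still run into the same problem.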

#open base config file and load them into the base_config hash
open BASE_CONFIG_FILE, "< script/base.txt" or die;
my %base_config;
while (my $line=<BASE_CONFIG_FILE>) {
   (my $word1,my $word2) = split /\n/, $line;
   $base_config{$word1} = $word1;
}
#sort BASE_CONFIG_FILE
@sorted_base_config = sort keys %base_config;

#open config file and load them into the config hash
open CONFIG_FILE, "< script/hello.txt" or die;
my %config;
while (my $line=<CONFIG_FILE>) {
   (my $word1,my $word2) = split /\n/, $line;
   $config{$word1} = $word1;
}
#sort CONFIG_FILE
@sorted_config = sort keys %config;

%common={};
$count=0;
while(($key,$value)=each(%config))
{
    $num=keys(%base_config);
    $num--;#to get the correct index
    #print "$num\n";
    while($num>=0)
    {
        #check if all the strings in BASE_CONFIG_FILE can be found in CONFIG_FILE
        $common{$value}=$value if exists $base_config{$key};
        #print "yes!\n" if exists $base_config{$key};
        $num--;
    }
}
print "count: $count\n";

while(($key,$value)=each(%common))
{
    print "key: ".$key."\n";
    print "value: ".$value."\n";
}
$num=keys(%common)-1;
print "common lines: ".$num;

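Previously, I pushed the keys common to the base_config file and fileA into %common. Later I want to print the common keys to a txt file, and anything found in fileA but not in the base_config file will be written to another txt file. However, I am already stuck at the initial stage of finding the common keys.

I am using a split on "\n" to get the keys I store, so I can't use the chomp function, which would remove the "\n".

Edit 2: I just realized what was wrong with my code. I needed to add "\n" at the end of my txt file for it to work. Thanks for your help! :D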

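【Question discussion】:

What do you think $word1 and $word2 will be set to? If you split /\n/ on a single line, that only does what chomp does, unless you have changed $/.
I thought $word1 would contain the string without the \n and $word2 would contain the \n, because I split on \n.
That is wrong. Use the Data::Dumper module to print the values and you will see that split returns only one value; the other value is the empty string "". split strips the value it splits on. That makes your split statement pointless.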
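
The point made in these comments can be checked with a few lines. This is only a minimal sketch using a made-up input line; it shows that split on /\n/ hands back a single field for a line read from a file. Strictly speaking the second variable ends up undefined, which prints like an empty string.

```perl
use strict;
use warnings;

# A line as it would be read from a file, trailing newline included.
my $line = "hello\n";

# split removes the separator and drops trailing empty fields,
# so only one field comes back here.
my @fields = split /\n/, $line;
my ($word1, $word2) = @fields;

printf "fields: %d\n", scalar @fields;
print  "word1: <<$word1>>\n";
print  "word2 is ", (defined $word2 ? "defined" : "undefined"), "\n";

# chomp produces the same string as $word1 with less ceremony.
chomp(my $chomped = $line);
print "chomped: <<$chomped>>\n";
```

So the split contributes nothing beyond what chomp already does on a line-by-line read.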

Tags: string perl file

【Solution 1】:

I think your attempts at efficiency are actually slowing things down.

my %listA;

# Read first file (name in $NameA)
{
    open my $fileA, '<', "$NameA" or die $!;
    while (<$fileA>)
    {
        chomp;
        $listA{$_}++;
    }
}

# Read second file (name in $NameB)
{
    open my $fileB, '<', "$NameB" or die $!;
    while (<$fileB>)
    {
        chomp;
        if ($listA{$_})
        {
            print "Line appears in $NameB once and $listA{$_} times in $NameA: $_\n";
        }
    }
}

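If you also want to read the second file into a hash, then this works too:

Now, if a particular line appears in both files, it will be listed. Note that even though I present the keys in sorted order, I use a hash lookup, because that is faster than shuffling through two sorted arrays. Of course, you will be hard pressed to measure any difference on 4-line files. For big files, the I/O time spent reading the files and printing the results will probably dominate the lookup time.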

my %listB;

# Read second file (name in $NameB)
{
    open my $fileB, '<', "$NameB" or die $!;
    while (<$fileB>)
    {
        chomp;
        $listB{$_}++;
    }
}

foreach my $key (sort keys %listA)
{
    if ($listB{$key})
    {
        print "$NameA: $listA{$key}; $NameB: $listB{$key}; $key\n";
    }
}

Reorganize the output as needed.

~~Untested code!~~ Now tested code; see below.

Converted to tested code

Data: fileA

hello
hi
tired
sleepy

Data: fileB

hi
tired
sleepy
hello

Program: ppp.pl

#!/usr/bin/env perl
use strict;
use warnings;

my $NameA = "fileA";
my $NameB = "fileB";

my %listA;

# Read first file (name in $NameA)
{
    open my $fileA, '<', "$NameA" or die "Failed to open $NameA: $!\n";
    while (<$fileA>)
    {
        chomp;
        $listA{$_}++;
    }
}

# Read second file (name in $NameB)
{
    open my $fileB, '<', "$NameB" or die "Failed to open $NameB: $!\n";
    while (<$fileB>)
    {
        chomp;
        if ($listA{$_})
        {
            print "Line appears in $NameB once and $listA{$_} times in $NameA: $_\n";
        }
    }
}

Output

$ perl ppp.pl
Line appears in fileB once and 1 times in fileA: hi
Line appears in fileB once and 1 times in fileA: tired
Line appears in fileB once and 1 times in fileA: sleepy
Line appears in fileB once and 1 times in fileA: hello
$

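Note that this lists the lines in the order of fileB, as it should, given that the loop reads fileB and checks each line in turn.

Code: qqq.pl

This is the second fragment turned into a complete working program.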

#!/usr/bin/env perl
use strict;
use warnings;

my $NameA = "fileA";
my $NameB = "fileB";

my %listA;

# Read first file (name in $NameA)
{
    open my $fileA, '<', "$NameA" or die "Failed to open $NameA: $!\n";
    while (<$fileA>)
    {
        chomp;
        $listA{$_}++;
    }
}

my %listB;

# Read second file (name in $NameB)
{
    open my $fileB, '<', "$NameB" or die "Failed to open $NameB: $!\n";
    while (<$fileB>)
    {
        chomp;
        $listB{$_}++;
    }
}

foreach my $key (sort keys %listA)
{
    if ($listB{$key})
    {
        print "$NameA: $listA{$key}; $NameB: $listB{$key}; $key\n";
    }
}

Output:

$ perl qqq.pl
fileA: 1; fileB: 1; hello
fileA: 1; fileB: 1; hi
fileA: 1; fileB: 1; sleepy
fileA: 1; fileB: 1; tired
$

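Note that the keys are listed in sorted order, not in the order of fileA or fileB.

Minor miracles occasionally happen! Apart from adding the 5 lines of preamble (shebang, 2 x use, 2 x my), the code for both program fragments was correct at my first attempt, as far as I can tell. (Oh, and I improved the error messages for failing to open a file, so they at least identify which file could not be opened. ikegami edited my code (thanks!) to add the chomp calls consistently, and to add newlines to the print operations, which now need an explicit newline.)

I wouldn't claim this is great Perl code; it certainly won't win a (code) golf competition. It does seem to work, though.

Analysis of the code in the question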

open BASE_CONFIG_FILE, "< script/base.txt" or die;
my %base_config;
while (my $line=<BASE_CONFIG_FILE>) {
   (my $word1,my $word2) = split /\n/, $line;
   $base_config{$word1} = $word1;
}

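The split is odd... You have a line that ends with a newline, and you split at the newline, so $word2 is empty and $word1 contains the rest of the line. You then store the value $word1 (not $word2, as I thought at first glance) in the base config. So the key and the value are the same for each entry. Unusual. Not actually wrong, but... unusual. The second loop is essentially the same (we should both be shot for not using a single sub to do the reading for us).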
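
The "single sub" remark can be sketched as follows. The helper name read_lines and the temporary demo files are assumptions made for illustration, not code from the original posts; the sub simply reads one file into a hash of chomped lines so that both files share the same reading logic.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# read_lines is a hypothetical helper name. It returns a hash reference
# mapping each chomped line of the file to the number of times it occurs.
sub read_lines {
    my ($name) = @_;
    open my $fh, '<', $name or die "Failed to open $name: $!\n";
    my %seen;
    while (my $line = <$fh>) {
        chomp $line;
        $seen{$line}++;
    }
    close $fh;
    return \%seen;
}

# Demo data written to temporary files so the sketch runs on its own.
my ($fh_a, $name_a) = tempfile(UNLINK => 1);
print {$fh_a} "hello\nhi\ntired\nsleepy\n";
close $fh_a;
my ($fh_b, $name_b) = tempfile(UNLINK => 1);
print {$fh_b} "hi\ntired\nsleepy\nhello\n";
close $fh_b;

my $base   = read_lines($name_a);
my $config = read_lines($name_b);

# Keep the lines of the second file that also occur in the first.
my @common = grep { exists $base->{$_} } sort keys %$config;
print "common lines: ", scalar(@common), "\n";
```

With one reader, any decision about chomping is made in exactly one place, so the two hashes cannot drift apart.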

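You cannot be using use strict; and use warnings;. Note that about the first thing I did to the code was add them. I've only been programming in Perl for about 20 years, and I know I don't know enough to risk running code without them. Your sorted arrays, %common, $count, $num, $key and $value are not my'd. It probably doesn't do much harm this time, but... it is a bad sign. Always, but always, use use strict; use warnings; until you know enough about Perl not to need to ask questions about it (and don't expect that to be any time soon).

When I ran it with: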

my %common={};  # line 32 - I added diagnostic printing
my $count=0;

Perl told me:

Reference found where even-sized list expected at rrr.pl line 32, <CONFIG_FILE> line 4.

Oops: those {} should be an empty list (). See why you run with warnings enabled!
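
The difference between {} and () in a hash assignment can be seen in isolation. This is a small sketch with made-up variable names (%wrong, %right); it captures the warning in an array instead of printing it straight away.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# {} is a reference to an anonymous hash, i.e. a single scalar. Assigning
# it to a hash turns that scalar into a key whose value is undef.
my %wrong = {};

# () is the empty list, which is what an empty hash is initialised from.
my %right = ();

print "keys in wrong hash: ", scalar(keys %wrong), "\n";
print "keys in right hash: ", scalar(keys %right), "\n";
print "warning: $_" for @warnings;
```

The stray key in the first hash is the stringified reference, which is where a key such as HASH(0x...) with an undefined value comes from.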

Then, at

 50 while(my($key,$value)=each(%common))
 51 {
 52     print "key: ".$key."\n";
 53     print "value: ".$value."\n";
 54 }

Perl told me:

key: HASH(0x100827720)
Use of uninitialized value $value in concatenation (.) or string at rrr.pl line 53, <CONFIG_FILE> line 4.

That is the first entry in %common throwing things for a loop.

Fixed code: `rrr.pl`

#!/usr/bin/env perl
use strict;
use warnings;

#open base config file and load them into the base_config hash
open BASE_CONFIG_FILE, "< fileA" or die;
my %base_config;
while (my $line=<BASE_CONFIG_FILE>) {
   (my $word1,my $word2) = split /\n/, $line;
   $base_config{$word1} = $word1;
   print "w1 = <<$word1>>; w2 = <<$word2>>\n";
}

{ print "First file:\n"; foreach my $key (sort keys %base_config) { print "$key => $base_config{$key}\n"; } }

#sort BASE_CONFIG_FILE
my @sorted_base_config = sort keys %base_config;

#open config file and load them into the config hash
open CONFIG_FILE, "< fileB" or die;
my %config;
while (my $line=<CONFIG_FILE>) {
   (my $word1,my $word2) = split /\n/, $line;
   $config{$word1} = $word1;
   print "w1 = <<$word1>>; w2 = <<$word2>>\n";
}
#sort CONFIG_FILE
my @sorted_config = sort keys %config;

{ print "Second file:\n"; foreach my $key (sort keys %config) { print "$key => $config{$key}\n"; } }

my %common=();
my $count=0;
while(my($key,$value)=each(%config))
{
    print "Loop: $key = $value\n";
    my $num=keys(%base_config);
    $num--;#to get the correct index
    #print "$num\n";
    while($num>=0)
    {
        #check if all the strings in BASE_CONFIG_FILE can be found in CONFIG_FILE
        $common{$value}=$value if exists $base_config{$key};
        #print "yes!\n" if exists $base_config{$key};
        $num--;
    }
}
print "count: $count\n";

while(my($key,$value)=each(%common))
{
    print "key: $key -- value: $value\n";
}
my $num=keys(%common);
print "common lines: $num\n";

Output:

$ perl rrr.pl
w1 = <<hello>>; w2 = <<>>
w1 = <<hi>>; w2 = <<>>
w1 = <<tired>>; w2 = <<>>
w1 = <<sleepy>>; w2 = <<>>
First file:
hello => hello
hi => hi
sleepy => sleepy
tired => tired
w1 = <<hi>>; w2 = <<>>
w1 = <<tired>>; w2 = <<>>
w1 = <<sleepy>>; w2 = <<>>
w1 = <<hello>>; w2 = <<>>
Second file:
hello => hello
hi => hi
sleepy => sleepy
tired => tired
Loop: hi = hi
Loop: hello = hello
Loop: tired = tired
Loop: sleepy = sleepy
count: 0
key: hi -- value: hi
key: tired -- value: tired
key: hello -- value: hello
key: sleepy -- value: sleepy
common lines: 4
$

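【Discussion】:

Hi! Thanks for the quick response. However, the count is still 2, and it prints the same result as mine (i.e. it shows "hi" and "tired").
I see you are not chomping the lines you read, so "sleepy" gets compared with "sleepy\n".
@Unos: see the update; it seems to work for me on the data files shown (which are your example). I've put in complete programs, though as I noted in the addendum, they are the original code fragments pasted together with a fixed 5-line header. They work as I expected (listing all four lines/words). Note that you need to be consistent about chomping or not chomping the input. I consistently did not chomp.
@Sakura, your file doesn't end with a newline (and it really should). You can work around that by using chomp on each line as you read it. I've adjusted Jonathan's code.
Hi. For some unknown reason, after I copied ikegami's code and Jonathan's code, it still gives me the same error. Could it be because of my txt file? Currently I have "hello\nhi\ntired\nsleepy", where each "\n" represents a new line. I don't have a "\n" at the end of the document.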
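
The effect of a missing final newline can be tried without touching any real files. This is a minimal sketch that reads from in-memory strings; the helper name keys_of is hypothetical. It suggests that, once every line is chomped, a file with and a file without a trailing newline yield the same keys.

```perl
use strict;
use warnings;

# Two in-memory "files": one ends with a newline, the other does not.
my $with_newline    = "hello\nhi\ntired\nsleepy\n";
my $without_newline = "hello\nhi\ntired\nsleepy";

# keys_of is a hypothetical helper: read line by line, chomp each line,
# and return the sorted distinct lines.
sub keys_of {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!\n";
    my %seen;
    while (my $line = <$fh>) {
        chomp $line;            # removes "\n" only if it is present
        $seen{$line} = 1;
    }
    close $fh;
    return sort keys %seen;
}

my @first  = keys_of($with_newline);
my @second = keys_of($without_newline);
print "same keys\n" if "@first" eq "@second";
```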

【Solution 2】:

Maybe this isn't the approach you're looking for, but what if you did something more like this:

#!/usr/bin/perl
use Data::Dumper;
use warnings;
use strict;

my @sorted_base_config = qw(hello hi tired sleepy);
my @file_a = qw(hi tired sleepy hello);
my @found_in_both = ();

foreach my $word (@sorted_base_config) {
  # grep sets $_ to each element of @file_a, so compare against $word
  if (grep { $_ eq $word } @file_a) {
    push(@found_in_both, $word);
  }
}

print "These items were found in file_a:\n";
print Dumper(\@found_in_both);

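Basically, skip the key/value hash business... why not try two arrays and use foreach over the base-file array? As you go through each line of @sorted_base_config, you check whether that string can be found in @file_a.

How you get the files into the @sorted_base_config and @file_a arrays (and how you deal with newlines or line feeds) is up to you. But at least this way, it seems to check for matching words more accurately.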
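
One detail to watch in this style: grep sets $_ to each element of the list it scans, so the value from the outer loop needs a name of its own. The following is a minimal sketch with made-up data in which one word is deliberately missing, so that a check that always succeeds would be noticed.

```perl
use strict;
use warnings;

my @base   = qw(hello hi tired sleepy);
my @file_a = qw(hi tired hello);          # "sleepy" deliberately missing

my @found_in_both;
for my $word (@base) {
    # Inside grep, $_ is each element of @file_a in turn, so the word
    # being looked for is kept in its own variable ($word).
    # eq also avoids treating the word as a regular expression.
    if (grep { $_ eq $word } @file_a) {
        push @found_in_both, $word;
    }
}

print "found in both: @found_in_both\n";
```

Using eq rather than a pattern match also means "hi" is not reported as found merely because it is a substring of some longer line.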

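【Discussion】:

Would it be fair to comment that, if one works alphabetically, one gets sleepy before one gets tired, so the sorted base config isn't actually sorted? Of course, with grep, it doesn't matter whether it is sorted.
How the files get put into the arrays (and sorted) is still up to @Sakura... as stated in my last paragraph.

【Solution 3】:

Without seeing how you define and populate the %config and @sorted_base_config variables, I can't be sure what is making your code fail. It would be more obvious if you provided the output from running the code above.

Rather than offer a completely new approach as the other answers do, I tried to "fix" yours, but it ran without problems for me. That means the error actually lies in how you populate the variables, not in how you check them.

To match your code simply, I set both the key and the value to what was read from the file.

This code: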

#!C:\Perl\bin\perl
use strict;
use warnings;

my $f1 = $ARGV[0];
my $f2 = $ARGV[1];
my %config_base;
my %config;
my $line;
print "F1 = $f1\nF2 = $f2\n";

open F1, '<', $f1 or die "Cannot open $f1: $!\n";
while ($line = <F1>) {
chomp $line;
print "adding $line\n";
$config_base{$line}=$line;
}
close F1;
open F2, '<', $f2 or die "Cannot open $f2: $!\n";
while ($line = <F2>) {
chomp $line;
print "adding $line\n";
$config{$line}=$line;
}
close F2;
my $count=0;
my $key; my $value;
my @sorted_base_config = sort keys %config_base;
while(($key,$value)=each(%config))
{
    foreach (@sorted_base_config) 
    {
        print "config: $config{$key}\n";
                print "\$_: $_\n";
        if($config{$key} eq $_)
        {
            $count++;
        }
    }
}
print "Count = $count\n";

Output:

F1 = config_base.txt
F2 = config.txt
adding hello
adding hi
adding tired
adding sleepy
adding hi
adding tired
adding sleepy
adding hello
config: hi
$_: hello
config: hi
$_: hi
config: hi
$_: sleepy
config: hi
$_: tired
config: hello
$_: hello
config: hello
$_: hi
config: hello
$_: sleepy
config: hello
$_: tired
config: tired
$_: hello
config: tired
$_: hi
config: tired
$_: sleepy
config: tired
$_: tired
config: sleepy
$_: hello
config: sleepy
$_: hi
config: sleepy
$_: sleepy
config: sleepy
$_: tired
Count = 4

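However, Jonathan's answer is better than the approach you started with. At the very least, using exists to compare the keys of the 2 input hashes is far better than nested loops against an array of keys. That loop destroys the efficiency of using hashes in the first place.

In that case, you'd end up with something like: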

foreach my $key (keys %config_base)
{
    print "key: $key\n";
    if (exists $config{$key})
    {
        # only print the value once we know the key is present
        print "config: $config{$key}\n";
        $count++;
    }
}
print "Count = $count\n";
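
The same count can be written even more compactly. This is only a sketch under the assumption that both hashes are already populated; the hash contents below (including the extra key) are made up for the demonstration.

```perl
use strict;
use warnings;

my %config_base = map { $_ => $_ } qw(hello hi tired sleepy);
my %config      = map { $_ => $_ } qw(hi tired sleepy hello extra);

# One hash lookup per key instead of an inner loop over a sorted array.
# grep in scalar context returns the number of keys that passed the test.
my $count = grep { exists $config{$_} } keys %config_base;

print "Count = $count\n";
```

Each key costs one lookup, so the work grows with the number of keys rather than with the product of the two sizes.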

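【Discussion】:

After seeing your code, I thought it would be better to show my original code, which has the same error. I am using a split on "\n" to get the keys I store, so I can't use the chomp function, which would remove the "\n".

【Solution 4】:

Use List::Compare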
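
For readers who do not have that module installed, the two lists the question asks for (lines common to both files, and lines found only in fileA) can be sketched in core Perl. This is not List::Compare's own API, just the same idea written by hand, and the data, including the extra word "yawn", is made up.

```perl
use strict;
use warnings;

my @base   = qw(hello hi tired sleepy);
my @file_a = qw(hi tired sleepy hello yawn);   # "yawn" is made-up extra data

# Index the base lines once so each membership test is a hash lookup.
my %in_base = map { $_ => 1 } @base;

my @common      = grep {  $in_base{$_} } @file_a;   # present in both lists
my @only_file_a = grep { !$in_base{$_} } @file_a;   # present in fileA only

print "common: @common\n";
print "only in fileA: @only_file_a\n";
```

Each list could then be printed to its own output file, which is the next step the question describes.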

【Discussion】:
