Perl: Automating Hash Population and Key Names

[Question title]: Perl Automating Hash Population and Key Name
[Posted]: 2020-08-27 12:01:22
[Question]:

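I have a number of error strings, and I match them against patterns I already have. If two strings are exactly the same, I want them to count as the same error. If a string matches a pattern but differs in some way from the strings already in the hash, I want to give it the same error name but with a different number appended.

Here is a sample input file: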

there are 5 syntax issues with semicolon
there are 11 syntax issues with semicolon
the file contains 5 formatting issues
there are 1 syntax issues with semicolon
check script for formatting issues
2 syntax issues have been found
the file contains 1 formatting issues
6 syntax issues have been found

use warnings;
use strict;

my %errors;
my $file = "listoferrormessages.txt";

open my $fh, '<', $file or die "Could not open $file: $!";

while (my $line = <$fh>) {

    if ($line =~ /syntax/) {

        if ($line =~ /there are \d+ syntax issues with semicolon/) {
            # if a line matching this format already exists in the hash values, continue
            # if not, create a hash key called syntax_# where # is one more than
            # the last key with this error name
            $errors{errorname} = $line;
        }
        elsif ($line =~ /\d+ syntax issues have been found/) {
            # same as above
            $errors{errorname} = $line;
        }
    }
    elsif ($line =~ /format/) {
        # same as above
    }
}
close $fh;

I would like my hash to look like:

$VAR1 = {
          'syntax_1' => 
                     'there are 5 syntax issues with semicolon',
          'syntax_2' => 
                     '2 syntax issues have been found',
          'format_1' => 
                     'the file contains 5 formatting issues',
          'format_2' => 
                     'check script for formatting issues'
        };

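Any guidance on this would be very helpful. There is a lot more I want to add, but I am confused about how to even start. Can this be done at all?

[Comments on the question]:

Have you considered having a key syntax whose value is an array reference holding all such messages, and likewise a key format with its own array reference, and so on?
@zdim How would I get it to create a new key each time a message matches a different pattern?
But where does all of this come from? Perhaps the process could be organized better so that the messages are sorted out at the source. There are many great modules for logging all kinds of messages, errors, and so on.
I was given a log file containing all the errors. For each key, as I mentioned, if the pattern is different I want it to automatically create a new bucket with the hard-coded name plus a number at the end that is incremented by one. Is that possible?
I posted an answer, but I would like some clarification to make it better. As stated in the answer, it is mainly about what kinds of error messages you expect.

Tags: perl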

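[Solution 1]:

This does what is asked. The remaining question is which error types are possible.

An auxiliary data structure (%seen_error_type) is used to avoid searching through the values on every line to check whether that error type has already been seen; with this hash it is just a lookup.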

use warnings;
use strict;
use feature qw(say);

use Data::Dump qw(dd);  # to show complex data structures

my $file = shift // die "Usage: $0 file\n";  #/
open my $fh, '<', $file  or die "Can't open $file: $!";

my (%error, %seen_error_type, $cnt_syntax, $cnt_format);

LINE:
while (my $line = <$fh>) { 
    chomp $line;

    my $error_type = $line =~ s/[0-9]+/N/r;  # extract error type

    next LINE if exists $seen_error_type{$error_type};
    $seen_error_type{$error_type} = 1;

    if ($line =~ /syntax/) {
        ++$cnt_syntax;
        $error{ "syntax_$cnt_syntax" } = $line;
    }
    elsif ($line =~ /format/) {
        ++$cnt_format;
        $error{ "format_$cnt_format" } = $line;
    }   
    else { }  # specify how to handle unexpected error types
}       
    
dd \%error;

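The error "type" is first built from a line by replacing the number with N. This merely follows the OP's examples, since no rule was given for how to classify these error messages. If that really is all there is, fine, but I would expect more complex and varied criteria for errors.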
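To make the normalization step concrete, here is a minimal sketch that applies the substitution to a few lines from the question's input. It uses the /g flag so that every number on a line is replaced; that flag is an addition for illustration, as the answer's code replaces only the first number.

```perl
use strict;
use warnings;

# Sample messages taken from the question's input file.
my @lines = (
    'there are 5 syntax issues with semicolon',
    'there are 11 syntax issues with semicolon',
    '2 syntax issues have been found',
    'check script for formatting issues',
);

# Replace every run of digits with N. The /r flag returns the new string
# and leaves the original line untouched (needs Perl 5.14 or later).
my @types = map { s/[0-9]+/N/gr } @lines;

print "$lines[$_]\n    => $types[$_]\n" for 0 .. $#lines;
```

The first two lines collapse to the same type, which is why the second one is skipped by the %seen_error_type check. A line without digits is its own type.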

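The key to improving this is to clarify the rules for the expected "error types" (the structure of the error messages).

Unless we have some rule for how to extract a pattern from a line, it makes no sense to simply add unexpected patterns to our bookkeeping hash of error types. Otherwise every possible line of text could end up as its own key, which would defeat the purpose of classifying them.

With the given input file, the above prints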

{
  format_1 => "the file contains 5 formatting issues",
  format_2 => "check script for formatting issues",
  syntax_1 => "there are 5 syntax issues with semicolon",
  syntax_2 => "2 syntax issues have been found",
}

(The Data::Dump module that I use may need to be installed. The core alternative is Data::Dumper.)
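For readers who prefer to stay with core modules, a rough equivalent using Data::Dumper might look like the sketch below. The %error contents are a stand-in for the hash built by the loop, and the two configuration variables are optional settings chosen here only to make the output stable and compact.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # print hash keys in sorted order
$Data::Dumper::Indent   = 1;   # use a more compact indentation style

# Stand-in for the %error hash built by the loop above.
my %error = (
    syntax_1 => 'there are 5 syntax issues with semicolon',
    format_1 => 'the file contains 5 formatting issues',
);

my $dump = Dumper(\%error);
print $dump;
```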

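Another note, raised in the comments: I do not see why a key should be added for each new line, rather than adding each line of an expected error type to an arrayref under a suitable key (syntax, format, etc.).

If there is no specific reason for that, I would rather suggest something like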

my (%error, %seen_error_type);

LINE:
while (my $line = <$fh>) { 
    chomp $line;

    my $error_type = $line =~ s/[0-9]+/N/r;  # extract error type

    next LINE if exists $seen_error_type{$error_type};
    $seen_error_type{$error_type} = 1;

    if ($line =~ /syntax/) {
        push @{$error{syntax}}, $line;
    }   
    elsif ($line =~ /format/) { 
        push @{$error{format}}, $line;
    }
    else { }  # specify how to handle unexpected error types
}

dd \%error;

Now we simply have one array reference for the key syntax and another for the key format.

This prints

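{
  format => [
              "the file contains 5 formatting issues",
              "check script for formatting issues",
            ],
  syntax => [
              "there are 5 syntax issues with semicolon",
              "2 syntax issues have been found",
            ],
}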
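If the numbered keys from the question are still needed at the end, they can be derived from the arrayref form afterwards. The following is a sketch under that assumption; %numbered is a name made up for this example, and %error is a stand-in for the hash built above.

```perl
use strict;
use warnings;

# Stand-in for the hash of array references built above.
my %error = (
    syntax => [
        'there are 5 syntax issues with semicolon',
        '2 syntax issues have been found',
    ],
    format => [
        'the file contains 5 formatting issues',
        'check script for formatting issues',
    ],
);

# Flatten into syntax_1, syntax_2, ... with a separate counter per type.
my %numbered;
for my $type (sort keys %error) {
    my $n = 0;
    $numbered{ $type . '_' . ++$n } = $_ for @{ $error{$type} };
}

print "$_ => $numbered{$_}\n" for sort keys %numbered;
```

Keeping the arrayref form as the working structure and flattening only for output means the numbering never has to be tracked inside the read loop.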

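[Discussion]:

[Solution 2]:

zdim's program works, but the while loop gets messy if your situation is more complex. You can use a better pattern that lets you keep adding matchers. Polar Bear gets close, but there is still extra special knowledge baked into the loop.

Create a table of the various things you want to match. Moving this information out of the loop has several advantages. First, it makes the loop simpler. Second, it is easier to see the parallel structure. Third, it is one step closer to storing the matcher information in configuration. The order in the array is the order in which I want to test them: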

my @matchers = (
    #  label    pattern
    [ 'syntax', qr/syntax/ ],
    [ 'format', qr/format/ ],
    );

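The loop then becomes this. This version of the while works for as many matchers as you care to define. The while has no special knowledge about the input or the matching. Its job is to digest all the data into something you can easily manage later, without losing information: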

use v5.26;
my %hash;

LINE:
while (my $line = <$fh>) {
    chomp $line;

    foreach my $matcher ( @matchers ) {
        next unless $line =~ m/$matcher->[1]/;
        my( $n ) = $line =~ /(\d+)/;
        push $hash{ $matcher->[0] }{$n // 0}->@*, $line;
        next LINE; # or not, if you want to sort into multiple categories
        }
    }

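Now I have a data structure that holds the number from the error message and a list of all the matching lines. This is not necessarily your final data structure, but it lets you get there. Eventually the requirements for the output will change, and if you put too many decisions into the while, you will have to start over. Instead, whatever the output, you end up with this and simply transform it into whatever you want at the end. I certainly want to see how the different lines were categorized, and this way I can quickly spot misclassifications: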

{
  format => {
              "0" => ["check script for formatting issues"],
              "1" => ["the file contains 1 formatting issues"],
              "5" => ["the file contains 5 formatting issues"],
            },
  syntax => {
              1  => ["there are 1 syntax issues with semicolon"],
              2  => ["2 syntax issues have been found"],
              5  => [
                      "there are 5 syntax issues with semicolon",
                      "there are 5 syntax issues with commas",
                    ],
              6  => ["6 syntax issues have been found"],
              11 => ["there are 11 syntax issues with semicolon"],
            },
}

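I would be tempted to go one step further and include the line number with each line. The line number of the last-read filehandle is in the special variable $.: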

push $hash{ $matcher->[0] }{$n // 0}->@*, [$line, $.];

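The data structure now has one more level of structure. I could already tell the order of lines within a label and count, but now I know the order of lines across the whole data structure. If I wanted, I could recreate the input: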

{
  format => {
              "0" => [["check script for formatting issues", 6]],
              "1" => [["the file contains 1 formatting issues", 8]],
              "5" => [["the file contains 5 formatting issues", 4]],
            },
  syntax => {
              1  => [["there are 1 syntax issues with semicolon", 5]],
              2  => [["2 syntax issues have been found", 7]],
              5  => [
                      ["there are 5 syntax issues with semicolon", 1],
                      ["there are 5 syntax issues with commas", 2],
                    ],
              6  => [["6 syntax issues have been found", 9]],
              11 => [["there are 11 syntax issues with semicolon", 3]],
            },
}
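As an illustration of recreating the input, the sketch below walks a structure of this shape, collects every [line, line number] pair, and sorts by line number. The %hash literal is a reduced stand-in for the structure shown above, not the full data.

```perl
use strict;
use warnings;

# Reduced stand-in: label => count => [ [line, line_number], ... ]
my %hash = (
    format => {
        0 => [ [ 'check script for formatting issues',    6 ] ],
        5 => [ [ 'the file contains 5 formatting issues', 4 ] ],
    },
    syntax => {
        5 => [
            [ 'there are 5 syntax issues with semicolon', 1 ],
            [ 'there are 5 syntax issues with commas',    2 ],
        ],
        11 => [ [ 'there are 11 syntax issues with semicolon', 3 ] ],
    },
);

# Collect every [line, line_number] pair across all labels and counts,
# then order them by the recorded line number.
my @pairs = map { @$_ } map { values %$_ } values %hash;
my @input = map { $_->[0] } sort { $a->[1] <=> $b->[1] } @pairs;

print "$_\n" for @input;
```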

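[Discussion]:

[Solution 3]:

Looking at the input data, I see the repeated pattern /\d+ (syntax|formatting) issues/, which gives us a clue about the type of issue we are looking at.

Why not use it to group the issues by type?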

use strict;
use warnings;
use feature 'say';

use Data::Dumper;

my $regex = qr/\d+ (syntax|formatting) issues/;
my $issues;

while( <DATA> ) {
    chomp;
    next unless /$regex/;

    my $type = $1;
    $type = 'format' if $type =~ /formatting/;

    push @{$issues->{$type}}, $_;
}

say Dumper($issues);

__DATA__
there are 5 syntax issues with semicolon
there are 11 syntax issues with semicolon
the file contains 5 formatting issues
there are 1 syntax issues with semicolon
check script for formatting issues
2 syntax issues have been found
the file contains 1 formatting issues
6 syntax issues have been found

Output

$VAR1 = {
          'format' => [
                        'the file contains 5 formatting issues',
                        'the file contains 1 formatting issues'
                      ],
          'syntax' => [
                        'there are 5 syntax issues with semicolon',
                        'there are 11 syntax issues with semicolon',
                        'there are 1 syntax issues with semicolon',
                        '2 syntax issues have been found',
                        '6 syntax issues have been found'
                      ]
        };
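Note that "check script for formatting issues" does not appear in this output, because the pattern requires a number before the issue type. If such lines should be kept as well, one possible variant (an assumption about the desired behavior, not part of the answer) makes the number optional:

```perl
use strict;
use warnings;

# Variant pattern (an assumption, not the answer's regex): the leading
# number is optional, so messages without a count are still classified.
my $regex = qr/(?:\d+ )?(syntax|formatting) issues/;

my $issues;
for my $line (
    'the file contains 5 formatting issues',
    'check script for formatting issues',
    '2 syntax issues have been found',
) {
    next unless $line =~ $regex;

    # Map the captured word to the shorter key name used in the answer.
    my $type = $1 eq 'formatting' ? 'format' : $1;
    push @{ $issues->{$type} }, $line;
}

print "$_: ", scalar @{ $issues->{$_} }, "\n" for sort keys %$issues;
```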

[Discussion]: