How to improve Perl script performance? Answers

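【Question title】: How to improve perl script performance?
【Posted】: 2013-02-01 10:03:40
【Question description】:

I am running the ucm2.pl script to scan a huge directory structure (the directory is a network drive mapped to a local drive letter). I have two Perl scripts, ucm1.pl and ucm2.pl. ucm1.pl calls ucm2.pl, and I run ucm2.pl in parallel with different arguments.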

ucm1.pl -

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Parallel::ForkManager;

    #(this will have list of all the input files. eg intfSplit_0....intfSplit_50)
    my $filename = "intfSplitList.txt";
    my $lines;
    my $buffer;
    open(FILE, $filename) or die "Can't open `$filename': $!";
    while (<FILE>) {
        $lines = $.;
    }
    close FILE;
    print "The number of lines in $filename is $lines \n";

    #(it will set the no. of parallel processes)
    my $pm = Parallel::ForkManager->new($lines);

    open (my $fh, '<', "intfSplitList.txt") or die $!;
    while (my $data = <$fh>) {
        chomp $data;

        my $pid = $pm->start and next;

        #(call the ucm2.pl)
        #(input.txt file will have search keyword and $data will have intfSplit_*.txt files)
        system ("perl ucm2.pl -iinput.txt -f$data");

        $pm->finish; # Terminates the child process
    }

ucm2.pl code -

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find;
    use Getopt::Std;

    #getting the input parameters
    getopts('i:f:');

    our($opt_i, $opt_f);
    my $searchKeyword = $opt_i;      #Search keyword file.
    my $intfSplit     = $opt_f;      #split file
    my $path          = "Z:/aims/";  #source directory
    my $searchString;                #search keyword

    open FH, ">log.txt";             #open the log file to write

    print FH "$intfSplit ". "started at ".(localtime)."\n";   #write the log file

    open (FILE,$intfSplit);          #open the split file to read

    while(<FILE>){

        my $intf= $_;                #setting the interface to intf
        chomp($intf);
        my $dir = $path.$intf;
        chomp($dir);
        print "$dir \n";
        open(INP,$searchKeyword);    #open the search keyword file to read

        while (<INP>){

            $searchString =$_;       #setting the search keyword to searchString
            chomp($searchString);
            print "$searchString \n";
            open my $out, ">", "vob$intfSplit.txt" or die $!;   #open the vobintfSplit_* file to write

            #calling subroutine printFile to find and print the path of element
            find(\&printFile,$dir);

            #the subroutine will search for the keyword and print the path if keyword is exist in file.
            sub printFile {
                my $element = $_;

                if(-f $element && $element =~ /\.*$/){

                    open my $in, "<", $element or die $!;
                    while(<$in>) {
                        if (/\Q$searchString\E/) {
                            my $last_update_time = (stat($element))[9];
                            my $timestamp  = localtime($last_update_time);
                            print $out "$File::Find::name". "     $timestamp". "     $searchString\n";
                            last;
                        }
                    }
                }
            }
        }
    }
    print FH "$intfSplit ". "ended at ".(localtime)."\n";   #write the log file

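Everything works, but even a single-keyword search runs for a very long time. Can anyone suggest a better way to improve the performance?

Thanks in advance!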

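【Comments】:

What timing and instrumentation have you done? The first step in improving performance is measuring the current performance.
@MartinSkøtt is exactly right. Before you can do any meaningful optimization, you need to figure out which bits are actually slow. Otherwise you are just guessing, which might work but is unlikely to get you as far as you could go. For parallel problems, the usual things to check are whether you are using the right amount of parallelism (time to start a child versus processing time versus the parallel resources available on the system) and whether the children are competing for the same resource in a way that reduces their parallelism.
You read intfSplitList.txt twice just to count the lines. That is not a performance killer, but it is unnecessary. Also, I believe all your processes overwrite each other's log file, because they all open the same file for writing rather than appending. You should use one log file per process and merge them afterwards. You can do that by adding a timestamp with microseconds to each line, then concatenating all the files and sorting on the timestamp. There is more that could be optimized here, but I agree with the others: figure out what is slow. Take a look at Devel::NYTProf.
I have heard something about binary search in Perl. I am not sure whether we can use it here. Does anyone know whether binary search could be used in this case to improve performance?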
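
The per-process logging suggested in the comments could be sketched as follows. This is only a sketch under assumptions: the `log_<pid>.txt` naming scheme and the helper names `timestamp` and `log_line` are made up for illustration, and only core modules (`Time::HiRes`, `POSIX`) are used.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday);
use POSIX qw(strftime);

# Build a timestamp with microsecond resolution that sorts correctly as text.
sub timestamp {
    my ($sec, $usec) = gettimeofday();
    return strftime('%Y-%m-%d %H:%M:%S', localtime $sec) . sprintf('.%06d', $usec);
}

# Append one line to a log file that belongs to this process only.
# The "log_<pid>.txt" name is an assumption, not part of the original scripts.
sub log_line {
    my ($message) = @_;
    my $pid      = $$;
    my $log_file = "log_$pid.txt";
    open my $log, '>>', $log_file or die "Can't open '$log_file': $!";
    print {$log} timestamp(), " [$pid] $message\n";
    close $log or die "Can't close '$log_file': $!";
    return $log_file;
}

log_line('intfSplit_0.txt started');
log_line('intfSplit_0.txt ended');
```

Because every line starts with a fixed-width timestamp, the per-process files can later be concatenated and sorted as plain text to get one merged log.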

Tags: performance perl parallel-processing

【Solution 1】:

Running multiple instances of Perl adds a lot of unnecessary overhead. Did you look at my answer to your previous question, which suggested changing this?
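
The linked answer is not reproduced here, so the following is only one possible way to avoid starting a second interpreter per worker: do the work in a subroutine inside the forked child instead of calling `system("perl ucm2.pl ...")`. The subroutine `scan_split_file` is a hypothetical stand-in for the body of ucm2.pl, and the file names are sample values. The sketch uses the core `fork` and `wait` functions; on Windows, Perl emulates `fork` with threads, so behavior and cost may differ there.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the body of ucm2.pl; returns an exit status.
sub scan_split_file {
    my ($split_file) = @_;
    # ... the directory scan for $split_file would go here ...
    return 0;
}

my @split_files = ('intfSplit_0.txt', 'intfSplit_1.txt');   # sample names
my %child_for;

for my $split_file (@split_files) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: do the work in this process and exit with its status.
        exit scan_split_file($split_file);
    }
    $child_for{$pid} = $split_file;   # parent remembers which child does what
}

# Parent: wait for every child and count the ones that failed.
my $failures = 0;
while (%child_for) {
    my $pid = wait();
    last if $pid == -1;               # no children left to wait for
    my $split_file = delete $child_for{$pid};
    next unless defined $split_file;
    $failures++ if $? != 0;
}
print "all workers finished, $failures failed\n";
```

The same idea should carry over to Parallel::ForkManager as used in ucm1.pl: call the subroutine between `$pm->start` and `$pm->finish` rather than shelling out.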

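As I mentioned before, you have some unnecessary repetition here: there is no reason to open and process your search keyword file more than once. You could write one subroutine that opens the keyword file and puts the keywords into an array, then pass those keywords to another subroutine that does the searching.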
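
A minimal sketch of that split, assuming one keyword per line in the keyword file. The subroutine names `read_keywords` and `keywords_in_file` are made up for illustration and do not appear in the original scripts.

```perl
use strict;
use warnings;

# Read every non-empty line of the keyword file into a list, once.
sub read_keywords {
    my ($keyword_file) = @_;
    open my $fh, '<', $keyword_file or die "Can't open '$keyword_file': $!";
    chomp(my @keywords = <$fh>);
    close $fh;
    return grep { length } @keywords;
}

# Return the keywords that occur somewhere in the given file.
sub keywords_in_file {
    my ($file, @keywords) = @_;
    my %found;
    open my $in, '<', $file or die "Can't open '$file': $!";
    while (my $line = <$in>) {
        for my $keyword (@keywords) {
            # index() does a literal substring search, no regex needed.
            $found{$keyword} = 1 if index($line, $keyword) >= 0;
        }
        last if keys(%found) == @keywords;   # every keyword already seen
    }
    close $in;
    return sort keys %found;
}
```

With this layout the keyword file is read once at startup, and the resulting list can be passed to the search routine for every file that `find` visits.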

You can also search for multiple keywords faster by searching for all of them at once. Do something like this to get your keywords:

    my @keywords = map {chomp;$_} <$fh>;
    my $regex = "(" . join('|', map {quotemeta} @keywords) . ")";

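Now you have a regex like this: (\Qkeyword1\E|\Qkeyword2\E). You only have to search each file once, and if you want to see which keyword matched, just check the contents of $1. This will not speed up a single-keyword search, but searching for several keywords becomes almost as fast as searching for one.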
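
As a small illustration of how such a combined pattern behaves, the sketch below uses made-up sample keywords and input lines, compiles the alternation once with `qr//`, and reads the matched keyword from `$1`.

```perl
use strict;
use warnings;

my @keywords = ('foo.bar', 'baz');   # sample keywords, for illustration only

# quotemeta escapes regex metacharacters, so 'foo.bar' only matches literally.
my $alternation = join '|', map { quotemeta } @keywords;
my $regex       = qr/($alternation)/;

for my $line ('has foo.bar here', 'fooXbar only', 'ends with baz') {
    if ($line =~ $regex) {
        print "matched '$1' in: $line\n";
    }
    else {
        print "no match in: $line\n";
    }
}
```

Note that a single match reports only one keyword per line. If one keyword is a prefix of another, the alternative listed first wins at a given position, so the order of the keyword list can matter.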

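Ultimately, though, if you are searching a huge directory structure over a network, there may be a limit to how much you can speed things up.

Update: corrected the chomp. Thanks, amon.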

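【Discussion】:

The return value of chomp is the number of characters removed, or something like that. In my experience, map {chomp; $_} <$fh> works better.
@amon, thanks for catching that. I have made this particularly silly mistake many times! (I do wish chomp did what I expect, though. I think in this case it violates Perl's rule for return values: "In general, they do what you want". Who cares how many characters were removed?)
@dan1111 Thanks. I am using the regex here for the multiple-keyword search and it works as expected. But I have one more question: I have heard about binary search in Perl. Can we use it here to get better performance?
@Dhiraj, binary search is not suited to this situation. First, it requires random access, which text files do not allow (in general). Second, binary search requires sorted data. Third, it does not work when you are searching for a substring that could appear anywhere in each line. Your problem inherently requires you to search all of the files.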