Perl: counting many words in many strings efficiently

【Question title】: Perl: counting many words in many strings efficiently
【Posted】: 2015-05-29 13:41:56
【Question】:

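I often find myself needing to count how many times certain words appear in a number of text strings. When I do this, I want to know how many times each word appears in each text string, individually.

I don't believe my approach is very efficient, and any help you could give me would be great.

Usually I write a loop that (1) pulls in the text from a txt file as a text string, (2) runs a second loop over the words I want to count, using a regular expression to check how many times each word appears and pushing each count onto an array, and (3) prints the array of counts, separated by commas, to a file.

Here is an example: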

#create array that holds the list of words I'm looking to count;
@word_list = qw(word1 word2 word3 word4);

#create array that holds the names of the txt files I want to count;
$data_loc = "/data/txt_files_for_counting/";
opendir(DIR1,"$data_loc")||die "CAN'T OPEN DIRECTORY";
my @file_names=readdir(DIR1);


#create place to save results;
$out_path_name = "/output/my_counts.csv";
open (OUT_FILE, ">>", $out_path_name);

#run the loops;
foreach $file(@file_names){
    if ($file=~/^\./)
        {next;}
    #Pull in text from txt file;
    {
        $P_file = $data_loc."/".$file;
        open (B, "$P_file") or die "can't open the file: $P_file: $!"; 
        $text_of_txt_file = do {local $/; <B>}; 
        close B or die "CANNOT CLOSE $P_file: $!";      
    }

    #preserve the filename so counts are interpretable;
    print OUT_FILE $file;

    foreach $wl_word(@word_list){
        #use regular expression to search for term without any context;
        @finds_p = ();
        @finds_p = $text_of_txt_file =~ m/\b$wl_word\b/g;
        $N_finds = @finds_p;
        print OUT_FILE ",".$N_finds;
    }
    print OUT_FILE ",\n";
}
close(OUT_FILE);

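I find this approach very inefficient (slow) as the number of txt files and the number of words I want to count keep growing.

Is there a more efficient way to do this?

Is there a Perl package that does this?

Would it be more efficient in Python? (For example, is there a Python package that does this?)

Thanks!

EDIT: Note that I don't want to count the total number of words, but rather the occurrences of certain words. So the answers to the question "What's the fastest way to count the number of words in a string in Perl?" don't quite apply, unless I'm missing something.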

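【Question comments】:

Possible duplicate of What's the fastest way to count the number of words in a string in Perl?
Are you matching case, matching exactly, and so on? And what about foo and foo!, are both considered matches for foo?
@PadraicCunningham, usually I have already stripped all punctuation and converted everything to lowercase.
That is two lines in Python. Do you want to count more than one word?
Also, is there one word per line, or how are the words separated?

Tags: python regex perl loops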

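【Solution 1】:

Here is my take on how your code should be written. I'll take a while to explain my choices and then update

Always use strict and use warnings at the top of every Perl program you write. You must also declare every variable with my, as close as possible to its first point of use. This is an essential habit because it reveals many simple mistakes. Both are also mandatory before you ask for help, because without them you will be seen as negligent
Don't comment source code that is self-explanatory. The encouragement to comment everything is a legacy of the 1970s and has become an excuse for writing poor code. Most of the time, well-chosen identifiers and whitespace explain what a program does far better than any comment
You are right to use the three-argument form of open, but you should also use lexical file handles. It is essential to check the result of every open and to call die if the program cannot reasonably continue without access to the file. The die string must include the value of the variable $! to say why the open failed
If your program opens a lot of files, it is often more convenient to use the autodie pragma, which implicitly checks every IO operation for you
You should read perldoc perlstyle to familiarise yourself with the layout that most Perl programmers are used to. Artifacts like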
```
if ($file=~/^\./)
        {next;}
```
should be simply
```
next if $file =~ /^\./;
```

You have picked up the do { local $/; ... } idiom for reading a whole file into memory, but its scope can be tightened further. Your block

{
    $P_file = $data_loc."/".$file;
    open (B, "$P_file") or die "can't open the file: $P_file: $!";
    $text_of_txt_file = do {local $/; <B>}; 
    close B or die "CANNOT CLOSE $P_file: $!";      
}

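is better written as

my $text_of_txt_file = do {
  open my $fh, '<', $P_file or die "can't open the file: $P_file: $!";
  local $/;
  <$fh>;
};

Rather than looping over the word list, it is faster and more concise to build a single regular expression from the word list. My program below shows this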

use strict;
use warnings;
use 5.010;
use autodie;

use constant DATA_LOC    => '/data/txt_files_for_counting/';
use constant OUTPUT_FILE => '/output/my_counts.csv';

my @word_list = qw(word1 word2 word3 word4);
my $word_re   = join '|', map quotemeta, @word_list;
$word_re      = qr/$word_re/;

chdir DATA_LOC;

my @text_files = grep -f, glob '*.*';

my @find_counts;

for my $file ( @text_files ) {

  next if $file =~ /^\./;

  my $text = do {
    open my $in_fh, '<', $file;
    local $/;
    <$in_fh>
  }; 

  my $n_finds = () = $text =~ /\b$word_re\b/g;  # list assignment forces list context, so this is the number of matches
  push @find_counts, $n_finds;
}

open my $out_fh, '>', OUTPUT_FILE;
print $out_fh join(',', @find_counts), "\n";
close $out_fh;
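
The comments on this answer point out that the asker needs one count per word rather than one total per file. Since the question also asks about Python, here is a minimal sketch of how the same single-alternation idea could produce per-word counts there. The function name `count_words` and the sample text are made up for illustration and are not part of the answer.

```python
import re
from collections import Counter

def count_words(text, words):
    """Return the number of whole-word hits for each entry of `words`, in order."""
    # One alternation, compiled once; re.escape plays the role of quotemeta.
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")
    hits = Counter(pattern.findall(text))   # a single pass over the text
    return [hits[w] for w in words]         # words never seen count as 0

words = ["word1", "word2", "word3", "word4"]
row = count_words("word1 word2 word1 xword3", words)
print(",".join(map(str, row)))  # 2,1,0,0
```

The text is scanned once no matter how many words are in the list, and the counts come back in word-list order, so they can be written as one CSV row per file.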

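【Comments】:

You are not counting the occurrences of each word, only the total number of hits for all words in each file. That is not what the example outputs.
@simbabque: Yes, the original prints the file name (with no separator after it) and a list of word counts separated by commas and terminated by a comma and a newline. I didn't really want to reproduce that, and I doubt it is close to what the OP needs
@Borodin I do need a CSV with the count for each word, not just a count of whether any of the words appear in the text string.
@Borodin, thanks for all the advice!
@user1500158: So presumably you want a comma after the file name (which may itself contain commas?) and no comma after the last word count? And is it fine for the counts to appear on their own, with nothing to tie them to the corresponding words?

【Solution 2】:

First off, what you're doing with opendir: I wouldn't, and would suggest glob instead.

Beyond that, there is another useful trick: compile a regular expression for your "words". This helps because a regex that contains a variable has to be recompiled every time it is used, in case the variable has changed. If the regex is static, that is no longer necessary.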

use strict;
use warnings;
use autodie;

my @words = ( "word1", "word2", "word3", "word4", "word5 word6" );
my $words_regex = join( "|", map ( quotemeta, @words  ));
$words_regex = qr/\b($words_regex)\b/;

open( my $output, ">", "/output/my_counts.csv" );

foreach my $file ( glob("/data/txt_files_for_counting/*") ) {
    open( my $input, "<", $file );
    my %count_of;
    while (<$input>) {
        foreach my $match (m/$words_regex/g) {
            $count_of{$match}++;
        }
    }
    print {$output} $file, "\n";
    foreach my $word (@words) {
        print {$output} $word, " => ", $count_of{$word} // 0, "\n"; 
    }
    close ( $input );
}

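With this approach you no longer need to "slurp" the whole file into memory to process it. (That may not be much of an advantage, depending on the size of your files.)

When fed the following data: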

word1
word2
word3 word4 word5 word6 word2 word5 word4
word4 word5 word word 45 sdasdfasf
word5 word6 
sdfasdf
sadf

Output:

word1 => 1
word2 => 2
word3 => 1
word4 => 3
word5 word6 => 2

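I will note, however, that if your regex contains overlapping substrings then this won't work as it stands. It is possible, though; you just need a different regex.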
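
One candidate for such a "different regex" is to wrap the alternation in a lookahead, so that a match consumes no text and the next phrase may start inside the previous one. The answer does not say which regex its author had in mind, so the following is only a sketch of that one option, written in Python; `count_overlapping` is a hypothetical name.

```python
import re
from collections import Counter

def count_overlapping(text, phrases):
    """Count phrase hits, allowing a hit to start inside an earlier hit."""
    alternation = "|".join(map(re.escape, phrases))
    # The lookahead consumes no text, so the scan resumes one character
    # later and can still find a phrase that overlaps the previous one.
    pattern = re.compile(r"(?=\b(" + alternation + r")\b)")
    return Counter(m.group(1) for m in pattern.finditer(text))

counts = count_overlapping("word5 word6 word7", ["word5 word6", "word6 word7"])
print(counts["word5 word6"], counts["word6 word7"])  # 1 1
```

One limitation remains: when two phrases start at the same position (for example "word5" and "word5 word6"), only the alternative listed first is counted there.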

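【Comments】:

@Sobrique, two follow-up questions. This creates a hash in which each key is a word from the word list, right? And will it also handle multi-word phrases in the word list, such as "word5 word6 word7"?
Basically, yes. It will handle phrases too, because it keys on the regex match. (You can't use qw any more, though :)). Bear in mind that, because of the way regular expressions work, this won't handle overlapping phrases as it stands.

【Solution 3】:

If your words are separated by whitespace, use a collections.Counter dict in Python to count all the words: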

from collections import Counter

with open("in.txt") as f:
    counts = Counter(word for line in f for word in line.split())

Then access it by key to get the number of occurrences of any word you want:

print(counts["foo"])
print(counts["bar"])
# ...

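So a single pass over the words in the file gives you the counts for all of them; whether you want to count 1 word or 10000, you only build the dict once. Unlike a normal dict, accessing a word/key that is not in the Counter does not raise a KeyError; it returns 0 instead.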
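
A short illustration of that missing-key behaviour, using a made-up three-word string rather than a file:

```python
from collections import Counter

counts = Counter("foo bar foo".split())
print(counts["foo"])         # 2
print(counts["missing"])     # 0, and no KeyError is raised
print("missing" in counts)   # False: the lookup did not add the key
```

Because the lookup does not insert anything, probing for thousands of absent words leaves the Counter no larger than it was.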

If you only want to keep counts for certain words, store the words you care about in a set and do a lookup for each word:

from collections import Counter
words = {"foo","bar","foobar"}
with open("out.txt") as f:
    counts = Counter(word for line in f for word in line.split() if word in words)

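This stores counts only for the words in words; a set lookup is O(1) on average.

If you want to search for a phrase, you can use sum and in, but you have to do that for each phrase, which means several passes over the file. Note that this counts the lines that contain the phrase, not the total number of occurrences: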

with open("in.txt") as f:
    count = sum("word1 word2 word3"  in line for line in f)
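
If several phrases are needed, the repeated passes over the file can be avoided by looping over the phrases inside the line loop. The sketch below is one way to do that and is not from the answer; `count_phrases` and the sample lines are hypothetical. It uses `str.count`, so it reports occurrences rather than matching lines.

```python
from collections import Counter

def count_phrases(lines, phrases):
    """Tally every phrase in one pass over an iterable of lines."""
    totals = Counter()
    for line in lines:
        for phrase in phrases:
            # str.count returns the number of non-overlapping occurrences,
            # so two hits on one line add 2 rather than 1.
            totals[phrase] += line.count(phrase)
    return totals

sample = ["word1 word2 word3 and word1 word2 word3\n", "word2 word3\n"]
totals = count_phrases(sample, ["word1 word2 word3", "word2 word3"])
print(totals["word1 word2 word3"], totals["word2 word3"])  # 2 3
```

An open file object can be passed as `lines`. This is plain substring matching: it ignores word boundaries and cannot see a phrase that is split across two lines.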

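【Comments】:

Cool! I will definitely look into it.
@user1500158, if you have one word per line you would use counts = Counter(word.rstrip() for word in f)
If I am looking for multi-word phrases (for example, "word1 word2 word3"), this won't work, right?
@user1500158, it is possible, but it depends on the format of the file. If you are looking for phrases then a regex may be more appropriate, and you can compile the regex to speed up the search

【Solution 4】:

Your biggest bottleneck is the speed at which data can be read from the storage medium. By using a small number of parallel processes, your program may be able to read one file while it is processing others, which would speed things up. This is unlikely to yield any benefit unless the files themselves are large.

Keep in mind that overlapping strings are hard. The code below prefers the longest match.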
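
The longest-match preference comes from sorting the alternatives by descending length before joining them, because a regex alternation tries its branches from left to right. The idea can be sketched in Python as follows; `count_longest_first` and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

def count_longest_first(text, phrases):
    """Count phrases, letting the longest alternative win at each position."""
    # Alternation tries its branches left to right, so longer phrases must
    # come first or "Visual" would always beat "Visual Studio".
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile("|".join(map(re.escape, ordered)))
    return Counter(pattern.findall(text))

counts = count_longest_first(
    "Visual Studio sets INCLUDE; Visual alone also appears",
    ["Visual", "INCLUDE", "Visual Studio"],
)
print(counts["Visual Studio"], counts["Visual"], counts["INCLUDE"])  # 1 1 1
```

A "Visual" that is part of "Visual Studio" is absorbed by the longer phrase, while a stand-alone "Visual" is still counted on its own.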

Non-parallel version

#!/usr/bin/env perl

use strict;
use warnings;
use File::Spec::Functions qw( catfile );
use Text::CSV_XS;

die "Need directory and extension\n" unless @ARGV == 2;
my ($data_dir, $ext) = @ARGV;

my $pat = join('|',
    map quotemeta,
    sort { (length($b) <=> length($a)) }
    my @words = (
        'Visual Studio',
        'INCLUDE',
        'Visual',
    )
);

my $csv= Text::CSV_XS->new;

opendir my $dir, $data_dir
    or die "Cannot open directory: '$data_dir': $!";

my %wanted_words;

while (my $file = readdir $dir) {
    next unless $file =~ /[.]\Q$ext\E\z/;
    my $path = catfile($data_dir, $file);
    next unless -f $path;
    open my $fh, '<', $path
        or die "Cannot open '$path': $!";
    my $contents = do { local $/; <$fh> };
    close $fh
        or die "Cannot close '$path': $!";
    while ($contents =~ /($pat)/go) {
        $wanted_words{ $file }{ $1 } += 1;
    }
}

for my $file (sort keys %wanted_words) {
    my $file_counts = $wanted_words{ $file };
    my @fields = ($file, sort keys %$file_counts);
    $csv->combine(@fields)
        or die "Failed to combine [@fields]";
    print $csv->string, "\n";
}

To test, I ran the script in a directory containing some temporary batch files from a Boost installation:

C:\...\Temp> perl count.pl . cmd
b2_msvc_14.0_vcvarsall_amd64.cmd,INCLUDE,"Visual Studio"
b2_msvc_14.0_vcvarsall_x86.cmd,INCLUDE,"Visual Studio"
b2_msvc_14.0_vcvarsall_x86_arm.cmd,INCLUDE,"Visual Studio"

That is, all occurrences of "Visual" are ignored in favor of "Visual Studio".

To generate CSV output, you should use the combine method from Text::CSV_XS rather than join(',' ...).
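
The reason is quoting: a field that itself contains a comma has to be wrapped in quotes, which a plain join does not do. For readers following the Python side of the question, the standard csv module behaves analogously; in this small sketch the file name and the counts are invented for illustration.

```python
import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
# The first field contains a comma, so the writer wraps it in quotes.
writer.writerow(["notes, draft.txt", 3, 0, 7])
print(buffer.getvalue(), end="")  # "notes, draft.txt",3,0,7
```

A plain join on commas would have produced five fields here instead of four.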

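Version using Parallel::ForkManager

Whether this gets anything done faster depends on the size of the input files and the speed of the storage medium. If there is an improvement, the right number of processes is probably between N/2 and N, where N is the number of cores. I did not test this.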

#!/usr/bin/env perl

use strict;
use warnings;
use File::Spec::Functions qw( catfile );
use Parallel::ForkManager;
use Text::CSV_XS;

die "Need number of processes, directory, and extension\n" unless @ARGV == 3;
my ($procs, $data_dir, $ext) = @ARGV;

my $pat = join('|',
    map quotemeta,
    sort { (length($b) <=> length($a)) }
    my @words = (
        'Visual Studio',
        'INCLUDE',
        'Visual',
    )
);

my $csv= Text::CSV_XS->new;

opendir my $dir, $data_dir
    or die "Cannot open directory: '$data_dir': $!";

my $fm = Parallel::ForkManager->new($procs);

ENTRY:
while (my $file = readdir $dir) {
    next unless $file =~ /[.]\Q$ext\E\z/;
    my $path = catfile($data_dir, $file);
    next unless -f $path;
    my $pid = $fm->start and next ENTRY;

    my %wanted_words;
    open my $fh, '<', $path
        or die "Cannot open '$path': $!";
    my $contents = do { local $/; <$fh> };
    close $fh
        or die "Cannot close '$path': $!";
    while ($contents =~ /($pat)/go) {
        $wanted_words{ $1 } += 1;
    }
    my @fields = ($file, sort keys %wanted_words);
    $csv->combine(@fields)
        or die "Failed to combine [@fields]";
    print $csv->string, "\n";
    $fm->finish;
}

$fm->wait_all_children;
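
For comparison, a rough Python counterpart could look like the sketch below. It is an assumption-laden illustration, not a port: it uses threads from concurrent.futures instead of forked processes, which is enough to overlap file reads but keeps the regex matching itself on one core because of the interpreter lock. The names `count_file` and `count_files` and the default of 4 workers are hypothetical, and like the Perl version this has not been benchmarked.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Longest alternative first, mirroring the length-sorted pattern above.
PATTERN = re.compile(r"Visual Studio|INCLUDE|Visual")

def count_file(path):
    """Read one file completely and tally the pattern's matches."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return path, Counter(PATTERN.findall(handle.read()))

def count_files(paths, workers=4):
    """Count several files concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_file, paths))
```

Calling `count_files(list_of_paths, workers=2)` returns a list of `(path, Counter)` pairs. Because the results are collected in the parent and returned in input order, the output lines cannot interleave the way prints from separate child processes might.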

【Comments】:

【Solution 5】:

I would rather use a one-liner:

$ for file in /data/txt_files_for_counting/*; do perl -F'/\W+/' -nale 'BEGIN { @w = qw(word1 word2 word3 word4) } $h{$_}++ for map { $w = lc $_; grep { $_ eq $w } @w } @F; END { print join ",", $ARGV, map { $h{$_} || 0 } @w; }' "$file"; done

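【Comments】:

I wouldn't. This is exactly the kind of thing that gives Perl a bad reputation as a write-only language.
Why would you prefer to write it this way?
For portability. First, I don't want the input file paths and search words to be baked into a script file; they should be passed in from outside. Second, counting words does not seem to be the end goal: the output of this script may be compared against other files.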