Perl Regex: match lines that contain multiple words

【Question title】: Perl Regex match lines that contain multiple words
【Posted】: 2011-01-21 09:14:38
【Question】:

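I am trying to develop a reasonably fast full-text search. It will read the index, and ideally it should run the match in a single regex.

So I need a regex that matches a line only if it contains certain words.

For example, for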

my $txt="one two three four five\n".
        "two three four\n".
        "this is just a one two three test\n";

only the first and third lines should match, because the second line does not contain the word "one".

I could loop over every line in a while() or use multiple regexes, but I need a fast solution.

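The example on this page: http://www.regular-expressions.info/completelines.html ("Finding Lines Containing or Not Containing Certain Words") is exactly what I need. However, I cannot get it to work in Perl. I have tried a lot, but it never returns any result.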

my $txt="one two three four five\ntwo three four\nthis is just a one two three test\n";
my @matches=($txt=~/^(?=.*?\bone\b)(?=.*?\btwo\b)(?=.*?\bthree\b).*$/gi);
print join("\n",@matches);

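There is no output.

To summarize: I need a regex that matches lines containing several words and returns those whole lines.

Thanks in advance for your help! I have tried a lot, but I just cannot get it to work.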

【Comments】:

Tags: regex perl

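【Solution 1】:

By default, the ^ and $ metacharacters match only at the start and end of the input ($ also matches just before a final newline). To make them match at the start and end of each line, enable the m (multi-line) flag: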

my $txt="one two three four five\ntwo three four\nthis is just a one two three test\n";
my @matches=($txt=~/^(?=.*?\bone\b)(?=.*?\btwo\b)(?=.*?\bthree\b).*$/gim);
print join("\n",@matches);

This produces:

one two three four five
this is just a one two three test
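If the word list is not known in advance, the same look-ahead pattern can be assembled at run time. The following is only a minimal sketch of that idea: the helper name lines_with_all_words is made up for illustration, and quotemeta is added on the assumption that the search terms are plain text rather than regex fragments.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): returns every line of
# $text that contains all of @words as whole words, case-insensitively.
sub lines_with_all_words {
    my ($text, @words) = @_;

    # One look-ahead per word; quotemeta escapes any regex metacharacters.
    my $lookaheads = join '', map { '(?=.*?\b' . quotemeta($_) . '\b)' } @words;

    # /m lets ^ and $ anchor at every line, /i ignores case.
    my $re = qr/^${lookaheads}.*$/mi;

    # In list context, /g without capture groups returns the whole matches.
    return $text =~ /$re/g;
}

my $txt = "one two three four five\ntwo three four\nthis is just a one two three test\n";
my @matches = lines_with_all_words($txt, qw(one two three));
print "$_\n" for @matches;
```

With the sample text from the question, this should print the first and third lines, the same as the hard-coded pattern above.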

However, if you really want a fast search, a regex with a lot of look-aheads is not the way to go, if you ask me.
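One possible alternative without look-aheads is to split the text into lines and test each line against one small pattern per word, giving up on a line as soon as a word is missing. This is a sketch under assumptions, not a benchmarked recommendation: the helper name grep_lines_with_all_words is hypothetical, and whether it is actually faster depends on the data.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): keeps the lines of
# $text that contain every word in @words as a whole word.
sub grep_lines_with_all_words {
    my ($text, @words) = @_;

    # Compile one pattern per word, once, outside the per-line loop.
    my @patterns = map { qr/\b\Q$_\E\b/i } @words;

    my @hits;
    LINE: for my $line (split /\n/, $text) {
        for my $re (@patterns) {
            next LINE unless $line =~ $re;   # first missing word rejects the line
        }
        push @hits, $line;
    }
    return @hits;
}

my $txt  = "one two three four five\ntwo three four\nthis is just a one two three test\n";
my @hits = grep_lines_with_all_words($txt, qw(one two three));
print "$_\n" for @hits;
```

The early exit means a line is scanned only until the first word that does not occur in it.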

【Comments】:

Ah, you are right. I did not read the question carefully.

【Solution 2】:

Code:

use 5.012;
use Benchmark qw(cmpthese);
use Data::Dump;
use once;    # non-core module; supplies the ONCE { ... } block used below

our $str = <<STR;
one thing
another two
three to go
no war
alone in the dark
war never changes
STR

our @words = qw(one war two);

cmpthese(100000, {
    'regexp with o'             => sub {
        my @m;
        my $words = join '|', @words;
        @m = $str =~ /(?!.*?\b(?:$words)\b)^(.*)$/omg;
        ONCE { say 'regexp with o:'; dd @m }
    },
    'regexp'                    => sub {
        my @m;
        @m = $str =~ /(?!.*?\b(?:@{ [ join '|', @words ] })\b)^(.*)$/mg;
        ONCE { say 'regexp:'; dd @m }
    },
    'while'                     => sub {
        my @m;
        @m = grep $_ !~ /\b(?:@{ [ join '|',@words ] })\b/,(split /\n/,$str);
        ONCE { say 'while:'; dd @m }
    },
    'while with o'              => sub {
        my @m;
        my $words = join '|',@words;
        @m = grep $_ !~ /\b(?:$words)\b/o,(split /\n/,$str);
        ONCE { say 'while with o:'; dd @m }
    }
})

Results:

regexp:
("three to go", "alone in the dark")
regexp with o:
("three to go", "alone in the dark")
while:
("three to go", "alone in the dark")
while with o:
("three to go", "alone in the dark")
                 Rate        regexp regexp with o         while  while with o
regexp        19736/s            --           -2%          -40%          -60%
regexp with o 20133/s            2%            --          -38%          -59%
while         32733/s           66%           63%            --          -33%
while with o  48948/s          148%          143%           50%            --

Conclusion

So the variants labelled "while" (which split the text into lines and grep each one) are faster than the single-regexp variants.
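The "with o" variants rely on the /o flag to avoid recompiling the interpolated pattern. As far as I know, the Perl documentation treats /o as rarely needed in modern Perl, and a precompiled qr// object is the usual way to get the same effect. The sketch below rewrites the grep variant that way. The variable names are mine, and quotemeta is an added assumption that the entries in @words are literal text. Like the benchmarked patterns, it keeps the lines that contain none of the words.

```perl
use strict;
use warnings;

my $str = <<'STR';
one thing
another two
three to go
no war
alone in the dark
war never changes
STR

my @words = qw(one war two);

# Build the alternation and compile it exactly once as a qr// object.
my $alternation = join '|', map { quotemeta($_) } @words;
my $any_word    = qr/\b(?:$alternation)\b/;

# Keep the lines that contain none of the words.
my @none = grep { $_ !~ $any_word } split /\n/, $str;
print "$_\n" for @none;
```

This should select the same two lines as the benchmark output, "three to go" and "alone in the dark".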

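【Comments】:

Hi, thank you! In theory this is right, but your solution does not seem to capture whole lines, only the matching words. When capturing whole lines, the regexp is about 5% faster. Anyway, thank you very much, because I learned a lot from your code!
My solution does capture whole lines; look at my code more carefully. You can see it in the results.
Sorry, you are right. I mixed it up. However, your query does not match the words. Unfortunately I do not really understand the code, otherwise I would try to change it myself. The query should only match items that contain all of the words (e.g. a line "whatever onewhatwar twowhatever"). Could you change that, even though I got it wrong at first? Apart from that it looks very promising: on a real file it is still about 75% faster.
The array @words can also contain phrases, and that works correctly without any change to this code. Send me the code blocks you do not understand in a private message and I will try to help you.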