Perl one-liner to extract a multi-line pattern: answers

【Question title】: Perl one liner to extract a multi-line pattern
【Posted】: 2012-08-01 08:09:00
【Question description】:

I have a pattern in a file, shown below, which may or may not span multiple lines:

 abcd25
 ef_gh
 ( fg*_h
 hj_b*
 hj ) {

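What I have tried:

perl -nle 'print while m/^\s*(\w+)\s+(\w+?)\s*(([\w-0-9,* \s]))\s {/gm'

I don't know what the flags here mean; all I did was write a regex for the pattern and plug it into the pattern space. It matches fine if the pattern is on a single line: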

abcd25 ef_gh ( fg*_h hj_b* hj ) {

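But it fails in the multi-line case!

I started using Perl yesterday, and I find the syntax very confusing. So, as one of our fellow SO users suggested, I wrote a regex and inserted it into the code he provided.

I hope a Perl monk can help me with this. Alternative solutions are welcome.

Input file: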

 abcd25
 ef_gh
 ( fg*_h
 hj_b*
 hj ) {

 abcd25
 ef_gh
 fg*_h
 hj_b*
 hj ) {

 jhijdsiokdù ()lmolmlxjk;
 abcd25 ef_gh ( fg*_h hj_b* hj ) {

Expected output:

 abcd25
 ef_gh
 ( fg*_h
 hj_b*
 hj ) {
 abcd25 ef_gh ( fg*_h hj_b* hj ) {

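The input file can contain multiple patterns that share the start and end patterns of the required pattern. Thanks in advance for your replies.

【Discussion】:

What is the expected output? In both cases the output is empty...
@pavel Sadly, it is empty! :( I have added the expected output :)
Your main problem is that your requirements are unclear. You should start by specifying the matching criteria exactly...
@pavel I agree with you; I will update the question with specific requirements.

Tags: perl bash sed awk perl-module

【Solution 1】:

The regex doesn't even match the single line. What do you think the double parentheses do?

You probably want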

m/^\s*(\w+)\s+(\w+?)\s*\([\w0-9,*\s]+\)\s{/gm

Update: The specification has changed. The regex has (almost) not, but you have to change the code a bit:

perl -0777 -nle 'print "$1\n" while m/^\s*(\w+\s+\w+?\s*\([\w0-9,*\s]+\)\s{)/gm'

Another update:

Explanation:

The switches are described in perlrun: -0 (zero), -n, -l, -e

The regex can be explained automatically by YAPE::Regex::Explain

perl -MYAPE::Regex::Explain -e 'print YAPE::Regex::Explain->new(qr/^\s*(\w+\s+\w+?\s*\([\w0-9,*\s]+\)\s{)/)->explain'
The regular expression:

(?-imsx:^\s*(\w+\s+\w+?\s*\([\w0-9,*\s]+\)\s{))

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ^                        the beginning of the string
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    \s+                      whitespace (\n, \r, \t, \f, and " ") (1
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    \w+?                     word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the least amount
                             possible))
----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    \(                       '('
----------------------------------------------------------------------
    [\w0-9,*\s]+             any character of: word characters (a-z,
                             A-Z, 0-9, _), '0' to '9', ',', '*',
                             whitespace (\n, \r, \t, \f, and " ") (1
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    \)                       ')'
----------------------------------------------------------------------
    \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
    {                        '{'
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

The /gm switches are explained in perlre

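【Discussion】:

I wasn't sure what the double parentheses do :( I wrote the regex with a simulator ;)
Single-line matching works now, but I'm still stuck on multi-line!
@Geekasaur: The pattern above works for multi-line input too!
@pavel Thanks, it does indeed :) Could you add a short description of the flags used and of what perl does in this case?

【Solution 2】:

Use the Flip-Flop Operator for One-Liners

Perl makes this really easy with the flip-flop operator, which lets you print all the lines between two regular expressions. For example: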

$ perl -ne 'print if /^abcd25/ ... /\bhj \) {/' /tmp/foo
abcd25
ef_gh
( fg*_h
hj_b*
hj ) {

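However, a simple one-liner like this cannot tell apart the matches you want from those you want to reject based on what appears between the delimiting patterns. That requires a more complex approach.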

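More Complicated Comparisons Benefit from Conditional Branching

One-liners aren't always the best choice, and they can get out of hand quickly if the regular expression becomes too complicated. In that case you are better off writing an actual program that can use conditional branching, rather than trying to build an overly clever regular expression match.

One way to do this is to build up a match with a simple pattern, and then reject any match that doesn't also match some other simple pattern. For example: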

#!/usr/bin/perl -nw

# Use flip-flop operator to select matches.
if (/^abcd25/ ... /\bhj \) {/) {
    push @string, $_
};

# Reject multi-line patterns that don't include a particular expression
# between flip-flop delimiters. For example, "( fg" will match, while
# "^fg" won't.
if (/\bhj \) {/) {
    $string = join("", @string);
    undef @string;
    push(@matches, $string) if $string =~ /\( fg/;
};

END {print @matches}

When run against the OP's updated corpus, this correctly yields:

abcd25
ef_gh
( fg*_h
hj_b*
hj ) {
abcd25 ef_gh ( fg*_h hj_b* hj ) {

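【Discussion】:

Yes, but this interferes with other patterns in the file.
@Geekasaur Sorry, but this matches your corpus and your expected output exactly, as currently defined in your question. If you have other and/or additional requirements, please update your question.
gnome: Sorry for not being specific. I will update the question to convey the idea better.
@Geekasaur If you change the start-of-line anchor to a start-of-word anchor, how does perl -ne 'print if /^abcd25/ ... /\bhj \) {/' /tmp/foo not do what you want?
Yes, it does extract the pattern, but it also produces unwanted matches! If you added a short description to your code, I could probably tweak my regex a bit, for example to decide whether the matching end pattern should be at the start of a line.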