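Perl: split on spaces selectively (answers)

【Question title】: Perl split on spaces selectively
【Posted】: 2020-08-01 06:32:38
【Question】:

I'm trying to split a string in Perl on the whitespace between its elements. However, each element may itself contain whitespace (by being double-quoted or enclosed in parentheses).

For example, a string containing: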

for element in hydrogen   helium  "carbon  14"    $(some stuff "here")   FILE

I'd like to get an array like (hydrogen, helium, "carbon 14", "$(some stuff "here")", FILE)

I can handle the for element in bit and get the rest as one string. I tried

@elements = split /(?<=\"[^\"]*\")\s+(?=\"[^\"]*\")/, $list

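Although the regex matches only the whitespace between quoted items (checked on regexr.com), the Perl program gives me Lookbehind longer than 255 not implemented in regex.

Is there a better way to use split on whitespace that takes this into account? What is wrong with my regex?

【Comments】:

Do we need to worry about parenthesized items that contain a ) character? (For example, $(blah fyvg "fhgh)" fyyh))

Tags: perl split whitespace lookbehind

【Answer 1】:

Match a quoted or parenthesized expression first, and alternate that with a run of non-whitespace characters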

my @elems = $string =~ / ( "[^"]+" | \S*\( [^)]+ \)\S* | \S+ ) /gx;

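Tested with your string and a few simple variations.

This assumes that neither delimiter is nested: everything between two consecutive quotes is taken as one element (even if it contains a parenthesized sub-expression), and so is everything inside a pair of parentheses (even if it contains quoted segments). This is inferred from the question.

I allow a run of non-whitespace characters before and after the parentheses, to accommodate the leading $. Tighten that if the only thing that can precede the parentheses really is a single dollar sign.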
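A minimal check of the pattern above against the string from the question. The variable names and the substitution that strips the leading for element in are assumptions made for this sketch, not part of the original answer.

```perl
use strict;
use warnings;

my $line = 'for element in hydrogen   helium  "carbon  14"    $(some stuff "here")   FILE';

# Assumption: the "for <name> in" prefix is removed first, as the asker
# says they can already do.
(my $list = $line) =~ s/^for\s+\w+\s+in\s+//;

# Quoted item, or parenthesized item with optional non-space characters
# around it, or otherwise a plain run of non-space characters.
my @elems = $list =~ / ( "[^"]+" | \S*\( [^)]+ \)\S* | \S+ ) /gx;

print "[$_]\n" for @elems;
```

This prints five elements. The quotes stay on "carbon  14", and $(some stuff "here") comes back as a single element.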

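【Comments】:

【Answer 2】:

In these situations I take a parsing approach. That way you don't have to come up with one regex that does several different things, which matters because the complexity of the string tends to change over time. Although this looks like more code, it's basic Perl and you can tuck it away in a subroutine. I can easily add another token type without disturbing the mechanics of the code or rewriting the pattern. I also used this technique in How do I grab an unknown number of captures from a pattern?: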

use v5.10;

my $string = 'for element in hydrogen   helium  "carbon  14"    $(some stuff "here")   FILE';

# The types of things you can match, going from most specific
# to least specific. Now you only need to describe what each
# individual thing looks like. Each pattern is responsible for
# the capture group $1, which is the thing we'll save.
my @patterns = (
    qr/ ( \$\( .+? \) ) /x,
    qr/ ( " .+? " )     /x,
    qr/ ( \S+ )         /x,
    );

my @tokens;
# The magic is global matching in scalar context,
# using /g. The \G anchor starts matching at the
# last position you matched in the prior match of
# the same string (that's in pos()). Normally that
# position is reset when a match fails, but /c
# prevents that so you can try other patterns. Once
# you match a pattern, save what you matched and
# move on.
#
# The pattern here also takes care of trailing whitespace.
while( pos($string) < length($string) ) {
    foreach my $pattern ( @patterns ) {
        next unless $string =~ m/ \G $pattern \s*/gcx;
        push @tokens, $1;
        last;
        }
    }

use Data::Dumper;
say Dumper( \@tokens );
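The \G and /c behavior described in the comments above can be seen in isolation. This is a small sketch on a made-up string ('ab12' and the variable names are illustrative only); it records pos() after each attempt.

```perl
use strict;
use warnings;

my $text = 'ab12';
my @positions;

# Letters match at the start, so pos() moves to 2.
$text =~ /\G [a-z]+ /gcx;
push @positions, pos($text);

# Letters do not match at position 2. Because of /c, pos() stays at 2.
$text =~ /\G [a-z]+ /gcx;
push @positions, pos($text);

# Digits match from position 2, so pos() moves to 4.
$text =~ /\G \d+ /gcx;
push @positions, pos($text);

print join(',', @positions), "\n";   # 2,2,4

# The same failing match without /c resets pos() to undef.
pos($text) = 2;
$text =~ /\G [a-z]+ /gx;
my $after_failure = pos($text);
print defined $after_failure ? "pos kept\n" : "pos reset\n";   # pos reset
```

One consequence for the loop above: if none of the patterns matches at the current position (for example, when the string starts with whitespace), pos() never advances, so it may be worth trimming leading whitespace first.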

You can do much the same thing with the branch reset operator, because the capture in each alternative is $1:

use v5.10;

my $string = 'for element in hydrogen   helium  "carbon  14"    $(some stuff "here")   FILE';

my @tokens = $string =~ m/
    (?|
        (?: ( \$ \( .+? \) ) ) |
        (?: ( " .+? "      ) ) |
        (?: ( \S+          ) )
    )
    /gx;

use Data::Dumper;
say Dumper( \@tokens );

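These are slightly more complicated than zdim's answer, but more flexible. For example, suppose you decide that you don't want the quotes around "carbon 14". That's a very easy fix because the structure of the regex doesn't change. You only change the subpattern that handles that token: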

    (?|
        (?:   ( \$ \( .+? \) )   ) |
        (?: " ( .+?          ) " ) |
        (?:   ( \S+          )   )
    )

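You may not need that extra flexibility. I usually find that I run into other odd cases in this sort of task, so I start with the flexible solution. Once you've done it a few times it's no big deal.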
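As a rough sketch, the quote-stripping variant can be wrapped in a subroutine. The name tokenize is hypothetical, and the input is assumed to already have the for element in prefix removed.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the tokens of a string, keeping $(...)
# items whole and dropping the double quotes around quoted items.
sub tokenize {
    my ($string) = @_;
    return $string =~ m/
        (?|
            (?:   ( \$ \( .+? \) )   ) |
            (?: " ( .+?          ) " ) |
            (?:   ( \S+          )   )
        )
        /gx;
}

my @tokens = tokenize('hydrogen   helium  "carbon  14"    $(some stuff "here")   FILE');
print "[$_]\n" for @tokens;
```

With this input the third token comes back as carbon  14 without its quotes, while the other four tokens are unchanged.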

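As for your error, you got:

Lookbehind longer than 255 not implemented in regex

Before v5.30, you couldn't have a variable-width lookbehind at all. It is now an experimental feature, but the regex engine has to know ahead of time that the lookbehind can't be longer than 255 characters. Your pattern has (?<=\"[^\"]*\"), and * means zero or more. "More" could be more than 255, so the pattern is illegal.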
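To see the rejection without aborting the whole program, the pattern can be compiled at run time inside eval. This is a sketch only: the variable names are illustrative, and the exact wording of the message depends on the Perl version, so the code just reports whatever the engine says.

```perl
use strict;
use warnings;

# The lookbehind from the question, kept in a string so that it is
# compiled at run time rather than when the script itself is compiled.
my $source = '(?<="[^"]*")\s+(?="[^"]*")';

my $compiled = eval { qr/$source/ };
my $error    = $@;

if ( defined $compiled ) {
    print "pattern compiled\n";
}
else {
    print "pattern rejected: $error";
}
```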

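regexr.com uses PCRE, which used to stand for "Perl Compatible", but the two have diverged enough that something that appears to work there may be fine in other languages and still fail in Perl. That usually isn't a problem, but lookbehind is one of the differences.

【Comments】: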