Multiple multi-line regex matches in Bash: answers

【Question title】: Multiple multi-line regex matches in Bash
【Posted】: 2023-03-28 11:53:01
【Question description】:

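I'm trying to do some fairly simple string parsing in a bash script. Basically, I have a file made up of multiple multi-line fields. Each field is surrounded by a known header and footer.

I'd like to extract each field separately into an array or similar, like this: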

FILE=`cat file`
REGEX="@#@#@#[\s\S]+?@#@#@"

if [[ $FILE =~ $REGEX ]]; then
   echo $BASH_REMATCH
fi

The file:

@#@#@#################################
this is field one
@#@#@#
@#@#@#################################
this is field two
they can be any number of lines
@#@#@#

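I'm fairly sure the problem is that bash doesn't match newlines with ".".

I can match it with "pcregrep -M", but then of course the whole file matches. Can I get one match at a time from pcregrep?

I'm not opposed to using some inline perl or something similar.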
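
On the suspicion above that "." does not match newlines: as far as I know, bash passes the pattern of `[[ =~ ]]` to the system's POSIX extended regex library, where "." can match a newline. The `\s`/`\S` classes and the lazy `+?` quantifier are Perl-style extensions that POSIX ERE does not define, so the pattern's dialect is the likelier culprit. A minimal check, assuming a typical bash build (the variable names are illustrative):

```shell
#!/usr/bin/env bash
# Quick check of how [[ =~ ]] treats a newline between two words.
text=$'one\ntwo'
re='^one.two$'    # "." has to cover the newline for this to match

if [[ $text =~ $re ]]; then
    dot_matches_newline=yes
else
    dot_matches_newline=no
fi
echo "dot matches newline: $dot_matches_newline"
```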

【Discussion】:

【Tags】: regex bash

【Solution 1】:

If you have gawk:

awk 'BEGIN{ RS="@#*#" }
NF{
    gsub("\n"," ") # remove this if you want to retain newlines
    print "-->"$0 
    # put to array
    arr[++d]=$0
} ' file

Output:

$ ./shell.sh
--> this is field one
--> this is field two they can be any number of lines

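【Discussion】:

With a slight modification this does what I want. awk is something I never learned. Thanks!

【Solution 2】:

The TXR language performs whole-document, multi-line matching, binds variables, and (with the -B "dump bindings" option) emits properly escaped shell variable assignments that can be eval-ed. Arrays are supported.

The @ character is special, so it has to be doubled to match literally.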

$ cat fields.txr
@(collect)
@@#@@#@@#################################
@  (collect)
@field
@  (until)
@@#@@#@@#
@  (end)
@  (cat field)@# <- catenate the fields together with a space separator by default
@(end)

$ txr -B fields.txr data
field[0]="this is field one"
field[1]="this is field two they can be any number of lines"

$ eval $(txr -B fields.txr data)
$ echo ${field[0]}
this is field one
$ echo ${field[1]}
this is field two they can be any number of lines
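
A small sketch of consuming such output a little more defensively. Quoting the command substitution passes the assignments to eval unchanged, so words like `field[0]=...` are never subject to word splitting or filename expansion first. Here `emit_bindings` is a hypothetical stand-in that prints the same two assignments shown above, so the sketch runs without TXR installed.

```shell
#!/usr/bin/env bash
# emit_bindings is a hypothetical stand-in for `txr -B fields.txr data`;
# it prints the same assignments as the session above.
emit_bindings() {
    printf '%s\n' 'field[0]="this is field one"' \
                  'field[1]="this is field two they can be any number of lines"'
}

# The quoted substitution hands eval the text as-is, one assignment per line.
eval "$(emit_bindings)"

# Walk the resulting array by index.
for i in "${!field[@]}"; do
    printf '%d: %s\n' "$i" "${field[i]}"
done
```

As with any eval, this is only reasonable when the generating program is trusted to escape its output correctly.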

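The @field syntax matches an entire line. These lines are gathered into a list because @field sits inside a @(collect), and those lists are gathered into a list of lists because it is nested inside another @(collect). The inner @(cat field), however, reduces each inner list to a single string, so we end up with a list of strings.

This is "classic TXR": how it was originally designed and used, sparked by this idea:

Why don't we make here-documents work backwards and parse a chunk of text into variables?

This implicit emission of matched variables (in shell syntax by default) is still supported behavior, even though the language has since become much more powerful, so there is less need to integrate with shell scripts.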

【Discussion】:

【Solution 3】:

I'd build something around awk. Here is a first proof of concept:

awk '
    BEGIN{ f=0; fi="" }
    /^@#@#@#################################$/{ f=1 }
    /^@#@#@#$/{ f=0; print"Field:"fi; fi="" }
    { if(f==2)fi=fi"-"$0; if(f==1)f++ }
' file
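
The proof of concept prints the fields. To get them into a bash array, as the question asks, one option is to read awk's output with mapfile (bash 4 or later). The sketch below is a variation rather than the answer's exact script: it joins the lines of a field with a space instead of "-", and the names `sample`, `fields`, `inside`, `buf` and `sep` are illustrative.

```shell
#!/usr/bin/env bash
# Sample input matching the layout in the question.
sample='@#@#@#################################
this is field one
@#@#@#
@#@#@#################################
this is field two
they can be any number of lines
@#@#@#'

# awk prints one field per line; mapfile stores each line as an array element.
mapfile -t fields < <(awk '
    /^@#@#@##+$/ { inside = 1; buf = ""; sep = ""; next }    # header: start a new field
    /^@#@#@#$/   { if (inside) print buf; inside = 0; next } # footer: emit the field
    inside       { buf = buf sep $0; sep = " " }             # body line: accumulate
' <<<"$sample")

printf '%s\n' "${#fields[@]}"
printf '[%s]\n' "${fields[@]}"
```

Because mapfile splits on newlines, this only works when each field is flattened to a single line first, which is why the body lines are joined with a space.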

【Discussion】:

【Solution 4】:

begin="@#@#@#################################"
end="@#@#@#"
i=0
flag=0

while read -r line
do
    case $line in
        $begin)
            flag=1;;
        $end)
            ((i++))
            flag=0;;
        *)
            if [[ $flag == 1 ]]
            then
                array[i]+="$line"$'\n'    # retain the newline
            fi;;
     esac
done < datafile

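If you want to keep the marker lines in the array elements, move the assignment statement (and its flag test) to the top of the while loop, before the case.

【Discussion】: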
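
For comparison, here is a sketch that stays closer to the question's whole-file approach: it holds the text in one variable and peels off one field per iteration with parameter expansion, which stands in for the non-greedy regex that bash lacks. It assumes every header is followed directly by a newline and that fields are non-empty. The names `text`, `nl`, `field` and `fields` are illustrative, and in a real script `text=$(<datafile)` would replace the inline sample.

```shell
#!/usr/bin/env bash
begin='@#@#@#################################'
end='@#@#@#'
nl=$'\n'

# Inline sample standing in for the contents of the data file.
text="${begin}${nl}this is field one${nl}${end}${nl}${begin}${nl}this is field two${nl}they can be any number of lines${nl}${end}${nl}"

fields=()
while [[ $text == *"$begin$nl"* ]]; do
    text=${text#*"$begin$nl"}            # drop everything through the first header line
    [[ $text == *"$nl$end"* ]] || break  # no closing footer left: stop
    field=${text%%"$nl$end"*}            # keep what precedes the first footer
    fields+=("$field")
    text=${text#*"$nl$end"}              # continue after that footer
done

printf '<%s>\n' "${fields[@]}"
```

Unlike the line-by-line loop, each element here keeps its internal newlines but has no trailing one.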