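Answers to: bash text parsing with multiple conditions nested

【Question title】: bash text parsing with multiple conditions nested
【Posted】: 2020-02-03 10:29:20
【Question description】:

I have the following code, which checks for lines with more than 10 words and splits them at the first occurrence of a comma. It repeats the process, so any newly split lines that still have more than 10 words and a comma are split as well (in the end, no line has both more than 10 words and a comma).

How can I edit this code to do the following: after all the comma splitting is done (which the current code already does), check whether the resulting lines have more than 10 words and split them at the first occurrence of " and " (with spaces)?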

#!/usr/bin/env bash

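input=input.txt
temp=$(mktemp "${input}.XXXX")
# remove the temporary file when the script exits
trap 'rm -f "$temp"' 0

# repeat until a pass makes no split (awk then exits with status 1)
while awk '
  BEGIN { retval=1 }
  NF >= 10 && /, / {
    sub(/, /, ","ORS)
    retval=0
  }
  1
  END { exit retval }
' "$input" > "$temp"; do
  mv -v "$temp" "$input"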
done
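
To see why the loop above stops, it may help to run a single pass by hand. The sketch below wraps the same awk program in a function that reads standard input instead of a file; the name `one_pass` is only for illustration and is not part of the question. Note that the condition is `NF >= 10`, so a line with exactly 10 words is split too, which is slightly broader than "more than 10 words".

```shell
#!/usr/bin/env bash

# Hypothetical helper: one pass of the awk program from the question.
one_pass() {
    awk '
      BEGIN { retval = 1 }        # assume that nothing will be split
      NF >= 10 && /, / {
        sub(/, /, "," ORS)        # break the line after the first ", "
        retval = 0                # report that a split happened
      }
      1                           # print the record, modified or not
      END { exit retval }
    '
}

# Exit status 0 means "something was split, run another pass".
if printf '%s\n' 'w1 w2 w3 w4, w5 w6 w7 w8 w9 w10' | one_pass; then
    echo "a split happened, another pass is needed"
else
    echo "nothing left to split"
fi
```

The `while` loop in the question keeps calling awk for as long as that status is 0, and moves the temporary file over the input each time.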

Sample input:

Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9

Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9 Word10 Word11

Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9 Word10, Word11 Word12 Word13 Word14 Word15 Word16 

Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9 Word10 Word11 and Word12 Word13 Word14 Word15 

Word1 Word2 Word3 Word4 and Word5

Desired output:

Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9

Word1 Word2 Word3 Word4, 
Word5 Word6 Word7 Word8 Word9 Word10 Word11

Word1 Word2 Word3 Word4,
 Word5 Word6 Word7 Word8 Word9 Word10,
 Word11 Word12 Word13 Word14 Word15 Word16 

Word1 Word2 Word3 Word4, 
Word5 Word6 Word7 Word8 Word9 Word10 Word11 and
 Word12 Word13 Word14 Word15 

Word1 Word2 Word3 Word4 and Word5

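Thank you in advance!

【Question comments】:

Please show the input data and the expected output.
Yuji, I have edited the question to show samples of the input and output data. Thanks.

Tags: bash parsing text nested multiple-conditions

【Solution 1】:

Would you please try the following: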

awk '{
    while (split($0, a, "( +and +)|( +)") > 10 && match($0, "( +and +)|,")) {
        if (match($0, "[^,]+,")) {
            # puts a newline after the 1st comma
            print substr($0, 1, RLENGTH)
            $0 = substr($0, RLENGTH + 1)
        } else {
            # puts a newline before the 1st substring " and "
            n = split($0, a, " +and +")
            if (a[1] == "") {               # $0 starts with " and "
                a[1] = " and " a[2]
                for (i = 2; i < n; i++) {
                    a[i] = a[i+1]
                }
                n--
            }
            print a[1]
            $0 = " and " a[2]
            for (i = 3; i <= n; i++) {      # there are two or more " and "
                $0 = $0 " and " a[i]
            }
        }
    }
    print
}' input.txt

Output for the given input:

Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9

Word1 Word2 Word3 Word4,
 Word5 Word6 Word7 Word8 Word9 Word10 Word11

Word1 Word2 Word3 Word4,
 Word5 Word6 Word7 Word8 Word9 Word10,
 Word11 Word12 Word13 Word14 Word15 Word16

Word1 Word2 Word3 Word4,
 Word5 Word6 Word7 Word8 Word9 Word10 Word11
 and Word12 Word13 Word14 Word15

Word1 Word2 Word3 Word4 and Word5

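[Explanation]

It iterates over the same record while the pattern space contains more than 10 fields (not counting the word "and") && the pattern space still contains a split delimiter (a comma or " and "), so that splitting can continue.
If the pattern space contains a comma, it prints the left-hand part and updates the pattern space with the right-hand part.
If the pattern space contains the word "and", the processing is a bit harder, because the word stays in the updated pattern space. My approach may not be elegant, but it works even if the record contains multiple (two or more) " and "s.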
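
The comma step described above can be tried in isolation. This is only a minimal sketch: the function name `split_first_comma` is made up for illustration, and it performs a single split with no word-count check. It relies on the match starting at position 1, which holds as long as the line does not begin with a comma.

```shell
#!/usr/bin/env bash

# Hypothetical helper: split each input line once, after its first comma.
split_first_comma() {
    awk '{
        # match() sets RSTART and RLENGTH for the leftmost match
        if (match($0, "[^,]+,")) {
            print substr($0, 1, RLENGTH)     # left part, comma included
            $0 = substr($0, RLENGTH + 1)     # the rest becomes the new record
        }
        print
    }'
}

printf '%s\n' 'a b, c d, e' | split_first_comma
```

For the sample line this prints `a b,` followed by ` c d, e`. The leading space on the second line is the one that followed the comma, which is also why the continuation lines in the answer's output start with a space.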

[Edit]

If you want to count the word "and" as part of the word count, replace the 2nd line:

while (split($0, a, "( +and +)|( +)") > 10 && match($0, "( +and +)|,")) {

with:

while (NF > 10 && match($0, "( +and +)|,")) {
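
The difference between the two conditions can be checked on a short line. In this sketch (the function name `count_words` is hypothetical) the first number is `NF`, which counts "and" as a word, and the second is the result of the `split()` call from the original loop, which treats " and " as a separator and therefore leaves it out of the count. It assumes the line has no leading or trailing spaces, since those would add empty elements to the `split()` result.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print both word counts for each input line.
count_words() {
    awk '{
        without_and = split($0, a, "( +and +)|( +)")   # " and " acts as a separator
        print NF, without_and                          # NF counts "and" as a word
    }'
}

printf '%s\n' 'one two and three' | count_words
```

With a POSIX-conforming awk (leftmost-longest regex matching) this should print `4 3`.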

In addition, if you allow the word "and" to stay at the end of the original line, the script becomes slightly simpler:

awk '{
    while (NF > 10 && match($0, "( +and +)|,")) {
        if (match($0, "[^,]+,")) {
            # puts a newline after the 1st comma
            print substr($0, 1, RLENGTH)
            $0 = substr($0, RLENGTH + 1)
        } else {
            # puts a newline after the 1st substring " and "
            n = split($0, a, " +and +")
            print a[1] " and"
            $0 = " " a[2]
            for (i = 3; i <= n; i++) {      # there are two or more " and "
                $0 = $0 " and " a[i]
            }
        }
    }
    print
}' input.txt
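
For comparison, here is a rough sketch of a stricter two-phase reading of the question: every piece produced by a comma split is checked again on its own and, if it still has more than 10 words, is split after " and ". This is not the answer's code. The names `split_two_phase`, `words` and `emit_and` are made up for illustration, and it assumes that commas are followed by a single space and that "and" is counted as a word.

```shell
#!/usr/bin/env bash

# Hypothetical two-phase variant: commas first, then " and " on each piece.
split_two_phase() {
    awk '
    # Number of whitespace-separated words in s.
    function words(s,   tmp) { return split(s, tmp, " ") }

    # Phase 2: while a piece is too long, break it after the first " and ".
    function emit_and(s,   i) {
        while (words(s) > 10 && (i = index(s, " and ")) > 0) {
            print substr(s, 1, i + 3)      # keep "and" at the end of the line
            s = substr(s, i + 5)
        }
        print s
    }

    {
        # Phase 1: while the line is too long, break it after the first ", ".
        while (words($0) > 10 && (i = index($0, ", ")) > 0) {
            emit_and(substr($0, 1, i))     # left part, comma included
            $0 = substr($0, i + 2)
        }
        emit_and($0)
    }'
}

split_two_phase <<'EOF'
W1 W2 W3 W4, W5 W6 W7 W8 W9 W10 W11 and W12 W13 W14 W15
W1 W2 W3 W4 and W5
EOF
```

Unlike the scripts above, this version drops the space after each delimiter, so continuation lines do not start with a blank.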

Alternatively, if Perl is an option, you could say:

perl -ne '{
    while (split > 10 && /( +and +)|,/) {
        if (/^.*?(, *| +and +)/) {
            print $&, "\n";
            $_ = " $'\''";
        }
    }
    print
}' input.txt

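Hope this helps.

【Comments】:

Thank you for the solution and the explanation. There was a slight misunderstanding about the word "and"; I actually want it to count toward the word count as well. Also, until you mentioned it, I had not realized that having "and" on the new line would cause problems, since it is also part of the search condition. I would have been fine with keeping "and" on the original line. Upvoted.
@HenryM Thank you for testing my answer. I have updated the script according to your comment. BR.
Yes, this is great. I will go through it to better understand how you solved the problem. Thank you!

【Solution 2】:

Is this the answer you expected?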

echo "Word1 Word2 Word3 Word4, Word5 Word6 Word7 Word8 Word9 Word10, Word11 Word12 Word13 Word14 Word15 Word16 Word17 Word18 Word19 Word20 Word21 and Word22 Word23 Word24." | grep -oE '[a-zA-Z0-9,.]+' | awk '
BEGIN {
    cnt = 0
}
{
    str = str " " $0
    if ($0 ~ /,$/){
        print str
        cnt = 0
        str = ""
    }
    else if (cnt < 10){
        cnt++
    }
    else {
        print str
        cnt = 0
        str = ""
    }
} END {
    print str
}' | sed 's/^ *//'

Word1 Word2 Word3 Word4,
Word5 Word6 Word7 Word8 Word9 Word10,
Word11 Word12 Word13 Word14 Word15 Word16 Word17 Word18 Word19 Word20 Word21
and Word22 Word23 Word24.

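【Comments】:

No, because this splits at commas even for sentences with fewer than 10 words. I want a line to be split at a comma only when it has more than 10 words, and then, if a remaining piece still has more than 10 words, I want it split at the word "and".
Please look at your "desired output".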