Discrete to continuous number ranges via awk: answers

【Question title】: Discrete to continuous number ranges via awk
【Posted】: 2017-06-26 15:35:05
【Question description】:

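Assume a text file named file contains several discrete number ranges, one per line. Each range is preceded by a string (i.e. the range name). The lower and upper bounds of each range are separated by a dash. Each number range is followed by a semicolon. The ranges are sorted (i.e. range 101-297 comes before 1299-1301) and do not overlap.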

$ cat file
foo  101-297;
bar  1299-1301;
baz  1314-5266;

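Note that in the example above the three ranges do not form a continuous range starting at the integer 1.

I believe awk is the right tool for filling in the missing number ranges, so that all ranges together form a continuous range from {1} to {upper bound of the last range}. If so, what awk command/function would you use to perform the task?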

$ cat file | sought_awk_command
new1 1-100;
foo  101-297;
new2 298-1298;
bar  1299-1301;
new3 1302-1313;
baz  1314-5266;

Edit 1: On closer evaluation, the code suggested below fails on another simple example.

$ cat example2
foo  101-297;
bar  1299-1301;
baz  1302-1314; # Notice that ranges "bar" and "baz" are contiguous with one another
qux  1399-5266;

$ awk -F'[ -]' '$3-Q>1{print "new"++o,Q+1"-"$3-1";";Q=$4} 1' example2
new1 1-100;
foo  101-297;
new2 298-1298;
bar  1299-1301;
baz  1302-1314;
new3 1302-1398; # ERROR HERE: Notice that the lower bound of range "new3" is derived from the upper bound of "bar" (1301 + 1), not of "baz" (1314 + 1).
qux  1399-5266;

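Edit 2: Many thanks to RavinderSingh13 for helping with this problem. However, the suggested code still produces output that is inconsistent with the stated goal.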

$ cat example3
foo  35025-35144;
bar  35259-35375;
baz  35376-35624;
qux  37911-39434;

$ awk -F'[ -]' '$3-Q+0>=1{print "new"++o,Q+1"-"$3-1";";Q=$4} {Q=$4;print}' example3
new1 1-35024;
foo  35025-35144;
new2 35145-35258;
bar  35259-35375;
new3 35376-35375; # ERROR HERE: Notice that range "new3" has been added, even though ranges "bar" and "baz" are contiguous.
baz  35376-35624;
new4 35625-37910;
qux  37911-39434;

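【Question discussion】:

You said "the ranges are sorted (i.e. range 101-297 comes before 1299-1301) and do not overlap", but then posted example2, where they do overlap (bar 1299-1301; and baz 1301-1314; overlap at 1301). Do they overlap or not? Also, awk is not a bash tool; it is a completely separate tool, available on all standard UNIX installations and on some other operating systems.
@EdMorton The ranges do not overlap. example2 contained an error. My mistake. It should be bar 1299-1301 and baz 1302-1314. I have corrected the example accordingly. Point also taken about awk being a UNIX tool rather than a bash-specific one.
No problem, the script I posted works fine.
What is the expected output for data such as 1-100 followed by 102-200, i.e. when there is only a single value between the ranges?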

Tags: bash awk integer range continuous

【Solution 1】:

Try:

awk -F'[ -]' '$3-Q>1{print "new"++o,Q+1"-"$3-1";";Q=$4} 1'   Input_file

EDIT: Now also adding a non-one-liner form of the solution, with a proper explanation.

awk -F'[ -]' '                                        ###Set the field separator to a single space or a dash.
                $3-Q>1{                               ###If the lower bound ($3) exceeds the previous upper bound (Q) by more than 1, there is a gap to fill.
                        print "new"++o,Q+1"-"$3-1";"; ###Print the string new with an incrementing counter o, then Q+1, a dash, $3-1 and a semicolon.
                        Q=$4                          ###Store the upper bound ($4) of the current line in Q; note that this only happens when a gap was printed.
                      }
                1                                     ###Print the current line.
             ' Input_file                             ###The input file.

EDIT 2: Adding one more answer based on the OP's conditions.

 awk -F'[ -]' '$3-Q+0>=1{print "new"++o,Q+1"-"$3-1";";Q=$4} {Q=$4;print}'   Input_file
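Both failures reported in the question appear to come from the same two points: the previous upper bound is only (or not consistently) updated, and the gap test fires when the ranges are merely adjacent. A minimal sketch of a variant that updates the bound on every line and only reports a gap when at least one number is missing is shown here. The function name fill_gaps and the variables prev and n are illustrative, and the separator '[ ;-]+' assumes range names contain no spaces, dashes or semicolons.

```shell
# Hypothetical helper; the name fill_gaps is an assumption for illustration.
fill_gaps() {
    awk -F'[ ;-]+' '
        # Runs of separators collapse, so $2 = lower bound and $3 = upper bound.
        # A gap exists only if at least one integer lies between prev and $2.
        $2 - prev > 1 { print "new" (++n), (prev + 1) "-" ($2 - 1) ";" }
        # Always remember the current upper bound, then print the line unchanged.
        { prev = $3; print }
    ' "$@"
}

# Usage with the data from example3 in the question.
printf '%s\n' 'foo  35025-35144;' 'bar  35259-35375;' \
              'baz  35376-35624;' 'qux  37911-39434;' | fill_gaps
```

With this input, no line is inserted between bar and baz because 35376 directly follows 35375.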

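【Discussion】:

Great explanation! However, the upper bound of each inserted range (i.e. newX) should be one less than the lower bound (not the upper bound!) of the following range.
Apologies for that; I have now changed it accordingly.
Wow! You sure know awk!
Thanks, you could also give me an upvote to encourage me to keep answering :) Glad to help; keep learning and keep posting questions/answers.
There may be more than 2 spaces, in which case $4 would be empty. I would use -F'[[:space:]]*|-'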

【Solution 2】:

This works without problems for ranges that may overlap, as in your original example 2, where bar 1299-1301; and baz 1301-1314; overlap at 1301.

$ cat tst.awk
{ split($2,curr,/[-;]/); currStart=curr[1]; currEnd=curr[2] }
currStart > (prevEnd+1) { print "new"++cnt, prevEnd+1 "-" currStart-1 ";" }
{ print; prevEnd=currEnd }

$ awk -f tst.awk file
new1 1-100;
foo  101-297;
new2 298-1298;
bar  1299-1301;
new3 1302-1313;
baz  1314-5266;

$ awk -f tst.awk example2
new1 1-100;
foo  101-297;
new2 298-1298;
bar  1299-1301;
baz  1301-1314;
new3 1315-1398;
qux  1399-5266;

$ awk -f tst.awk example3
new1 1-35024;
foo  35025-35144;
new2 35145-35258;
bar  35259-35375;
baz  35376-35624;
new3 35625-37910;
qux  37911-39434;
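To confirm that a filled file really forms one continuous range starting at 1, a small checker can be run over the output. This is only a sketch, not part of the answer above: the function name check_contiguous and the variables r, prev and bad are illustrative, and it assumes the same "name lower-upper;" layout with the range in the second whitespace-separated field.

```shell
# Hypothetical checker; exits non-zero if any range does not start at prev + 1.
check_contiguous() {
    awk '
        # r[1] = lower bound, r[2] = upper bound of the current line.
        { split($2, r, /[-;]/) }
        # Force numeric comparison; report any gap or overlap.
        (r[1] + 0) != prev + 1 { print "gap or overlap before: " $0; bad = 1 }
        { prev = r[2] + 0 }
        END { exit bad + 0 }
    ' "$@"
}

# Usage: a filled file passes silently.
printf '%s\n' 'new1 1-100;' 'foo  101-297;' 'new2 298-1298;' 'bar  1299-1301;' | check_contiguous
```

Note that this check treats overlapping ranges such as 1299-1301 followed by 1301-1314 as an error, which is stricter than the script above.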

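【Discussion】:

Thank you for this alternative code. Since I am trying to avoid any extra files (i.e. tst.awk) in my working directory, how can I apply your three lines of code without having to save them to a file and call the -f flag?
In general it is awk 'script' file instead of awk -f script_in_file file, so in this case awk '{ split($2,curr,/[-;]/); currStart=curr[1]; currEnd=curr[2] } currStart > (prevEnd+1) { print "new"++cnt, prevEnd+1 "-" currStart-1 ";" } { print; prevEnd=currEnd }' file. You might benefit from simply saving it in a different directory and executing it from there, so that you can reuse and build on it later, e.g. awk -f "$HOME/bin/tst.awk" file. If for some reason you prefer brevity over clarity, you can of course make all the variable names single letters.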

【Solution 3】:

$ cat file1
foo  2-100
bar  102-200
$ awk -F' +|[-;]' 'p+1<$2{print "new" ++q, p+1 "-" $2-1 ";"}p=$3' file1
new1 1-1;
foo  2-100
new2 101-101;
bar  102-200
$ cat file2
foo  101-297;
bar  1299-1301;
baz  1314-5266;
$ awk -F' +|[-;]' 'p+1<$2{print "new" ++q, p+1 "-" $2-1 ";"}p=$3' file2
new1 1-100;
foo  101-297;
new2 298-1298;
bar  1299-1301;
new3 1302-1313;
baz  1314-5266;

Explanation:

$ awk -F' +|[-;]' '                   # FS is ; - or a bunch of spaces
p+1 < $2 {                           # if previous $3 + 1 is still less than new $2
    print "new"++q,p+1 "-" $2-1 ";"  # print a "new" line
}
p=$3                                 # set future p and implicit print of record *
' file2                              # * as all values are above 0
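The bare p=$3 pattern prints the record only because the assigned value is treated as true, which is why the footnote requires all values to be above 0. If that assumption is uncomfortable, the print can be made unconditional. A minimal sketch follows; the function name fill_gaps_explicit is illustrative, while the field separator and variable names are taken from the answer above.

```shell
# Hypothetical wrapper around the same logic with an explicit print.
fill_gaps_explicit() {
    awk -F' +|[-;]' '
        # $2 = lower bound, $3 = upper bound; p holds the previous upper bound.
        p + 1 < $2 { print "new" (++q), (p + 1) "-" ($2 - 1) ";" }
        # Assign and print in an action, so printing does not depend on the value of $3.
        { p = $3; print }
    ' "$@"
}

# Usage with the data from file2.
printf '%s\n' 'foo  101-297;' 'bar  1299-1301;' 'baz  1314-5266;' | fill_gaps_explicit
```

The parentheses around ++q, p + 1 and $2 - 1 are optional here; they are only there to make the grouping of concatenation and arithmetic obvious.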

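【Discussion】:

I think awk -F '[+;-]' … would be slightly faster than your field separator, since a character class is faster than alternation (note that the dash must come last to keep it from denoting a range of characters). I think it also fixes a typo in your code (unless you meant to split on ` +` rather than on + with no leading space).
@AdamKatz I did indeed: FS is ; - or a bunch of spaces
Ah, I see now. In that case, -F' +|[;-]' would be better.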