AWK: filtering data based on the information in two columns

【Question title】: AWK: filtering of the data based on TWO column information
【Posted】: 2021-07-03 07:55:14
【Question description】:

I am post-processing a multi-column CSV arranged in the following format:

ID, POP, dG
1, 10, -5.6200
2, 4, -5.4900
3, 1, -5.3000
4, 4, -5.1600
5, 4, -4.8800
6, 3, -4.7600
7, 2, -4.4900
8, 5, -4.4500
9, 2, -4.4400
10, 8, -4.1400
11, 1, -4.1200
12, 2, -4.0900
13, 5, -4.0100
14, 1, -3.9500
15, 3, -3.9200
16, 10, -3.8800
17, 1, -3.8700
18, 3, -3.8300
19, 1, -3.8200
20, 3, -3.8000

Previously I used the following AWK solution, which reads the input log twice, detects pop(MAX) and keeps the lines matching $2 > (.8 * max):

awk -F ', ' 'NR == 1 {next} FNR==NR {if (max < $2) {max=$2; n=FNR+1} next} FNR <= 2 || (FNR == n && $2 > (.4*max)) || $2 > (.8 * max)' input.csv{,} > output.csv

This reduces the input log, keeping only the two lines with the highest POP:

ID, POP, dG
1, 10, -5.6200
16, 10, -3.8800

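Now I need to change the search algorithm so that it takes both the 2nd (POP) and the 3rd (dG) columns into account:
i) always take the first line as the reference; it always has the most negative number in the 3rd column (dG);
ii) find the line with the largest number in the 2nd column, pop(MAX);
iii) take all the lines between (i) and (ii) that match the following rules, applied to BOTH columns:
a) the (negative) number in the 3rd column must satisfy $3 <= (.5 * $3(min)), where $3(min) is the dG of the first line (always the most negative), so its magnitude is at least half that of the reference;
b) additionally, the line must match the old rule for the 2nd column with a lowered threshold: $2 >= (.5 * max), where max is pop(MAX).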

So the expected output should be:

ID, POP, dG
1, 10, -5.6200   # this is the first line, with the most negative dG
8, 5, -4.4500    # POP (5) and dG (-4.4500) match both rules
10, 8, -4.1400   # POP (8) and dG (-4.1400) match both rules
16, 10, -3.8800  # this is pop(MAX), the line with the highest POP
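
To make the two rules concrete, the thresholds for this sample can be computed in a single pass. This is only a sketch: the temporary `input.csv`, the names `maxP`, `minD` and `thresholds`, and the output format are assumptions made for illustration and are not part of the original post.

```shell
# Sketch: compute max(POP), min(dG) and the two 50% cut-offs for the first sample.
workdir=$(mktemp -d)                 # assumed scratch location
cat > "$workdir/input.csv" <<'EOF'
ID, POP, dG
1, 10, -5.6200
2, 4, -5.4900
3, 1, -5.3000
4, 4, -5.1600
5, 4, -4.8800
6, 3, -4.7600
7, 2, -4.4900
8, 5, -4.4500
9, 2, -4.4400
10, 8, -4.1400
11, 1, -4.1200
12, 2, -4.0900
13, 5, -4.0100
14, 1, -3.9500
15, 3, -3.9200
16, 10, -3.8800
17, 1, -3.8700
18, 3, -3.8300
19, 1, -3.8200
20, 3, -3.8000
EOF

thresholds=$(awk -F ', ' '
NR == 1 { next }                                  # skip the header line
{
    if (maxP < $2) maxP = $2                      # largest POP seen so far
    if (minD == "" || minD > $3) minD = $3        # most negative dG seen so far
}
END {
    printf "maxP=%d popCut=%.2f minD=%.4f dGCut=%.4f\n", maxP, .5 * maxP, minD, .5 * minD
}
' "$workdir/input.csv")

echo "$thresholds"
rm -rf "$workdir"
```

For this table it prints `maxP=10 popCut=5.00 minD=-5.6200 dGCut=-2.8100`, so rule (b) keeps the rows with POP >= 5, while every dG in the sample is already below -2.81 and passes rule (a).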

Added 8-04:

If the first line has a very low POP (it does not match the rule $2 >= (.5 * maxPop)):

ID, POP, dG
1, 5, -5.5600
2, 7, -5.3300
3, 7, -5.1900
4, 1, -4.6800
5, 1, -4.5800
6, 5, -4.5600
7, 3, -4.4700
8, 4, -4.4300
9, 9, -4.4200
10, 4, -4.4200
11, 2, -4.3800
12, 4, -4.3400
13, 25, -4.3000
14, 6, -4.2900
15, 8, -4.2600
16, 3, -4.2300
17, 1, -4.1800
18, 3, -4.1300
19, 1, -4.1300
20, 1, -4.1200
21, 27, -4.0800
22, 2, -4.0300

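The output should not contain the first line, while still using the value in the dG column as the reference for the second condition ($3 …):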

13, 25, -4.3000
21, 27, -4.0800

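【Question discussion】:

But POP (5) does not satisfy the condition $2 > (.5 * maxPop), so how can it be in the output?
Oops, it should actually be $2 >= (.5 * maxPop) # equal or greater
OK, then 13, 5, -4.0100 should also be in the output?
Correct, since it matches the rules for both columns. I'll edit it.

Tags: csv awk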

【Solution 1】:

You can use this awk solution:

awk -F ', ' 'NR == 1 {next} FNR==NR {if (maxP < $2) maxP=$2; if (minD=="" || minD > $3) minD=$3; next} FNR <= 2 || ($2 >= (.5 * maxP) && $3 <= (.5 * minD))' file{,}

ID, POP, dG
1, 10, -5.6200
8, 5, -4.4500
10, 8, -4.1400
13, 5, -4.0100
16, 10, -3.8800

To make it more readable:

awk -F ', ' '
NR == 1 {next}                   # skip 1st record 1st time
FNR == NR {
   if (maxP < $2)                # compute max(POP)
      maxP = $2
   if (minD == "" || minD > $3)  # compute min(dG)
      minD = $3
   next
}
# print if 1st 2 lines OR "$2 >= .5 * max(POP) && $3 <= .5 * min(dG)"
FNR <= 2 || ($2 >= (.5 * maxP) && $3 <= (.5 * minD))
' file{,}
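
Note that `file{,}` relies on bash brace expansion, which turns it into `file file` so that awk reads the same input twice. Below is a self-contained sketch of the same command with the file named twice explicitly, which should also work in shells without brace expansion. The temporary `input.csv` and the `filtered` variable are assumptions made for illustration.

```shell
# Sketch: run the two-pass filter on the first sample without brace expansion.
workdir=$(mktemp -d)                 # assumed scratch location
cat > "$workdir/input.csv" <<'EOF'
ID, POP, dG
1, 10, -5.6200
2, 4, -5.4900
3, 1, -5.3000
4, 4, -5.1600
5, 4, -4.8800
6, 3, -4.7600
7, 2, -4.4900
8, 5, -4.4500
9, 2, -4.4400
10, 8, -4.1400
11, 1, -4.1200
12, 2, -4.0900
13, 5, -4.0100
14, 1, -3.9500
15, 3, -3.9200
16, 10, -3.8800
17, 1, -3.8700
18, 3, -3.8300
19, 1, -3.8200
20, 3, -3.8000
EOF

filtered=$(awk -F ', ' '
NR == 1 { next }                     # first pass: skip the header
FNR == NR {                          # first pass: collect max(POP) and min(dG)
    if (maxP < $2) maxP = $2
    if (minD == "" || minD > $3) minD = $3
    next
}
# second pass: header, first data row, and every row passing both rules
FNR <= 2 || ($2 >= (.5 * maxP) && $3 <= (.5 * minD))
' "$workdir/input.csv" "$workdir/input.csv")

printf '%s\n' "$filtered"
rm -rf "$workdir"
```

On the first sample this should reproduce the five data rows listed above (IDs 1, 8, 10, 13 and 16) under the header.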

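【Discussion】:

I'm going to test it on several CSV inputs now. The condition for the third column looks very good and correct, $3 …
Oh, that's simple. You can do this: awk -F ', ' 'NR == 1 {next} FNR==NR {if (maxP < $2) maxP=$2; if (minD=="" || minD > $3) minD=$3; next} FNR == 1 || ($2 >= (.5 * maxP) && $3 <= (.5 * minD))' file{,}
Awesome! Once again, thank you very much! Cheers,
Awesome! Can't believe this is possible :o! Thanks a lot again!
This one may be more complicated, but it also calls for an elegant AWK solution: >> stackoverflow.com/questions/67089862/…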
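
The only difference between the two commands in this thread is how many leading lines of the second pass are printed unconditionally: `FNR <= 2` keeps the header and the first data row, `FNR == 1` keeps only the header. The sketch below runs both on the second sample from the question. The `run_filter` helper, the `keep` awk variable and the temporary `second.csv` are hypothetical names introduced here for illustration.

```shell
# Sketch: compare "FNR <= 2" and "FNR == 1" on the second sample.
workdir=$(mktemp -d)                 # assumed scratch location
cat > "$workdir/second.csv" <<'EOF'
ID, POP, dG
1, 5, -5.5600
2, 7, -5.3300
3, 7, -5.1900
4, 1, -4.6800
5, 1, -4.5800
6, 5, -4.5600
7, 3, -4.4700
8, 4, -4.4300
9, 9, -4.4200
10, 4, -4.4200
11, 2, -4.3800
12, 4, -4.3400
13, 25, -4.3000
14, 6, -4.2900
15, 8, -4.2600
16, 3, -4.2300
17, 1, -4.1800
18, 3, -4.1300
19, 1, -4.1300
20, 1, -4.1200
21, 27, -4.0800
22, 2, -4.0300
EOF

# Hypothetical helper: $1 is the number of leading lines that are always printed.
run_filter() {
    awk -F ', ' -v keep="$1" '
    NR == 1 { next }                 # first pass: skip the header
    FNR == NR {                      # first pass: collect max(POP) and min(dG)
        if (maxP < $2) maxP = $2
        if (minD == "" || minD > $3) minD = $3
        next
    }
    # second pass: the first "keep" lines, plus rows passing both rules
    FNR <= keep || ($2 >= (.5 * maxP) && $3 <= (.5 * minD))
    ' "$workdir/second.csv" "$workdir/second.csv"
}

header_only=$(run_filter 1)          # behaves like "FNR == 1"
with_first_row=$(run_filter 2)       # behaves like "FNR <= 2"

printf 'keep=1:\n%s\n\nkeep=2:\n%s\n' "$header_only" "$with_first_row"
rm -rf "$workdir"
```

Here max(POP) is 27, so the POP cut-off is 13.5 and only IDs 13 and 21 pass. With `keep=1` the first data row (POP 5) is dropped, as the question asks; with `keep=2` it is printed regardless of its POP.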