awk: separate rows based on $2 and $17, and average $17 (answers)

【Question title】: awk separate rows based on $2 and $17 and do average on $17
【Posted】: 2014-07-16 08:52:12
【Question】:

Here is our input:

cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

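We want to split this input.csv into 2 files.

If $2 is the same and the "maximum minus minimum" of $17 is <= 1, average $17 and put the row into "file a".

If $2 is the same and the "maximum minus minimum" of $17 is > 1, average $17 and put the row into "file b".

Note: if a $2 value is unique, we want to keep it in "file a" (cpd-6666666, for example).

Note: for cpd-1111, ($17 max-min) = -1-(-1.3) = 0.3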
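The max-minus-min check can be reproduced per group with a short awk pass. This is a minimal sketch on a reduced, hypothetical two-column version of the sample (cpd_number and log_ratio only, which are $2 and $17 in the real file); the array names `order`, `min` and `max` are illustrative choices.

```shell
# Reduced, hypothetical two-column version of the sample: cpd_number,log_ratio
# (these are $2 and $17 in the real input.csv).
spread=$(printf '%s\n' \
  'cpd-7788990,-1' 'cpd-7788990,-2' 'cpd-7788990,-3' \
  'cpd-6666666,-1' \
  'cpd-1111,-1' 'cpd-1111,-1.1' 'cpd-1111,-1.2' 'cpd-1111,-1.3' |
awk -F, '
  # First row of a key: remember its order and seed min and max.
  !($1 in max) { order[++n] = $1; max[$1] = min[$1] = $2 + 0; next }
  # Later rows: widen the range if needed.
  { v = $2 + 0; if (v > max[$1]) max[$1] = v; if (v < min[$1]) min[$1] = v }
  END { for (i = 1; i <= n; i++) printf "%s %g\n", order[i], max[order[i]] - min[order[i]] }
')
echo "$spread"
```

With the sample values this prints a spread of 2 for cpd-7788990, 0 for cpd-6666666 and 0.3 for cpd-1111, so only cpd-7788990 exceeds the threshold of 1.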

a: where ($17 max-min) <= 1

cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.15,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

b: where ($17 max-min) > 1. The new $17 for cpd-7788990 ($2) is the average of (-1,-2,-3) = -2

cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme

Here is an attempt that splits the input into a and b but does not do the averaging yet.

#!/usr/bin/awk -f

BEGIN { FS=","; f1="a"; f2="b" }

FNR==1 { print $0 > f1; print $0 > f2; next }

$2!=last_id && FNR > 2 { handleBlock() }

{ a[++cnt]=$0; m[cnt]=$17; last_id=$2 }

END { handleBlock() }

function handleBlock() {
  if( m[1]-m[cnt]<=1 ) fname = f1
  else fname = f2
  for( i=1;i<=cnt;i++ ) { print a[i] > fname }
  cnt=0
}

Is it possible to also do the averaging in a and b? Thanks.

【Comments】:

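See also: (1) Linux: sort $2 & $17 in numerical values; sort distant columns, (2) awk: separate rows if "$2 are the same and max and min value <= 1" ..., and (3) Categorize CSV files based on $18 info .... I'm not claiming any of them is an exact duplicate; the questions are related, though (by data set, if nothing else).
You say "if $2 are the same and $17 ...". $2 is the same as what? $3? $2 on the next line? $2 on the previous line? Some other condition? The maximum of what? The minimum of what? When a row is the only one with a given value in $2, where should it be printed? Standard output? This question needs serious clarification. My next comment will attempt to paraphrase your requirements.
The data in the input file should be written to three outputs: 'file a', 'file b' and standard output. The output lines should have the same shape as the input records. The header line (line 1) should be written to all three outputs. Records should be grouped by the value in $2. The data is sorted so that rows with the same value in $2 are adjacent in the input. [...continued...]
[...continuation...] When there is only one record in a group, that line should be written to standard output. Otherwise, the values in $17 should be averaged. If the average is greater than 1, a line with the average in place of $17 (from the last row of the group) is written to 'file b'; otherwise, it is written to 'file a'. The other data in the output line should come from the last record read for the group. The input data has 21 fields per line.
Couldn't you describe your problem with a smaller, simpler input? Say fields 2 and 4 (out of 5) instead of fields 2 and 17 (out of goodness knows how many)? More of us might have looked at it...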

Tags: awk

【Solution 1】:

You can get the average into the output files by changing handleBlock() as follows:

function handleBlock() {
  # m[1] is the first and m[cnt] the last $17 of the group; this equals
  # max-min only if the group is sorted by $17 in descending order
  if( m[1]-m[cnt]<=1 ) fname = f1
  else fname = f2
  # compute the sum of the $17 fields for the group
  for( i=1;i<=cnt;i++ ) { sum+=m[i] }
  # compute the average
  avg = cnt > 0 ? sum/cnt : sum
  # use the first line of the group (the max line, given that sort order)
  # for the output, split into an output array: oarr
  fcnt = split( a[1], oarr )
  # modify the 17th field of the output array
  oarr[17]=avg
  # write the updated array to the desired file one field at a time
  for( i=1;i<=fcnt;i++ ) {
    printf( "%s%s", oarr[i], i==fcnt ? "\n" : FS ) > fname
  }
  cnt=0; sum=0
}
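For reference, here is a self-contained sketch that puts the pieces together and can be run against the sample data from the question. It is a variation on the function above, not the same code. It tracks the minimum and maximum of $17 explicitly, so rows inside a group need not be sorted by $17. It still assumes that rows sharing a $2 value are adjacent, and it keeps the first row of each group as the output template. The function name `flush_group` and the variables `first`, `min`, `max` are illustrative.

```shell
# Work in a scratch directory so the output files "a" and "b" clobber nothing.
cd "$(mktemp -d)" || exit 1

cat > input.csv <<'EOF'
cpdID,cpd_number,Cell_assay_id,Cell_alt_assay_id,Cell_type_desc,Cell_Operator,Cell_result_value,Cell_unit_value,assay_id,alt_assay_id,type_desc,operator,result_value,unit_value,Ratio_operator,Ratio,log_ratio,Cell_experiment_date,experiment_date,Cell_discipline,discipline
49,cpd-7788990,1212,2323, IC50 ,,100,uM,1334,1331,Ki,,10,uM,,10,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,5555,6666, IC50 ,>,150,uM,1334,1331,Ki,,10,uM,>,15,-2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-7788990,8888,9999, IC50 ,,200,uM,1334,1331,Ki,,10,uM,,20,-3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-6666666,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.1,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.2,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
49,cpd-1111,8888,9999, IC50 ,,400,uM,1334,1331,Ki,,10,uM,,40,-1.3,12/6/2006 0:00,2/16/2007 0:00,Cell,Enzyme
EOF

awk '
BEGIN { FS = ","; f1 = "a"; f2 = "b" }

# Copy the header line to both output files.
FNR == 1 { print > f1; print > f2; next }

# A new $2 value closes the previous group.
cnt > 0 && $2 != last_id { flush_group() }

{
    v = $17 + 0
    if (cnt == 0) { first = $0; min = max = v }   # start of a group
    if (v < min) min = v
    if (v > max) max = v
    sum += v; cnt++; last_id = $2
}

END { if (cnt > 0) flush_group() }

# Write one row per group: the first row of the group with $17 replaced
# by the group average; the file is chosen by the max-min spread.
function flush_group(   n, i, out, fname) {
    fname = (max - min <= 1) ? f1 : f2
    n = split(first, out, FS)
    out[17] = sum / cnt
    for (i = 1; i <= n; i++)
        printf "%s%s", out[i], (i == n ? "\n" : FS) > fname
    cnt = 0; sum = 0
}
' input.csv

cat a
cat b
```

On the sample input, file a should contain the header plus the cpd-6666666 row ($17 = -1) and the cpd-1111 row ($17 = -1.15), and file b the header plus the cpd-7788990 row ($17 = -2), matching the expected output in the question.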

See here for comments on the original script.

【Comments】: