awk: merging lines based on multiple field matching/unmatching (answers)

【Question title】: awk merge lines based on multiple field matching/unmatching
【Posted】: 2014-08-21 11:03:48
【Question】:

We have a csv:

targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator ,   result_value, unit_value , experiment_date , discipline, activity_flag 
51, cpd-7788990 ,9999,0,  IC50  ,,10, uM ,  2006-07-01 00:00:00 , Biochemical ,
51, cpd-7788990 ,4444,5555, Ki , > ,5, uM ,  2006-08-01 00:00:00  ,  Enzyme ,
51, cpd-7788990 ,1212,2323,  IC50  ,,100, uM ,  2006-09-01 00:00:00  , Cell ,

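Our ultimate goal: if "cpd_number" ($2) is the same but "discipline" ($10) differs, merge each line whose "discipline" ($10) is "Cell" with each line whose "discipline" ($10) is not "Cell". ("discipline" has only three options: Biochemical, Cell, Enzyme.) The ideal output is shown below.
(Note) The new "result_value" ($7) = the "result_value" ($7) of the line whose "discipline" ($10) is "Cell", divided by the "result_value" ($7) of the line whose "discipline" ($10) is "Biochemical" or "Enzyme".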
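
As a quick check of the note above against the sample rows: the Cell result 100 divided by the Biochemical result 10 gives 10, and divided by the Enzyme result 5 gives 20. A minimal sketch of that arithmetic follows; the helper name `ratio` and the "NA" fallback for a zero divisor are assumptions made for illustration, not part of the question.

```sh
# ratio CELL_VALUE OTHER_VALUE
# Hypothetical helper: prints CELL_VALUE / OTHER_VALUE, or "NA" if the divisor is zero or empty.
ratio() {
    awk -v cell="$1" -v other="$2" 'BEGIN {
        if (other + 0 == 0) { print "NA"; exit }   # guard against a division-by-zero error
        print cell / other
    }'
}

ratio 100 10   # Cell vs. Biochemical row of the sample
ratio 100 5    # Cell vs. Enzyme row of the sample
```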

targetID , cpd_number , Cell_assay_id , Cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,Operator,   result_value, unit_value ,Cell_experiment_date,experiment_date, Cell_discipline , discipline 
51,cpd-7788990,1212,2323, IC50 ,9999,0,IC50,,10,uM, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,Cell,Biochemical
51,cpd-7788990,1212,2323, IC50 ,4444,5555,Ki,>,20,uM, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,Cell,Enzyme

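Doing this in one step looks complicated. So I first tried to merge whole lines: if "cpd_number" ($2) is the same but "discipline" ($10) differs, join the line whose "discipline" ($10) is "Cell" with each line whose "discipline" ($10) is not "Cell". After merging, we can use awk to clean up and rename the headers. Could any gurus offer some ideas on how to write this one-liner? This is only a toy example. The real csv file is very large, so matching on /^51/ may not be ideal. Thanks!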

targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator ,   result_value, unit_value , experiment_date , discipline, activity_flag, targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator ,   result_value, unit_value , experiment_date , discipline, activity_flag 
51, cpd-7788990 ,9999,0,  IC50  ,,10, uM ,  2006-07-01 00:00:00 , Biochemical , 51, cpd-7788990 ,1212,2323,  IC50  ,,100, uM ,  2006-09-01 00:00:00  , Cell ,
51, cpd-7788990 ,4444,5555, Ki , > ,5, uM ,  2006-08-01 00:00:00  ,  Enzyme , 51, cpd-7788990 ,1212,2323,  IC50  ,,100, uM ,  2006-09-01 00:00:00  , Cell ,

Additional example:

targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator ,   result_value, unit_value , experiment_date , discipline, activity_flag 
51, cpd-7788990 ,9999,0,  IC50  ,,10, uM ,  2006-07-01 00:00:00 , Biochemical ,
51, cpd-7788990 ,4444,5555, Ki , > ,5, uM ,  2006-08-01 00:00:00  ,  Enzyme ,
51, cpd-7788990 ,1212,2323,  IC50  ,,100, uM ,  2006-09-01 00:00:00  , Cell ,
51, cpd-7788990 ,8888,9999,  IC50  ,,200, uM ,  2006-09-01 00:00:00  , Cell ,

Output:

targetID , cpd_number , Cell_assay_id , Cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,Operator,   result_value, unit_value ,Cell_experiment_date,experiment_date, Cell_discipline , discipline 
51,cpd-7788990,1212,2323, IC50 ,9999,0,IC50,,10,uM, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,Cell,Biochemical
51,cpd-7788990,1212,2323, IC50 ,4444,5555,Ki,>,20,uM, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,Cell,Enzyme
51,cpd-7788990,8888,9999, IC50 ,9999,0,IC50,,20,uM, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,Cell,Biochemical
51,cpd-7788990,8888,9999, IC50 ,4444,5555,Ki,>,40,uM, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,Cell,Enzyme

【Comments】:

What have you done so far?

Tags: bash csv awk merge

【Solution 1】:

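Here is an awk script, put together from your sample input and final desired output. Feel free to tweak it to suit your needs; it should be enough to get you started. It requires two passes over your csv file. In the first pass it builds arrays keyed on the second column (together with the two assay IDs) for rows whose discipline is Cell; in the second pass it combines each non-Cell row with the stored Cell data and prints the formatted line. Since you have not stated how to handle rows that have no Cell counterpart, the following solution ignores them.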
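
The two-pass pattern described above hinges on `NR==FNR`, which holds only while awk is reading the first file argument. To see which rows the answer would silently ignore, something like the following sketch could be used. The helper name `list_unpaired` is made up, and the sketch assumes the file starts with a header line and has the same layout as the sample (discipline in the second-to-last field).

```sh
# list_unpaired FILE
# Hypothetical helper: prints non-Cell rows whose cpd_number has no Cell row at all.
list_unpaired() {
    awk '
    BEGIN { FS = " *, *" }                  # same field separator as script.awk
    NR == FNR {                              # first pass: remember compounds that have a Cell row
        if ($(NF-1) == "Cell") has_cell[$2]
        next
    }
    # second pass: skip the header, print rows that have no Cell counterpart
    FNR > 1 && $(NF-1) != "Cell" && !($2 in has_cell)
    ' "$1" "$1"
}

# usage: list_unpaired file
```

The file name is passed twice so that awk reads it once per pass, the same way `awk -f script.awk file file` does.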

Contents of script.awk:

BEGIN { 
    FS  = " *, *"                             # Set input field sep to this regex
    OFS = ","                                 # Set output field sep to comma
}

NR==FNR {                                     # In the first pass to the file
    if ($(NF-1) == "Cell") {                  # If the second last field is Cell
        flds[$2,$3,$4] = $3 OFS $4 OFS $5;    # Create an array to store col 3,4 and 5 separated by comma
        date[$2,$3,$4] = $9                   # Store date
        result[$2,$3,$4] = $7                 # Store col 7
    }
    next                                      # Move to the next record
} 

{                                             # For the second pass to the file
    for (fld in flds) {                       # For every entry in our array
        split (fld, tmp, SUBSEP);             # Split the composite key
        if ($(NF-1) != "Cell" && tmp[1] == $2) {   # If the second-to-last field is not Cell and the first piece of the key equals col 2
            line = $0                         # Save the current line in a variable
            $3 = flds[fld] OFS $3             # modify col3 to put the value from array in front of col3
            $7 = result[fld] / $7             # Calculate the new result value
            $9 = date[fld] OFS $9             # Add the date
            $(NF-1) = "Cell" OFS $(NF-1)      # Place the Cell text
            NF--                              # Remove the last field
            print                             # print the line
            $0 = line                         # Swap the modified line back
        }
    }
}

$(NF-1) == "Cell" {                           # If the second-to-last field is Cell, skip the row (no rule prints it anyway)
    next
}

Run it like this:

$ awk -f script.awk file file
51,cpd-7788990,1212,2323,IC50,9999,0,IC50,,10,uM,2006-09-01 00:00:00,2006-07-01 00:00:00,Cell,Biochemical
51,cpd-7788990,8888,9999,IC50,9999,0,IC50,,20,uM,2006-09-01 00:00:00,2006-07-01 00:00:00,Cell,Biochemical
51,cpd-7788990,1212,2323,IC50,4444,5555,Ki,>,20,uM,2006-09-01 00:00:00,2006-08-01 00:00:00,Cell,Enzyme
51,cpd-7788990,8888,9999,IC50,4444,5555,Ki,>,40,uM,2006-09-01 00:00:00,2006-08-01 00:00:00,Cell,Enzyme
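
For readers modifying the script: it stores the Cell data under composite keys such as `flds[$2,$3,$4]`. awk joins the subscripts into one string with the built-in `SUBSEP` character, and `split(fld, tmp, SUBSEP)` recovers the pieces. Because `for (fld in flds)` visits keys in an unspecified order, the row order of the output may differ between awk implementations. A minimal sketch of the key handling, with a made-up function name `show_key_parts` and values taken from the sample data:

```sh
# show_key_parts
# Hypothetical demo: stores one composite key, then splits it back into its parts.
show_key_parts() {
    awk 'BEGIN {
        key["cpd-7788990", 1212, 2323] = "x"    # one string key, parts joined by SUBSEP
        for (k in key) {
            n = split(k, part, SUBSEP)          # n is the number of parts in the key
            print n, part[1], part[2], part[3]
        }
    }'
}

show_key_parts
```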

You can include a statement that prints the header inside the BEGIN block.
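
A minimal sketch of such a header statement, using the column names from the desired output in the question. The function name `print_header` is made up for this demo. In script.awk the `print` would simply go inside the existing BEGIN block, after `OFS` is set.

```sh
# print_header
# Hypothetical demo: prints the merged header line from a BEGIN block.
print_header() {
    awk 'BEGIN {
        OFS = ","                               # same output separator as script.awk
        print "targetID", "cpd_number", "Cell_assay_id", "Cell_alt_assay_id", "type_desc",
              "assay_id", "alt_assay_id", "type_desc", "Operator", "result_value", "unit_value",
              "Cell_experiment_date", "experiment_date", "Cell_discipline", "discipline"
    }'
}

print_header
```

Because BEGIN runs once before any input is read, the header is printed a single time even though the file name is given twice on the command line.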

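【Discussion】:

Thanks, Jaypal! A very neat solution! But could you add some comments to each line? I am trying to fully understand the script so that I can modify it. The script works perfectly on the original example. However, I have just added an "additional example", and on it the current script keeps only one Cell:Enzyme line and one Cell:Biochemical line, rather than the two Cell:Enzyme and two Cell:Biochemical lines shown in the additional example.
@Chubaka Your new data actually changed the whole answer. I have updated it. Please copy the script, save it in a file, and run it as shown above. I have added comments to guide you through the process.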