【Posted】: 2014-08-21 11:03:48
【Question】:
We have a csv:
targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator , result_value, unit_value , experiment_date , discipline, activity_flag
51, cpd-7788990 ,9999,0, IC50 ,,10, uM , 2006-07-01 00:00:00 , Biochemical ,
51, cpd-7788990 ,4444,5555, Ki , > ,5, uM , 2006-08-01 00:00:00 , Enzyme ,
51, cpd-7788990 ,1212,2323, IC50 ,,100, uM , 2006-09-01 00:00:00 , Cell ,
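Our ultimate goal: for rows that share the same "cpd_number" ($2), merge each row whose "discipline" ($10) is "Cell" with each row whose "discipline" ($10) is not "Cell". ("discipline" has only 3 possible values: Biochemical, Cell, Enzyme.) The ideal output is shown below.
(Note) The new "result_value" ($7) = the "result_value" ($7) of the row whose "discipline" ($10) is "Cell", divided by the "result_value" ($7) of the row whose "discipline" ($10) is "Biochemical" or "Enzyme".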
targetID , cpd_number , Cell_assay_id , Cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,Operator, result_value, unit_value ,Cell_experiment_date,experiment_date, Cell_discipline , discipline
51,cpd-7788990,1212,2323, IC50 ,9999,0,IC50,,10,uM, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,Cell,Biochemical
51,cpd-7788990,1212,2323, IC50 ,4444,5555,Ki,>,20,uM, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,Cell,Enzyme
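To pin down the semantics of the ideal output above, here is a minimal Python sketch (not the awk one-liner asked for). The function name `merge_cell_rows` is hypothetical, and two choices are assumptions read off the sample output rather than stated rules: the operator and the unit are taken from the non-Cell row. It groups rows by cpd_number in a single pass, so it does not depend on any particular targetID.

```python
import csv
import io
from collections import defaultdict


def merge_cell_rows(csv_text):
    """Pair every Cell row with every non-Cell row sharing a cpd_number."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header line
    cell_rows = defaultdict(list)   # cpd_number -> rows with discipline "Cell"
    other_rows = defaultdict(list)  # cpd_number -> Biochemical / Enzyme rows
    for raw in reader:
        row = [field.strip() for field in raw]  # drop padding around fields
        if len(row) < 10:
            continue  # skip blank or truncated lines
        bucket = cell_rows if row[9] == "Cell" else other_rows
        bucket[row[1]].append(row)

    merged = []
    for cpd, cells in cell_rows.items():
        for cell in cells:
            for other in other_rows.get(cpd, []):
                # New result_value: Cell result divided by partner result.
                ratio = float(cell[6]) / float(other[6])
                merged.append([
                    cell[0], cpd,
                    cell[2], cell[3], cell[4],     # Cell assay ids and type
                    other[2], other[3], other[4],  # partner assay ids and type
                    other[5],                      # operator (assumed: partner row's)
                    format(ratio, "g"),            # "g" keeps 6 significant digits
                    other[7],                      # unit (assumed: partner row's)
                    cell[8], other[8],             # experiment dates
                    cell[9], other[9],             # disciplines
                ])
    return merged
```

On the three-row sample this yields two rows with result_value 10 (100/10) and 20 (100/5). A zero or non-numeric result_value in a partner row would raise an exception, so real data may need a guard there.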
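Doing this in one step looks complicated, so I first tried to merge whole rows: if "cpd_number" ($2) is the same but "discipline" ($10) differs, join each row whose "discipline" ($10) is "Cell" with each row whose "discipline" ($10) is not "Cell". After merging, we can use awk to clean up and rename the headers. Could anyone suggest how to write this as a one-liner? This is only a toy example. The real csv file is very large, so matching on /^51/ is probably not ideal. Thanks!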
targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator , result_value, unit_value , experiment_date , discipline, activity_flag, targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator , result_value, unit_value , experiment_date , discipline, activity_flag
51, cpd-7788990 ,9999,0, IC50 ,,10, uM , 2006-07-01 00:00:00 , Biochemical , 51, cpd-7788990 ,1212,2323, IC50 ,,100, uM , 2006-09-01 00:00:00 , Cell ,
51, cpd-7788990 ,4444,5555, Ki , > ,5, uM , 2006-08-01 00:00:00 , Enzyme , 51, cpd-7788990 ,1212,2323, IC50 ,,100, uM , 2006-09-01 00:00:00 , Cell ,
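The intermediate whole-row merge above can be sketched the same way. This is an illustration in Python under stated assumptions, not an awk solution: `join_with_cell_rows` is a hypothetical name, fields are stripped of surrounding spaces, and Cell rows are indexed by cpd_number so that a large file does not need a nested scan.

```python
import csv
import io
from collections import defaultdict


def join_with_cell_rows(csv_text):
    """Append the matching Cell row to each non-Cell row of the same cpd_number."""
    rows = [[field.strip() for field in raw]
            for raw in csv.reader(io.StringIO(csv_text)) if raw]
    header, data = rows[0], rows[1:]

    # Index Cell rows by cpd_number (column 2, index 1).
    cells_by_cpd = defaultdict(list)
    for row in data:
        if row[9] == "Cell":
            cells_by_cpd[row[1]].append(row)

    joined = [header + header]  # doubled header, as in the listing above
    for row in data:
        if row[9] == "Cell":
            continue
        for cell in cells_by_cpd.get(row[1], []):
            joined.append(row + cell)  # non-Cell fields first, then Cell fields
    return joined
```

Each joined row has 22 fields, so the final column selection and the division of result_value can be done in a second step on fixed positions.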
Additional example:
targetID , cpd_number , assay_id , alt_assay_id , type_desc , operator , result_value, unit_value , experiment_date , discipline, activity_flag
51, cpd-7788990 ,9999,0, IC50 ,,10, uM , 2006-07-01 00:00:00 , Biochemical ,
51, cpd-7788990 ,4444,5555, Ki , > ,5, uM , 2006-08-01 00:00:00 , Enzyme ,
51, cpd-7788990 ,1212,2323, IC50 ,,100, uM , 2006-09-01 00:00:00 , Cell ,
51, cpd-7788990 ,8888,9999, IC50 ,,200, uM , 2006-09-01 00:00:00 , Cell ,
Output:
targetID , cpd_number , Cell_assay_id , Cell_alt_assay_id , type_desc , assay_id , alt_assay_id , type_desc ,Operator, result_value, unit_value ,Cell_experiment_date,experiment_date, Cell_discipline , discipline
51,cpd-7788990,1212,2323, IC50 ,9999,0,IC50,,10,uM, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,Cell,Biochemical
51,cpd-7788990,1212,2323, IC50 ,4444,5555,Ki,>,20,uM, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,Cell,Enzyme
51,cpd-7788990,8888,9999, IC50 ,9999,0,IC50,,20,uM, 2006-09-01 00:00:00 , 2006-07-01 00:00:00 ,Cell,Biochemical
51,cpd-7788990,8888,9999, IC50 ,4444,5555,Ki,>,40,uM, 2006-09-01 00:00:00 , 2006-08-01 00:00:00 ,Cell,Enzyme
【Comments】:
-
What have you tried so far?