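【Posted】: 2018-07-23 13:56:02
【Question】:
I am struggling to convert my messy data.frame, whose entries are of unequal length, from wide to long format and then collapse (summarise) the new variable. It currently looks like this, with Gene as one variable and GO_terms as a variable holding multiple comma-separated values: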
Gene GO_terms
AA1006G00001 GO:0098655, GO:0008643, GO:0005351, GO:0005886, GO:0016021
AA100G00001 GO:0098655, GO:0009944, GO:0009862, GO:0010075, GO:0010014, GO:0009855, GO:0010310
AA100G00002 GO:0098655, GO:0008643, GO:0005886
The first step is to convert it to "long" format, so that it looks like this:
Gene GO_terms
AA1006G00001 GO:0098655
AA1006G00001 GO:0008643
AA1006G00001 GO:0005351
AA1006G00001 GO:0005886
AA1006G00001 GO:0016021
AA100G00001 GO:0098655
AA100G00001 GO:0009944
AA100G00001 GO:0009862
AA100G00001 GO:0010075
AA100G00001 GO:0010014
AA100G00001 GO:0009855
AA100G00001 GO:0010310
AA100G00002 GO:0098655
AA100G00002 GO:0008643
AA100G00002 GO:0005886
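The wide-to-long step could be sketched with `tidyr::separate_rows()`, which splits a delimited column into one row per value. This is only a minimal sketch: the data frame `df` is retyped from the three input rows of the question, and the clean column names `Gene` and `GO_terms` are an assumption.

```r
library(tidyr)

# Input rows retyped from the question (clean column names are an assumption)
df <- data.frame(
  Gene = c("AA1006G00001", "AA100G00001", "AA100G00002"),
  GO_terms = c(
    "GO:0098655, GO:0008643, GO:0005351, GO:0005886, GO:0016021",
    "GO:0098655, GO:0009944, GO:0009862, GO:0010075, GO:0010014, GO:0009855, GO:0010310",
    "GO:0098655, GO:0008643, GO:0005886"
  ),
  stringsAsFactors = FALSE
)

# One row per Gene/GO term pair; split on a comma followed by optional whitespace
long <- separate_rows(df, GO_terms, sep = ",\\s*")
print(long)
```

Because each cell is split independently, rows with different numbers of GO terms need no padding.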
Then I would like to reorganise this data table by swapping the two variables, so that it is collapsed by GO term as follows:
GO_terms Genes
GO:0005351 AA1006G00001
GO:0005886 AA1006G00001, AA100G00002
GO:0008643 AA1006G00001, AA100G00002
GO:0009855 AA100G00001
GO:0009862 AA100G00001
GO:0009944 AA100G00001
GO:0010014 AA100G00001
GO:0010075 AA100G00001
GO:0010310 AA100G00001
GO:0016021 AA1006G00001
GO:0098655 AA1006G00001, AA100G00001, AA100G00002
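The variable holding the genes can be either a single column (with comma-separated values) or spread across multiple columns.
Can anyone provide a tidyr, reshape2 or dplyr solution?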
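One possible way to get the collapsed table is to split the terms into rows, group by term, and paste the genes back together. The sketch below assumes dplyr and tidyr are installed and that the columns are named `Gene` and `GO_terms` without stray whitespace; `df` is retyped from the question's input.

```r
library(dplyr)
library(tidyr)

# Input rows retyped from the question (clean column names are an assumption)
df <- data.frame(
  Gene = c("AA1006G00001", "AA100G00001", "AA100G00002"),
  GO_terms = c(
    "GO:0098655, GO:0008643, GO:0005351, GO:0005886, GO:0016021",
    "GO:0098655, GO:0009944, GO:0009862, GO:0010075, GO:0010014, GO:0009855, GO:0010310",
    "GO:0098655, GO:0008643, GO:0005886"
  ),
  stringsAsFactors = FALSE
)

collapsed <- df %>%
  separate_rows(GO_terms, sep = ",\\s*") %>%       # long format: one GO term per row
  group_by(GO_terms) %>%                           # one group per GO term
  summarise(Genes = paste(Gene, collapse = ", "))  # comma-separated genes per term

print(collapsed)
```

If one gene column per match is preferred over a single comma-separated string, the long table could be reshaped wide again instead of calling `paste()`.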
Edit: the dput() of the table is:
structure(list(`Gene ` = c("AA1006G00001\t", "AA100G00001\t",
"AA100G00002\t"), `GO_terms ` = c("GO:0098655, GO:0008643, GO:0005351, GO:0005886, GO:0016021\t\t",
"GO:0098655, GO:0009944, GO:0009862, GO:0010075, GO:0010014, GO:0009855, GO:0010310",
"GO:0098655, GO:0008643, GO:0005886")), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"), spec = structure(list(
cols = list(`Gene ` = structure(list(), class = c("collector_character",
"collector")), `GO_terms ` = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector"))), class = "col_spec"))
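Note that in the dput() output both column names end in whitespace and several values end in tab characters, which can make later column references and splits misbehave. A minimal base R sketch for cleaning this up follows; `raw` is a simplified stand-in for the structure above (a plain data.frame rather than a tibble), which is an assumption.

```r
# Simplified stand-in for the dput() above, keeping the stray whitespace
raw <- data.frame(
  `Gene ` = c("AA1006G00001\t", "AA100G00001\t", "AA100G00002\t"),
  `GO_terms ` = c(
    "GO:0098655, GO:0008643, GO:0005351, GO:0005886, GO:0016021\t\t",
    "GO:0098655, GO:0009944, GO:0009862, GO:0010075, GO:0010014, GO:0009855, GO:0010310",
    "GO:0098655, GO:0008643, GO:0005886"
  ),
  check.names = FALSE,       # keep the trailing space in the column names
  stringsAsFactors = FALSE
)

names(raw) <- trimws(names(raw))  # "Gene " becomes "Gene"
raw[] <- lapply(raw, trimws)      # strip leading/trailing spaces and tabs from every value

str(raw)
```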
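【Comments】:
-
What have you tried so far? Where are you stuck? I think SO has plenty of answers for each part of your question.
-
Could you post the first data frame using dput? That would be more helpful, because you are dealing with a structural problem that we cannot recover from copy-pasted text.
-
@AndreElrico I first tried separating GO_terms into columns and then calling melt() with the Gene variable. The result was not right: it had many blank values (because my GO_terms are of unequal length?).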
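The blank values described in the last comment are consistent with how a fixed-width split behaves: shorter rows get padded with NA, and those NAs survive reshaping unless they are dropped. The sketch below illustrates this with tidyr's `separate()` and `gather()` rather than reshape2's `melt()`; the column names `GO_1` to `GO_7` and `slot` are hypothetical, and `df` is retyped from the question.

```r
library(tidyr)

# Input rows retyped from the question (clean column names are an assumption)
df <- data.frame(
  Gene = c("AA1006G00001", "AA100G00001", "AA100G00002"),
  GO_terms = c(
    "GO:0098655, GO:0008643, GO:0005351, GO:0005886, GO:0016021",
    "GO:0098655, GO:0009944, GO:0009862, GO:0010075, GO:0010014, GO:0009855, GO:0010310",
    "GO:0098655, GO:0008643, GO:0005886"
  ),
  stringsAsFactors = FALSE
)

# Splitting into a fixed number of columns pads the shorter rows with NA
wide <- separate(df, GO_terms, into = paste0("GO_", 1:7),
                 sep = ",\\s*", fill = "right")

# Reshaping to long while dropping the NA padding removes the blank rows
long <- gather(wide, key = "slot", value = "GO_terms", -Gene, na.rm = TRUE)
long$slot <- NULL  # the helper column is no longer needed
```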
Tags: r dataframe dplyr tidyr reshape2