Answers to: Unpack a R data frame column of lists

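【Question title】: Unpack a R data frame column of lists
【Posted】: 2016-05-09 14:21:11
【Question description】:

In R, I have a data.frame (or data.table). One column of this data.frame holds, in every cell, a list of lists (a data.frame).

I can turn this column into a single data.frame with rbindlist(data$Subdocuments), but the other columns of the original data.frame are then missing.

How can I efficiently unpack this column of lists while keeping the other columns attached to the new data.frame?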

library(data.table)

data <- structure(list(ID = c("1", "2", "3"), Country = c("Netherlands", 
"Germany", "Belgium"), Subdocuments = list(structure(list(Value = c("5", 
"5", "1", "3", "2", "1", "1", "1", "2", "5", "3", "2", "4", "5", 
"5", "2"), Label = c("Test1", "Test2", "Test3", "Test4", "Test5", 
"Test6", "Test7", "Test8", "Test9", "Test10", "Test11", "Test12", 
"Test13", "Test14", "Test15", "Test16"), Year = c(2001, 2002, 
2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 
2014, 2015, 2016)), .Names = c("Value", "Label", "Year"), class = "data.frame", row.names = c(NA, 
16L)), structure(list(Value = c("5", "4", "3", "2", "2", "2", 
"1", "1", "5", "4", "4", "4", "5", "1", "1", "3"), Label = c("Test1", 
"Test2", "Test3", "Test4", "Test5", "Test6", "Test7", "Test8", 
"Test9", "Test10", "Test11", "Test12", "Test13", "Test14", "Test15", 
"Test16"), Year = c(2001, 2002, 2003, 2004, 2005, 2006, 2007, 
2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016)), .Names = c("Value", 
"Label", "Year"), class = "data.frame", row.names = c(NA, 16L
)), structure(list(Value = c("1", "2", "3", "1", "1", "4", "5", 
"1", "2", "3", "2", "2", "1", "1", "1", "5"), Label = c("Test1", 
"Test2", "Test3", "Test4", "Test5", "Test6", "Test7", "Test8", 
"Test9", "Test10", "Test11", "Test12", "Test13", "Test14", "Test15", 
"Test16"), Year = c(2001, 2002, 2003, 2004, 2005, 2006, 2007, 
2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016)), .Names = c("Value", 
"Label", "Year"), class = "data.frame", row.names = c(NA, 16L
)))), .Names = c("ID", "Country", "Subdocuments"), row.names = c(NA, 
-3L), class = "data.frame")
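
The problem described above can be reproduced on a much smaller table. This is only a sketch: `toy` and its values are made up for illustration and are not part of the original data.

```r
library(data.table)

# Hypothetical toy data: two parent rows, each holding a small data.frame.
toy <- data.frame(ID = c("1", "2"),
                  Country = c("Netherlands", "Germany"),
                  stringsAsFactors = FALSE)
toy$Subdocuments <- list(
  data.frame(Value = c("5", "1"), Year = c(2001, 2002), stringsAsFactors = FALSE),
  data.frame(Value = "4", Year = 2001, stringsAsFactors = FALSE)
)

# Stacking the nested data.frames works, but the parent columns are lost.
flat <- rbindlist(toy$Subdocuments)
names(flat)  # only the nested columns; ID and Country are gone
nrow(flat)   # the rows of both nested data.frames, stacked
```

Nothing in `flat` records which parent row each stacked row came from, which is the gap the question asks how to close.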

【Question discussion】:

Your data shows a lot of NA rows in the list column.
@akrun Sorry, there was a problem with the data.frame input. I have fixed it.
Maybe setDT(data)[, .SD[[1L]][[1L]], by=.(ID, Country)]?

Tags: r data.table

【Solution 1】:

I would do:

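setDT(data)  # convert data to a data.table by reference

dfcol   = "Subdocuments"               # the list column holding the nested data.frames
othcols = setdiff(names(data), dfcol)  # every other column

# stack the nested data.frames; `id` partially matches the idcol argument,
# so subs gets a .id column holding the source row number
subs = rbindlist(data[[dfcol]], id=TRUE)
# look up the other columns for each stacked row via .id
subs[, (othcols) := data[.id, othcols, with=FALSE]]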

If you don't want to setDT(data), you can change the lookup in the last line to data[.id, othcols].
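
A minimal sketch of the same approach on a toy table, assuming data.table 1.9.6 or later. The table `toy` and its values are made up for illustration; the steps mirror the answer's code, with `idcol` written out in full instead of relying on partial argument matching.

```r
library(data.table)

# Hypothetical toy data, much smaller than the question's `data`.
toy <- data.frame(ID = c("1", "2"),
                  Country = c("Netherlands", "Germany"),
                  stringsAsFactors = FALSE)
toy$Subdocuments <- list(
  data.frame(Value = c("5", "1"), Year = c(2001, 2002), stringsAsFactors = FALSE),
  data.frame(Value = "4", Year = 2001, stringsAsFactors = FALSE)
)
setDT(toy)

dfcol   <- "Subdocuments"
othcols <- setdiff(names(toy), dfcol)

# .id records which row of `toy` each stacked row came from.
subs <- rbindlist(toy[[dfcol]], idcol = TRUE)

# Index the parent table by .id to repeat ID and Country as needed.
subs[, (othcols) := toy[.id, othcols, with = FALSE]]

print(subs)
```

Because the parent columns are looked up by row number, they keep their original types, and each parent value is copied only once per stacked row.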

【Discussion】:

When I run subs = rbindlist(data[[dfcol]], id=TRUE), I get the following error: Error in rbindlist(data[[dfcol]], id = TRUE) : unused argument (id = TRUE)
@Berecht Please update your data.table to 1.9.6, since the id feature is recent.
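
For anyone stuck on an older data.table, the row index can also be built by hand. The following is only a sketch under that assumption; `subdocs` and `parent_row` are hypothetical names, and the version threshold is the one given in the comment above.

```r
library(data.table)

# TRUE when the installed data.table is recent enough for rbindlist(idcol = ).
has_idcol <- packageVersion("data.table") >= "1.9.6"

# Hypothetical stand-in for data[["Subdocuments"]].
subdocs <- list(
  data.frame(Value = c("5", "1"), stringsAsFactors = FALSE),
  data.frame(Value = "4", stringsAsFactors = FALSE)
)

# Version-independent fallback: repeat each list position once per nested row.
parent_row <- rep(seq_along(subdocs), vapply(subdocs, nrow, integer(1)))

flat <- rbindlist(subdocs)
flat[, .id := parent_row]

print(flat)
```

The hand-built `.id` column plays the same role as the one `idcol = TRUE` would add, so the lookup step of Solution 1 can be applied to it unchanged.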

【Solution 2】:

This may help:

library(data.table)
rbindlist(setNames(data[[3]], do.call(paste, data[1:2])), idcol=TRUE)[
        , c("ID", "Country") := tstrsplit(.id, " ")][, .id := NULL][]
#    Value  Label Year ID     Country
# 1:     5  Test1 2001  1 Netherlands
# 2:     5  Test2 2002  1 Netherlands
# 3:     1  Test3 2003  1 Netherlands
# 4:     3  Test4 2004  1 Netherlands
# 5:     2  Test5 2005  1 Netherlands
# 6:     1  Test6 2006  1 Netherlands
# 7:     1  Test7 2007  1 Netherlands
# 8:     1  Test8 2008  1 Netherlands
# 9:     2  Test9 2009  1 Netherlands
#10:     5 Test10 2010  1 Netherlands
#11:     3 Test11 2011  1 Netherlands
#12:     2 Test12 2012  1 Netherlands
#13:     4 Test13 2013  1 Netherlands
#14:     5 Test14 2014  1 Netherlands
#15:     5 Test15 2015  1 Netherlands
#16:     2 Test16 2016  1 Netherlands
#17:     5  Test1 2001  2     Germany
#18:     4  Test2 2002  2     Germany
#19:     3  Test3 2003  2     Germany
#20:     2  Test4 2004  2     Germany
#21:     2  Test5 2005  2     Germany
#22:     2  Test6 2006  2     Germany
#23:     1  Test7 2007  2     Germany
#24:     1  Test8 2008  2     Germany
#25:     5  Test9 2009  2     Germany
#26:     4 Test10 2010  2     Germany
#27:     4 Test11 2011  2     Germany
#28:     4 Test12 2012  2     Germany
#29:     5 Test13 2013  2     Germany
#30:     1 Test14 2014  2     Germany
#31:     1 Test15 2015  2     Germany
#32:     3 Test16 2016  2     Germany
#33:     1  Test1 2001  3     Belgium
#34:     2  Test2 2002  3     Belgium
#35:     3  Test3 2003  3     Belgium
#36:     1  Test4 2004  3     Belgium
#37:     1  Test5 2005  3     Belgium
#38:     4  Test6 2006  3     Belgium
#39:     5  Test7 2007  3     Belgium
#40:     1  Test8 2008  3     Belgium
#41:     2  Test9 2009  3     Belgium
#42:     3 Test10 2010  3     Belgium
#43:     2 Test11 2011  3     Belgium
#44:     2 Test12 2012  3     Belgium
#45:     1 Test13 2013  3     Belgium
#46:     1 Test14 2014  3     Belgium
#47:     1 Test15 2015  3     Belgium
#48:     5 Test16 2016  3     Belgium

Note: `data` is taken from the OP's own post.

Or using dplyr:

library(dplyr)
bind_rows(data[[3]], .id="ID") %>% 
            left_join(data[-3], ., by = "ID")
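
One thing to watch in the dplyr version: with an unnamed list, bind_rows() fills the .id column with the element positions, so the join lines up only because the question's ID values happen to be "1", "2", "3". A hedged variant, on made-up toy data with non-positional IDs, names the list elements by ID first:

```r
library(dplyr)

# Hypothetical toy data whose IDs are not row positions.
toy <- data.frame(ID = c("NL", "DE"),
                  Country = c("Netherlands", "Germany"),
                  stringsAsFactors = FALSE)
toy$Subdocuments <- list(
  data.frame(Value = c("5", "1"), Year = c(2001, 2002), stringsAsFactors = FALSE),
  data.frame(Value = "4", Year = 2001, stringsAsFactors = FALSE)
)

# Name the list elements by ID so that .id carries the real key.
subs <- bind_rows(setNames(toy$Subdocuments, toy$ID), .id = "ID")

# Join the remaining parent columns back on that key.
out <- left_join(toy[-3], subs, by = "ID")

print(out)
```

This assumes the ID column is a unique character key; a numeric ID would have to be converted to character before the join, since .id is always character.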

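【Discussion】:

I get the following error: Error in rbindlist(setNames(data[[3]], do.call(paste, data[1:2])), idcol = TRUE) : unused argument (idcol = TRUE). Maybe the problem is the version of the data.table package I am using (1.9.4). I will check that now.
@Berecht It is a version issue. I am using 1.9.6.
@Berecht This answer pastes the columns together and then splits them apart, which converts everything to strings and loses information. On top of that, it copies the same data several times for no reason. Definitely not the way to go.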
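
The objection in the last comment can be seen on a small case. This is an illustrative sketch only; the table `parent`, with a numeric ID and a country name containing a space, is hypothetical and not taken from the question.

```r
library(data.table)

# Hypothetical parent table: numeric ID, and one country name with a space.
parent <- data.table(ID = c(1, 2), Country = c("Netherlands", "United Kingdom"))

# Paste the columns into one key, as the rbindlist() variant of Solution 2 does.
key <- do.call(paste, parent)

# Split the key back apart on spaces.
parts <- tstrsplit(key, " ")

class(parts[[1]])  # the numeric ID comes back as character
length(parts)      # one piece more than the two original columns
```

The round trip through a pasted string changes the ID's type, and any value that contains the separator is split into extra pieces, whereas indexing the parent table by row number (as in Solution 1) avoids both problems.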