【Question title】: Unpack an R data frame column of lists
【Posted】: 2016-05-09 14:21:11
【Question】:

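In R, I have a data.frame (or data.table). In this data.frame, I have a column where each cell consists of a list of lists (a data.frame).

I can convert this column into a single data.frame with rbindlist(data$Subdocuments), but then the other columns of the original data.frame are missing.

How can I efficiently unpack this column of lists while keeping the other columns attached to the new data.frame?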

library(data.table)

data <- structure(list(ID = c("1", "2", "3"), Country = c("Netherlands", 
"Germany", "Belgium"), Subdocuments = list(structure(list(Value = c("5", 
"5", "1", "3", "2", "1", "1", "1", "2", "5", "3", "2", "4", "5", 
"5", "2"), Label = c("Test1", "Test2", "Test3", "Test4", "Test5", 
"Test6", "Test7", "Test8", "Test9", "Test10", "Test11", "Test12", 
"Test13", "Test14", "Test15", "Test16"), Year = c(2001, 2002, 
2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 
2014, 2015, 2016)), .Names = c("Value", "Label", "Year"), class = "data.frame", row.names = c(NA, 
16L)), structure(list(Value = c("5", "4", "3", "2", "2", "2", 
"1", "1", "5", "4", "4", "4", "5", "1", "1", "3"), Label = c("Test1", 
"Test2", "Test3", "Test4", "Test5", "Test6", "Test7", "Test8", 
"Test9", "Test10", "Test11", "Test12", "Test13", "Test14", "Test15", 
"Test16"), Year = c(2001, 2002, 2003, 2004, 2005, 2006, 2007, 
2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016)), .Names = c("Value", 
"Label", "Year"), class = "data.frame", row.names = c(NA, 16L
)), structure(list(Value = c("1", "2", "3", "1", "1", "4", "5", 
"1", "2", "3", "2", "2", "1", "1", "1", "5"), Label = c("Test1", 
"Test2", "Test3", "Test4", "Test5", "Test6", "Test7", "Test8", 
"Test9", "Test10", "Test11", "Test12", "Test13", "Test14", "Test15", 
"Test16"), Year = c(2001, 2002, 2003, 2004, 2005, 2006, 2007, 
2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016)), .Names = c("Value", 
"Label", "Year"), class = "data.table", row.names = c(NA, 16L
)))), .Names = c("ID", "Country", "Subdocuments"), row.names = c(NA, 
-3L), class = "data.frame")

【Comments】:

  • Your data shows a lot of NA rows in the list column.
  • @akrun Sorry, there was a problem with the data.frame input. I fixed it.
  • Maybe setDT(data)[, .SD[[1L]][[1L]], by=.(ID, Country)]?

Tags: r data.table
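The grouped one-liner suggested in the last comment can be tried on a small example. This is only a sketch: the `toy` table and its values are made up for illustration, and it assumes every (ID, Country) pair occurs in exactly one row, so that each group holds a single sub-table.

```r
library(data.table)

# Hypothetical toy data: two parent rows, each holding a small data.frame.
toy <- data.frame(ID = c("1", "2"),
                  Country = c("Netherlands", "Germany"),
                  stringsAsFactors = FALSE)
toy$Subdocuments <- list(
  data.frame(Value = c("5", "1"), Year = c(2001, 2002), stringsAsFactors = FALSE),
  data.frame(Value = "4", Year = 2001, stringsAsFactors = FALSE)
)
setDT(toy)

# Within each group, .SD contains only the Subdocuments list column.
# .SD[[1L]] is that one-element list and [[1L]] extracts the data.frame,
# whose columns are then returned next to the grouping columns.
res <- toy[, .SD[[1L]][[1L]], by = .(ID, Country)]
print(res)
```

The grouping columns are repeated once per row of the matching sub-table, so the result has one row per sub-table row.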


【Solution 1】:

I'd do

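setDT(data)  # convert data to a data.table by reference

dfcol   = "Subdocuments"
othcols = setdiff(names(data), dfcol)

# id= partially matches rbindlist()'s idcol argument; the resulting .id
# column holds the row of data that each sub-table came from
subs = rbindlist(data[[dfcol]], id=TRUE)
# look up the remaining columns of data by that row index
subs[, (othcols) := data[.id, othcols, with=FALSE]]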

If you don't want to setDT(data), you can change the last line to use data[.id, othcols] instead.
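To see what the `.id` column does, here is a minimal sketch of the same steps on made-up data. The `toy` table and its values are hypothetical, and the argument is spelled out as `idcol = TRUE` rather than the abbreviated `id = TRUE`.

```r
library(data.table)

# Hypothetical toy data: two parent rows, each holding a small data.frame.
toy <- data.frame(ID = c("1", "2"),
                  Country = c("Netherlands", "Germany"),
                  stringsAsFactors = FALSE)
toy$Subdocuments <- list(
  data.frame(Value = c("5", "1"), Year = c(2001, 2002), stringsAsFactors = FALSE),
  data.frame(Value = "4", Year = 2001, stringsAsFactors = FALSE)
)
setDT(toy)

dfcol   <- "Subdocuments"
othcols <- setdiff(names(toy), dfcol)

# idcol = TRUE adds an integer column named ".id" holding the position of
# the list element (that is, the parent row) each stacked row came from.
subs <- rbindlist(toy[[dfcol]], idcol = TRUE)

# Index the parent table by .id to repeat each parent's columns per child row.
subs[, (othcols) := toy[.id, othcols, with = FALSE]]

# Drop the helper column once the parent columns are attached.
subs[, .id := NULL]
print(subs)
```

Because the parent columns are looked up by row position, they keep their original types and nothing has to be pasted into a string key.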

【Comments】:

  • When running subs = rbindlist(data[[dfcol]], id=TRUE) I get the following error: Error in rbindlist(data[[dfcol]], id = TRUE) : unused argument (id = TRUE)
  • @Berecht Please update your data.table to 1.9.6, since the id feature is recent.
【Solution 2】:

This may help

library(data.table)
rbindlist(setNames(data[[3]], do.call(paste, data[1:2])), idcol=TRUE)[
        , c("ID", "Country") := tstrsplit(.id, " ")][, .id := NULL][]
#    Value  Label Year ID     Country
# 1:     5  Test1 2001  1 Netherlands
# 2:     5  Test2 2002  1 Netherlands
# 3:     1  Test3 2003  1 Netherlands
# 4:     3  Test4 2004  1 Netherlands
# 5:     2  Test5 2005  1 Netherlands
# 6:     1  Test6 2006  1 Netherlands
# 7:     1  Test7 2007  1 Netherlands
# 8:     1  Test8 2008  1 Netherlands
# 9:     2  Test9 2009  1 Netherlands
#10:     5 Test10 2010  1 Netherlands
#11:     3 Test11 2011  1 Netherlands
#12:     2 Test12 2012  1 Netherlands
#13:     4 Test13 2013  1 Netherlands
#14:     5 Test14 2014  1 Netherlands
#15:     5 Test15 2015  1 Netherlands
#16:     2 Test16 2016  1 Netherlands
#17:     5  Test1 2001  2     Germany
#18:     4  Test2 2002  2     Germany
#19:     3  Test3 2003  2     Germany
#20:     2  Test4 2004  2     Germany
#21:     2  Test5 2005  2     Germany
#22:     2  Test6 2006  2     Germany
#23:     1  Test7 2007  2     Germany
#24:     1  Test8 2008  2     Germany
#25:     5  Test9 2009  2     Germany
#26:     4 Test10 2010  2     Germany
#27:     4 Test11 2011  2     Germany
#28:     4 Test12 2012  2     Germany
#29:     5 Test13 2013  2     Germany
#30:     1 Test14 2014  2     Germany
#31:     1 Test15 2015  2     Germany
#32:     3 Test16 2016  2     Germany
#33:     1  Test1 2001  3     Belgium
#34:     2  Test2 2002  3     Belgium
#35:     3  Test3 2003  3     Belgium
#36:     1  Test4 2004  3     Belgium
#37:     1  Test5 2005  3     Belgium
#38:     4  Test6 2006  3     Belgium
#39:     5  Test7 2007  3     Belgium
#40:     1  Test8 2008  3     Belgium
#41:     2  Test9 2009  3     Belgium
#42:     3 Test10 2010  3     Belgium
#43:     2 Test11 2011  3     Belgium
#44:     2 Test12 2012  3     Belgium
#45:     1 Test13 2013  3     Belgium
#46:     1 Test14 2014  3     Belgium
#47:     1 Test15 2015  3     Belgium
#48:     5 Test16 2016  3     Belgium

Note: the "data" is from the OP's own post.


Or using dplyr

library(dplyr)
bind_rows(data[[3]], .id="ID") %>% 
            left_join(data[-3], ., by = "ID")

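【Comments】:

  • I get the following error: Error in rbindlist(setNames(data[[3]], do.call(paste, data[1:2])), idcol = TRUE) : unused argument (idcol = TRUE). Maybe the problem is the version of the data.table package I'm using (1.9.4). I'll check that now.
  • @Berecht It's a version issue. I use 1.9.6.
  • @Berecht This answer pastes the columns together and then splits them, thereby converting everything to strings and losing information. On top of that, it copies the same data multiple times for no reason. Definitely not the way to go.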
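The objection in the last comment can be illustrated with a small sketch. The `parent` table below is hypothetical: unlike the OP's data, it has an integer ID and a country name containing a space, which are the two cases where the paste-then-split round trip loses information.

```r
library(data.table)

# Hypothetical parent rows: an integer ID and a country name with a space.
parent <- data.frame(ID = c(1L, 2L),
                     Country = c("Belgium", "United Kingdom"),
                     stringsAsFactors = FALSE)

# Paste the columns into one key, then split it again on spaces.
key   <- do.call(paste, parent)   # "1 Belgium", "2 United Kingdom"
parts <- tstrsplit(key, " ")

class(parts[[1]])   # the IDs come back as character, not integer
length(parts)       # three pieces instead of two, because of the extra space
```

In the OP's data the IDs are already strings and no country name contains a space, so the answer's output is unaffected there. The sketch only shows why the approach does not generalize.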