How do you convert a large list with vectors of different lengths to a data frame? [duplicate]

【Question title】: How do you convert a large list with vectors of different lengths to a data frame? [duplicate]
【Posted】: 2020-02-13 08:29:10
【Question】:

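I have a large list with more than 30,000 elements. The elements are vectors of different lengths, and I want to convert the list to a data frame in which each vector is one row and its values are spread across multiple columns. Here is a mock example of the list: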

lst <- list(a = c(1,2,4,5,6), c = c(7,8,9), c = c(10,11))

The output I want looks like this:

#  [,1] [,2] [,3] [,4] [,5]
#a    1    2    4    5    6
#c    7    8    9   NA   NA
#c   10   11   NA   NA   NA

【Comments】:

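Nice to see a short, reproducible example that includes the desired output!
I think you really want a matrix rather than a data.frame. It's hard to be sure without knowing more about what you're doing, but keep in mind that even in R, tabular data doesn't have to live in a data frame if it isn't column-oriented.
I added timings.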

Tags: r list dataframe

【Solution 1】:

The trick is to make the vectors the same length. Also, it looks like you want a matrix as the output.

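Reduce(function(x, y) {
  # Pad to a common length; assigning a longer length fills with NA
  n <- max(length(x), length(y))
  length(x) <- n
  length(y) <- n
  # Stack as rows; deparse.level = 0 suppresses automatic row labels
  rbind(x, y, deparse.level = 0)
},
list(a = c(1,2,4,5,6), c = c(7,8,9), c = c(10,11)))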

Output

#      [,1] [,2] [,3] [,4] [,5]
# [1,]    1    2    4    5    6
# [2,]    7    8    9   NA   NA
# [3,]   10   11   NA   NA   NA

At this point you can reset the row names.
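As a minimal sketch of those two steps (padding with `length<-`, then restoring the row names), assuming the same mock `lst` as in the question; the variable names `x` and `m` are only illustrative:

```r
lst <- list(a = c(1, 2, 4, 5, 6), c = c(7, 8, 9), c = c(10, 11))

# Padding a single vector: assigning a longer length fills the tail with NA.
x <- c(7, 8, 9)
length(x) <- 5
x

# Same Reduce() call as above, applied to lst.
m <- Reduce(function(x, y) {
  n <- max(length(x), length(y))
  length(x) <- n
  length(y) <- n
  rbind(x, y, deparse.level = 0)
}, lst)

# Restore the list names as row names (a matrix allows duplicate row names).
rownames(m) <- names(lst)
m
```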

Update: timings, for anyone interested:

lst <- list(a = c(1,2,4,5,6), c = c(7,8,9), c = c(10,11))

convert <- function(lst){
  Reduce(function(x,y){
    n <- max(length(x), length(y))
    length(x) <- n
    length(y) <- n
    rbind(x,y,deparse.level = 0)
  },
  lst)
}

convert2 <- function(lst){
  t(sapply(lst, "length<-", max(lengths(lst))))
}

convert3 <- function(lst){
  t(as.data.frame(lapply(lst, "length<-", max(lengths(lst)))))
}

microbenchmark::microbenchmark(convert(lst),
                               convert2(lst),
                               convert3(lst))

#Unit: microseconds
#          expr     min       lq      mean   median      uq      max neval
#  convert(lst)  41.962  50.0725 106.47314  62.2375  68.408 4392.895   100
# convert2(lst)  28.209  33.6755  69.93855  40.7280  45.136 2298.002   100
# convert3(lst) 292.673 306.6005 381.59504 319.1180 333.399 2887.929   100
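
Before comparing timings it can help to confirm that the three functions agree on the values they return. The following is a sketch of such a check on the mock list only; it strips dimnames with `unname()` because the three results label their rows differently. The helper name `pad_pair` is introduced here for illustration and is not part of the answer.

```r
lst <- list(a = c(1, 2, 4, 5, 6), c = c(7, 8, 9), c = c(10, 11))

# pad_pair is a hypothetical name for the anonymous function used in convert().
pad_pair <- function(x, y) {
  n <- max(length(x), length(y))
  length(x) <- n
  length(y) <- n
  rbind(x, y, deparse.level = 0)
}

convert  <- function(lst) Reduce(pad_pair, lst)
convert2 <- function(lst) t(sapply(lst, "length<-", max(lengths(lst))))
convert3 <- function(lst) t(as.data.frame(lapply(lst, "length<-", max(lengths(lst)))))

# Compare values only, ignoring row and column labels.
r1 <- unname(convert(lst))
r2 <- unname(convert2(lst))
r3 <- unname(convert3(lst))

isTRUE(all.equal(r1, r2))
isTRUE(all.equal(r2, r3))
```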

【Comments】:

【Solution 2】:

You can do this:

t(as.data.frame(lapply(lst, "length<-", max(lengths(lst)))))

#    [,1] [,2] [,3] [,4] [,5]
#a      1    2    4    5    6
#c      7    8    9   NA   NA
#c.1   10   11   NA   NA   NA

Or, as @Andrew pointed out, you can do this:

t(sapply(lst, "length<-", max(lengths(lst))))

#  [,1] [,2] [,3] [,4] [,5]
#a    1    2    4    5    6
#c    7    8    9   NA   NA
#c   10   11   NA   NA   NA
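
Both calls above return a matrix. If a data frame is needed, as the question originally asked, one possible sketch is below. It assumes unique list names (`a`, `b`, `c` here, which differ from the mock list) because data frame row names must be unique, and it swaps in `vapply()` so the padded result is always a numeric matrix; `n`, `m` and `df` are illustrative names.

```r
# Assumption: names are made unique for this example.
lst <- list(a = c(1, 2, 4, 5, 6), b = c(7, 8, 9), c = c(10, 11))

n <- max(lengths(lst))

# Pad every vector to length n, then transpose so each vector is one row.
m <- t(vapply(lst, function(x) { length(x) <- n; x }, numeric(n)))

# Convert the matrix to a data frame; unnamed columns become V1, V2, ...
df <- as.data.frame(m)
df
```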

【Comments】:

There's no need to wrap it in as.data.frame if you transpose afterwards. You can use sapply, given that t converts to a matrix anyway.
This returns an error: Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 1, 0

【Solution 3】:

Here is a base R option:

# Create a vector for number of times an NA needs to be padded
na_nums <- max(lengths(lst)) - lengths(lst)

# Transpose results after padding NAs using mapply
t(mapply(c, lst, sapply(na_nums, rep, x = NA)))
  [,1] [,2] [,3] [,4] [,5]
a    1    2    4    5    6
c    7    8    9   NA   NA
c   10   11   NA   NA   NA
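
The same idea can be written out step by step, which makes the intermediate values easier to inspect. This is a sketch rather than the answer's exact code: it swaps in `lapply()` and `Map()`, which always return lists, so the result does not depend on how `sapply()` and `mapply()` simplify their output. The names `pads`, `padded` and `m` are illustrative.

```r
lst <- list(a = c(1, 2, 4, 5, 6), c = c(7, 8, 9), c = c(10, 11))

# Number of NAs each vector is missing relative to the longest one.
na_nums <- max(lengths(lst)) - lengths(lst)

# One vector of NAs per list element (length 0 for the longest vector).
pads <- lapply(na_nums, function(k) rep(NA, k))

# Append each pad to its vector, then stack the padded vectors as rows.
padded <- Map(c, lst, pads)
m <- do.call(rbind, padded)
m
```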

【Comments】:

【Solution 4】:

This was my first impulse.

max_len <- max(vapply(lst, 
                      FUN = length, 
                      FUN.VALUE = numeric(1)))

lst <- lapply(lst, 
              function(x, max_len) c(x, rep(NA, max_len - length(x))), 
              max_len)

# Form a matrix
do.call("rbind", lst)

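This is a bit verbose, and some of the other solutions are quite elegant. Since you said your list has more than 30,000 elements, I was curious how these approaches would perform on a list of length 30,000.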

If this is something you need to do often, you may want to go with Andrew's approach.

lst <- list(a = c(1,2,4,5,6), c = c(7,8,9), c = c(10,11))
# build out a list of 30,000 elements.
lst <- lst[sample(1:3, 30000, replace = TRUE)]

library(microbenchmark)
microbenchmark(
  benjamin = {
    max_len <- max(vapply(lst, 
                          FUN = length, 
                          FUN.VALUE = numeric(1)))

    lst <- lapply(lst, 
                  function(x, max_len) c(x, rep(NA, max_len - length(x))), 
                  max_len)

    # Form a matrix
    do.call("rbind", lst)
  }, 
  slava = {
    Reduce(function(x,y){
      n <- max(length(x), length(y))
      length(x) <- n
      length(y) <- n
      rbind(x,y,deparse.level = 0)
    },
    lst)
  }, 
  andrew = {
    na_nums <- max(lengths(lst)) - lengths(lst)

    # Transpose results after padding NAs using mapply
    t(mapply(c, lst, sapply(na_nums, rep, x = NA)))
  }, 
  matt = {
    t(as.data.frame(lapply(lst, "length<-", max(lengths(lst)))))
  }
)

Unit: milliseconds
     expr         min          lq       mean      median          uq        max neval
 benjamin    77.08337    91.42793   117.9376   106.97656   122.53898   191.6612     5
    slava 32383.10840 32962.57589 32976.6662 33071.40314 33180.70634 33285.5372     5
   andrew    60.91803    66.74401    87.1645    71.92043    77.78805   158.4520     5
     matt  1685.09158  1702.19796  1759.2741  1737.01949  1760.86237  1911.1993     5

【Comments】:

Thanks for providing timings on the larger dataset.