Answers: a list of indices to extract observations from a list of data frames

【Question title】: A list of indices to extract observations from a list of data frames
【Posted】: 2021-04-09 21:07:00
【Question】:

I have a list of indices, and I want to use them to extract observations from a list of data frames. A simplified example:

#A list of indices used to extract observations based on the time column from the `dat` dataset
time.index <- list(c(1,2,3), c(4,5,6), c(2,3,4))
#A list of data frames in which observations will be extracted based on the time column 
dat <- list(case1=data.frame(time=1:10, y=rnorm(10)), case2=data.frame(time=1:10, y=rnorm(10)), case3=data.frame(time=1:10, y=rnorm(10)))
#The expected result will be like this:
$case1
   time          y
1     1 -0.8954070
2     2  0.0270242
3     3 -0.4256890

$case2
   time       y
4     4  1.5789
5     5 -0.6692
6     6 -2.3306

$case3
   time       y
2     2 -0.7371
3     3 -0.3271
4     4  0.4128

Does anyone know how to do this? Many thanks!

【Comments】:

Tags: r dataframe extract data-manipulation

【Solution 1】:

In base R, lapply does the job:

setNames(lapply(seq_along(time.index),
   function(x) dat[[x]][dat[[x]]$time %in% time.index[[x]], ]),
   names(dat))
#$case1
#  time          y
#1    1  1.7458360
#2    2 -0.6945523
#3    3 -0.3699472

#$case2
#  time          y
#4    4  0.5407011
#5    5 -0.3895972
#6    6 -1.1165133

#$case3
#  time          y
#2    2 -0.8736470
#3    3  0.1831833
#4    4  1.0551148
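
Note that the filter above matches on the values in the time column, not on row positions. In the question's data the two happen to coincide because time is 1:10. A minimal sketch of the difference, using a hypothetical data frame `shifted` and index vector `idx` that are not part of the original question:

```r
# Hypothetical data frame whose time values do not coincide with row numbers
shifted <- data.frame(time = 11:20, y = (1:10) / 10)
idx <- c(12, 13, 14)

# Match on the values of the time column (the approach used above)
by_value <- shifted[shifted$time %in% idx, ]

# Positional indexing treats idx as row numbers instead;
# rows 12 to 14 do not exist in a 10-row data frame, so the result is all NA
by_position <- shifted[idx, ]

by_value$time                 # 12 13 14
all(is.na(by_position$time))  # TRUE
```

If the indices are meant as row positions rather than time values, `dat[[x]][time.index[[x]], ]` would be the corresponding form.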

【Comments】:

【Solution 2】:

You can use Map:

Map(function(x, y) x[x$time %in% y, ], dat, time.index)

#$case1
#  time     y
#1    1  1.75
#2    2  1.13
#3    3 -1.45

#$case2
#  time     y
#4    4 2.212
#5    5 0.572
#6    6 0.149

#$case3
#  time       y
#2    2 -0.0377
#3    3 -0.1700
#4    4  0.8414

Similarly, using map2 from purrr:

purrr::map2(dat, time.index, ~.x[.x$time %in% .y, ])
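
As a quick sanity check, the base R variants can be compared on the same data. The sketch below assumes a fixed seed (not in the original question) and introduces a hypothetical helper `keep_rows` for readability. `Map` is a thin wrapper around `mapply` with `SIMPLIFY = FALSE`, so that form is included as well:

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible

time.index <- list(c(1, 2, 3), c(4, 5, 6), c(2, 3, 4))
dat <- list(case1 = data.frame(time = 1:10, y = rnorm(10)),
            case2 = data.frame(time = 1:10, y = rnorm(10)),
            case3 = data.frame(time = 1:10, y = rnorm(10)))

# Hypothetical helper: keep the rows whose time value appears in idx
keep_rows <- function(df, idx) df[df$time %in% idx, ]

# lapply over positions, then restore the names by hand
via_lapply <- setNames(
  lapply(seq_along(dat), function(i) keep_rows(dat[[i]], time.index[[i]])),
  names(dat)
)

# Map walks both lists in parallel and keeps the names of the first one
via_Map <- Map(keep_rows, dat, time.index)

# The same call written with mapply directly
via_mapply <- mapply(keep_rows, dat, time.index, SIMPLIFY = FALSE)

identical(via_lapply, via_Map)
identical(via_Map, via_mapply)
```

One practical difference is that Map carries over the names of `dat` on its own, so the setNames step is only needed for the lapply form.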

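【Comments】:

Much better. I was thinking this might be a job for Map.

【Solution 3】:

A simple way (if you can overlook the ugly brackets) is the old-fashioned loop, without the apply family.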

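# Preallocate the result list, then fill it one data frame at a time
res <- vector("list", length(time.index))
for (ii in seq_along(time.index)) {
  res[[ii]] <- dat[[ii]][dat[[ii]]$time %in% time.index[[ii]], ]
}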

res
[[1]]
  time           y
1    1 -0.05802713
2    2 -0.80779933
3    3 -1.77802107

[[2]]
  time          y
4    4  0.3990907
5    5 -1.5834484
6    6 -0.3626801

[[3]]
  time          y
2    2 -1.8585653
3    3  1.0591013
4    4  0.6903189

You can add the names inside the loop or afterwards, like this:

names(res) <- names(dat)

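【Comments】:

Thanks! Do you know of a faster way? Maybe using the *apply family? I have a much larger dataset, so a for loop might be slow.
Ronak Shah posted his answer after mine. I like it, and it seems to avoid the loop (depending on how Map is implemented under the hood). Even if it took longer to run, his answer would be easier to maintain in the future, whereas mine is a bit hard to read and therefore easy to break.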
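
Whether the loop is actually slower can be checked rather than guessed. Below is a rough timing sketch on simulated data. The sizes, the seed, and the names `big_dat` and `big_index` are arbitrary assumptions, and no timings are shown because they depend on the machine. Since Map is a wrapper around mapply, it still iterates over the list internally, so a large gap should not be taken for granted.

```r
set.seed(42)     # assumption: arbitrary seed
n_frames <- 200  # assumption: arbitrary number of data frames
n_rows <- 1000   # assumption: arbitrary number of rows per data frame

# Simulated input: a list of data frames and a matching list of time values
big_dat <- lapply(seq_len(n_frames), function(i) {
  data.frame(time = seq_len(n_rows), y = rnorm(n_rows))
})
big_index <- lapply(seq_len(n_frames), function(i) sample(n_rows, 50))

# Time the preallocated for loop
loop_time <- system.time({
  res_loop <- vector("list", n_frames)
  for (ii in seq_along(big_index)) {
    res_loop[[ii]] <- big_dat[[ii]][big_dat[[ii]]$time %in% big_index[[ii]], ]
  }
})

# Time the Map version
map_time <- system.time(
  res_map <- Map(function(x, y) x[x$time %in% y, ], big_dat, big_index)
)

identical(res_loop, res_map)  # both approaches should give the same list
loop_time["elapsed"]
map_time["elapsed"]
```

For a more careful comparison, each version would need to be run several times, since a single system.time call is noisy.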