How to extract vectors of different lengths from a large data frame depending on multiple conditions in R: answers

[Question title]: How to extract vectors of different lengths from large dataframe depending on multiple conditions in R
[Posted]: 2023-12-22 15:31:01
[Question]:

I have a data frame in R with 3 columns. It looks something like this:

  x      id trialNumber
1 1.4788 subj_01    trial010
2 1.4794 subj_01    trial010
3 1.4823 subj_01    trial010
4 1.4845 subj_01    trial010
5 1.4889 subj_01    trial010
6 1.4901 subj_01    trial010
...
20121 -1.3597 subj_03    trial042
20122 -1.3601 subj_03    trial042
20123 -1.3667 subj_03    trial042
20124 -1.3713 subj_03    trial042
20125 -1.3800 subj_03    trial042
20126 -1.3857 subj_03    trial042

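I want to create a new data frame that holds x in multiple columns, where each column is defined by id and trialNumber. The number of rows differs for every combination of id and trialNumber. The number of rows in the new data frame should equal the largest row count over all id and trialNumber combinations, with shorter columns padded with NA. The result should look like this: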

x1      x2   ... xi
1.4788  1.5678  ...
1.4794  1.5789  ...
1.4823  1.5984  ...
1.4845  ...     ...
1.4889  NA      ...
1.4901  NA      -1.3713
...     ...     -1.3800
NA      ...     -1.3857

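x1 to xi in the new data frame should each correspond to one unique combination of id and trialNumber in the original data frame. For example, x1 would hold all x for which id == 'subj_01' and trialNumber == 'trial010'.

There are many combinations of id and trialNumber, so I do not want to write the subsetting conditions for the original data frame by hand.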

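[Comments]:

A serious question, out of curiosity: why do you want this? Your data is in such a tidy format right now.
I then want to compute rowMeans and confidence intervals for each row of the new data frame. If there is a way to do that with the old data frame, even better.
I think this is an XY problem! You could use aggregate from the base package, or have a look at data.table.
Are you sure you want it by row? Don't you mean by column?
I am not sure aggregate will help me, because it needs a FUN argument that summarizes the column (which I do not want). For example, 'aggregate(df, by=list(df$id,df$trialNumber), FUN=mean, na.rm=FALSE)'.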

Tags: r dataframe extract subset reshape

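[Solution 1]:

If you really want the x values for every combination of trial and subject bound together as columns, you can use the following approach: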

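# Step 1: one vector of x per trial/subject combination
# (pass drop = TRUE to split() to skip combinations that never occur)
step1 <- split(dat2$x, list(dat2$trial, dat2$subject))

# Length of the longest vector; shorter ones are padded up to it
max_length <- max(lengths(step1))

# Step 2: make all vectors the same length, padded with NA
step2 <- lapply(step1, function(x) {
  length(x) <- max_length
  x
})

# Step 3: bind the padded vectors together as columns
res <- do.call(cbind, step2)
res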
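The asker's stated goal in the comments was row means and confidence intervals over the padded columns. A minimal sketch of that follow-up step is shown here, assuming a small made-up data frame `dat` (the values are hypothetical, not from the question) and a t-based 95% interval, which is one common choice and not something the thread specifies.

```r
# Toy data (hypothetical values): two id/trialNumber combinations of unequal length
dat <- data.frame(
  x           = c(1, 2, 3, 4, 10, 20),
  id          = c(rep("subj_01", 4), rep("subj_02", 2)),
  trialNumber = "trial010"
)

# One vector of x per combination; drop = TRUE skips combinations with no rows
pieces <- split(dat$x, list(dat$id, dat$trialNumber), drop = TRUE)

# Pad every vector with NA up to the longest one and bind them as columns
n_max  <- max(lengths(pieces))
padded <- sapply(pieces, function(v) { length(v) <- n_max; v })

# Row-wise summaries that ignore the NA padding
row_mean <- rowMeans(padded, na.rm = TRUE)
row_n    <- rowSums(!is.na(padded))
row_sd   <- apply(padded, 1, sd, na.rm = TRUE)

# t-based 95% half-width per row; pmax() keeps the degrees of freedom valid,
# and rows with a single value still get NA limits because their sd is NA
half_width <- qt(0.975, df = pmax(row_n - 1, 1)) * row_sd / sqrt(row_n)

summary_by_row <- data.frame(
  mean  = row_mean,
  lower = row_mean - half_width,
  upper = row_mean + half_width
)
```

Rows near the bottom of the padded matrix are averaged over fewer combinations, so their intervals rest on less data; keeping `row_n` next to the summary makes that visible.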

Code used to generate the data:

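set.seed(100)

# All combinations of 10 trials and 3 subjects
dat1 <- expand.grid(trial   = sprintf("trial_%03d", 1:10),
                    subject = sprintf("subj_%02d", 1:3))

# Sample 1000 rows with replacement so the combinations differ in size
dat2 <- dat1[sample(nrow(dat1), 1000, replace = TRUE), ]
dat2$x <- rnorm(nrow(dat2))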
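The asker also wondered whether the row-wise means could be computed from the original long data frame without reshaping. One possible sketch: number each observation within its id/trialNumber combination, then summarise by that position. The data frame `dat` and the column name `pos` are made up for illustration, and the sketch assumes the rows are already in the intended order within each combination.

```r
# Toy data (hypothetical values), in long format like the question's data frame
dat <- data.frame(
  x           = c(1, 2, 3, 4, 10, 20),
  id          = c(rep("subj_01", 4), rep("subj_02", 2)),
  trialNumber = "trial010"
)

# Position of each observation within its id/trialNumber combination
dat$pos <- ave(seq_along(dat$x), dat$id, dat$trialNumber, FUN = seq_along)

# Mean of x at each position, over all combinations that reach that position
pos_mean <- aggregate(x ~ pos, data = dat, FUN = mean)
```

Each row of `pos_mean` plays the role of one row of the padded wide table with the NA cells ignored, so no padding step is needed.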

[Comments]:

[Solution 2]:

You could try this (a suggestion after reading the comments above):

tapply(df$x, paste0(df$id, df$trialNumber), function(x) {
  data.frame(mean = mean(x), lower_limit = mean(x) - sd(x), upper_limit = mean(x) + sd(x))
})
$subj_01trial010
      mean lower_limit upper_limit
1 1.484871    1.479965    1.489778

$subj_03trial042
       mean lower_limit upper_limit
1 -1.370583   -1.381177    -1.35999

Or use aggregate, which gives a nicer output format:

aggregate(x ~ id + trialNumber, data = df,
          FUN = function(x) c(mean = mean(x), lower_limit = mean(x) - sd(x), upper_limit = mean(x) + sd(x)))
       id trialNumber    x.mean x.lower_limit x.upper_limit
1 subj_01    trial010  1.484871      1.479965      1.489778
2 subj_03    trial042 -1.370583     -1.381177     -1.359990
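The limits in this answer are the mean plus or minus one standard deviation. If a t-based confidence interval is wanted instead, the same aggregate call can take a different summary function. The following is a sketch under that assumption: `ci_summary` is a hypothetical helper, and `dat` holds only the twelve rows displayed in the question, not the full data set.

```r
# The twelve rows shown in the question (the full data set is not available)
dat <- data.frame(
  x           = c(1.4788, 1.4794, 1.4823, 1.4845, 1.4889, 1.4901,
                  -1.3597, -1.3601, -1.3667, -1.3713, -1.3800, -1.3857),
  id          = rep(c("subj_01", "subj_03"), each = 6),
  trialNumber = rep(c("trial010", "trial042"), each = 6)
)

# Hypothetical helper: mean and t-based confidence limits of a numeric vector
ci_summary <- function(v, level = 0.95) {
  n  <- length(v)
  hw <- qt(1 - (1 - level) / 2, df = n - 1) * sd(v) / sqrt(n)
  c(mean = mean(v), lower = mean(v) - hw, upper = mean(v) + hw)
}

agg <- aggregate(x ~ id + trialNumber, data = dat, FUN = ci_summary)

# aggregate() stores the three statistics as a single matrix column named x;
# do.call(data.frame, ...) spreads it into x.mean, x.lower and x.upper
flat <- do.call(data.frame, agg)
```

The flattening step matters when the result is passed on to code that expects ordinary columns, since `agg$x` is a matrix and `agg` itself has only three columns.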

[Comments]: