【Question title】: Split up data frame into chunks and then apply function
【Posted】: 2020-10-26 15:59:14
【Question】:

I have a (large) dataset that looks like this:

dat <- data.frame(m=c(rep("a",4),rep("b",3),rep("c",2)),
          n1 =round(rnorm(mean = 20,sd = 10,n = 9)))

g <- rnorm(20,10,5)


dat
  m     n1
1 a 15.132
2 a 17.723
3 a  3.958
4 a 19.239
5 b 11.417
6 b 12.583
7 b 32.946
8 c 11.970
9 c 26.447

I want to run a t-test against the vector g for each category of "m", for example with

n1.a <- c(15.132,17.723,3.958,19.329)

so that I can run a t-test such as t.test(n1.a, g).

I initially thought of using split(dat, dat$m) and then lapply, but it did not work.

Any ideas?

【Question comments】:

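  • Why doesn't it work?
  • Each element of the list is a list-like data frame, and I cannot extract the "n1" values from it.
  • rep(c("a", "b", "c"), 4:2)
  • @Edward, a t-test against the separate vector.
  • lapply(split(dat, dat$m), function(x) t.test(x$n1)) Is this what you are looking for?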
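
The extraction problem raised in these comments can be sketched as follows. split() returns a named list with one data frame per level of m, so the numeric values of a chunk are reached with $n1. The n1 values below are copied from the printout in the question; the seed used for g is an arbitrary assumption, so g will not match the original post.

```r
# n1 values copied from the question's printout; the seed for g is an
# arbitrary choice made for this sketch (assumption).
dat <- data.frame(m  = rep(c("a", "b", "c"), 4:2),
                  n1 = c(15.132, 17.723, 3.958, 19.239, 11.417,
                         12.583, 32.946, 11.970, 26.447))
set.seed(42)
g <- rnorm(20, 10, 5)

chunks <- split(dat, dat$m)   # named list: one data frame per level of m
str(chunks$a)                 # a 4-row data frame, not a bare numeric vector
n1_a <- chunks$a$n1           # numeric values for group "a"

# Splitting the column itself yields plain numeric vectors directly.
vecs <- split(dat$n1, dat$m)
identical(vecs$a, n1_a)
```

Either form can be handed to lapply; the only difference is whether the function receives a data frame (and must use x$n1) or a numeric vector.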

Tags: r dataframe subset lapply


【Solution 1】:

Here is a tidyverse solution using map from purrr:

library(purrr)  # provides map(); purrr also re-exports %>%

dat %>% 
  split(.$m) %>% 
  map(~ t.test(.x$n1, g))  # one Welch t-test per level of m

Alternatively, use lapply as you mentioned, which stores all the t-test results in a list (or use the shorter version with by, thanks to @markus):

chunks <- split(dat, dat$m)                         # one data frame per level of m
res <- lapply(chunks, function(x) t.test(x$n1, g))  # list of htest objects

or

res <- by(dat, dat$m, function(x) t.test(x$n1, g))

This gives us:

$a

    Welch Two Sample t-test

data:  .x$n1 and g
t = 1.5268, df = 3.0809, p-value = 0.2219
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.61161  33.64902
sample estimates:
mean of x mean of y 
  21.2500   10.2313 


$b

    Welch Two Sample t-test

data:  .x$n1 and g
t = 1.8757, df = 2.2289, p-value = 0.1883
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7.325666 20.863073
sample estimates:
mean of x mean of y 
  17.0000   10.2313 


$c

    Welch Two Sample t-test

data:  .x$n1 and g
t = 10.565, df = 19, p-value = 2.155e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  7.031598 10.505808
sample estimates:
mean of x mean of y 
  19.0000   10.2313 

【Comments】:

  • A slightly shorter version of your second option: by(dat, dat$m, function(x) t.test(x$n1))
【Solution 2】:

In base R you can do

lapply(split(dat, dat$m), function(x) t.test(x$n1, g))

Output

$a

    Welch Two Sample t-test

data:  x$n1 and g
t = 1.9586, df = 3.2603, p-value = 0.1377
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -6.033451 27.819258
sample estimates:
mean of x mean of y 
  21.0000   10.1071 


$b

    Welch Two Sample t-test

data:  x$n1 and g
t = 2.3583, df = 2.3202, p-value = 0.1249
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -5.96768 25.75349
sample estimates:
mean of x mean of y 
  20.0000   10.1071 


$c

    Welch Two Sample t-test

data:  x$n1 and g
t = 13.32, df = 15.64, p-value = 6.006e-10
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 13.77913 19.00667
sample estimates:
mean of x mean of y 
  26.5000   10.1071 
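
The printed htest objects are verbose, so it can help to collect a few components per group into one table. The following is a minimal sketch in base R, assuming the data-generating code from this answer; the names tests and summary_tab are hypothetical and chosen only for illustration.

```r
# Same data-generating steps as in this answer's data section.
set.seed(1)
dat <- data.frame(m  = rep(c("a", "b", "c"), 4:2),
                  n1 = round(rnorm(9, mean = 20, sd = 10)))
g <- rnorm(20, 10, 5)

tests <- lapply(split(dat, dat$m), function(x) t.test(x$n1, g))

# Pull selected components out of each htest object; vapply() checks that
# every group returns exactly three numeric values.
summary_tab <- t(vapply(tests, function(tt) {
  c(t = unname(tt$statistic), df = unname(tt$parameter), p.value = tt$p.value)
}, numeric(3)))
summary_tab   # one row per level of m, columns t, df and p.value
```

vapply is used here instead of sapply so that a group returning an unexpected shape raises an error rather than silently changing the result type.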

Data

set.seed(1)
dat <- data.frame(m=c(rep("a",4),rep("b",3),rep("c",2)),
          n1 =round(rnorm(mean = 20,sd = 10,n = 9)))
g <- rnorm(20,10,5)
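
Both dat and g are drawn with rnorm(), which is presumably why the two answers print different numbers; fixing the seed as above makes a run repeatable. A small sketch of that behaviour follows, where the helper name make_data is hypothetical.

```r
# Hypothetical helper: regenerates dat and g from a given seed.
make_data <- function(seed) {
  set.seed(seed)
  dat <- data.frame(m  = rep(c("a", "b", "c"), 4:2),
                    n1 = round(rnorm(9, mean = 20, sd = 10)))
  g <- rnorm(20, 10, 5)
  list(dat = dat, g = g)
}

first  <- make_data(1)
second <- make_data(1)
identical(first, second)      # same seed: same dat and g

other <- make_data(2)
identical(first$g, other$g)   # different seed: a different g
```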

【Comments】:

  • It works for the t-test, but not when I replace it with a custom function: f.t <- function(x){ z <- t.test(x,g)$p.value; return(z) }
  • If you only need the p-values, you can use lapply(split(dat, dat$m), function(x) t.test(x$n1, g)$p.value)
  • tapply(dat$n1, dat$m, FUN=t.test, y=g)
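
The custom function f.t from the first comment expects a numeric vector, whereas each element of split(dat, dat$m) is a data frame. That mismatch is a likely reason it failed, though this is an assumption because the comment shows neither the call nor the error. A minimal sketch that passes vectors instead, using the data from this answer:

```r
set.seed(1)
dat <- data.frame(m  = rep(c("a", "b", "c"), 4:2),
                  n1 = round(rnorm(9, mean = 20, sd = 10)))
g <- rnorm(20, 10, 5)

# Same idea as the function in the comment: numeric vector in, p-value out.
f.t <- function(x) {
  t.test(x, g)$p.value
}

# Option 1: split the column so each chunk is already a numeric vector.
p1 <- sapply(split(dat$n1, dat$m), f.t)

# Option 2: keep the data-frame split and extract n1 inside a wrapper.
p2 <- sapply(split(dat, dat$m), function(x) f.t(x$n1))

all.equal(p1, p2)   # both routes give the same named vector of p-values
```

The tapply call in the last comment follows the same pattern as option 1, since tapply also hands each group's values to the function as a vector.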