[Posted]: 2020-05-15 23:18:23
[Question]:
I have a table created with the following dplyr code:
Data
demo <- structure(list(`Performance-1` = c(4, 5, 3, 3, 5, 4, 4, 4, 4,
4, 5, 4, 5, 5, 3, 5, 2, 3, 3, 4, 4, 5, 4, 3, 4), `Performance-2` = c(4,
5, 3, 3, 5, 4, 4, 3, 3, 4, 5, 5, 5, 4, 3, 5, 2, 3, 3, 4, 4, 5,
4, 3, 3), Gender = c("Male", "Female", "Male", "Male", "Male",
"Female", "Male", "Female", "Male", "Male", NA, "Male", "Male",
"Male", "Male", "Male", NA, "Female", NA, "Female", "Male", "Male",
"Male", "Male", NA)), row.names = c(NA, -25L), class = c("tbl_df",
"tbl", "data.frame"))
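This is only a sample of the main data, which I cannot access here, so the results below may differ from what this sample produces.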
library(dplyr)
library(tidyr)

analysis_vars <- c("Performance-1", "Performance-2")
demo %>%
  # all_of() makes it explicit that analysis_vars is an external character vector
  pivot_longer(cols = all_of(analysis_vars), names_to = "Performance") %>%
  select(Performance, value, Gender) %>%
  filter(!is.na(Gender), Performance %in% c("Performance-1", "Performance-2")) %>%
  group_by(Gender, Performance) %>%
  summarise(mean = round(mean(value, na.rm = TRUE), 2),
            sd = round(sd(value, na.rm = TRUE), 2),
            N = sum(!is.na(value))) %>%
  pivot_wider(names_from = Gender, values_from = c(mean, sd, N))
Performance mean_Female mean_Male sd_Female sd_Male N_Female N_Male
Performance-1 4.14 4.10 0.79 0.79 428 896
Performance-2 4.00 3.91 0.87 0.86 427 897
I would like to add a grand mean and a grand standard deviation as the last row, but I can't figure out how.
When I try the following code:
demo %>%
  pivot_longer(cols = analysis_vars, names_to = "Performance") %>%
  select(Performance, value, Gender) %>%
  filter(!is.na(Gender), Performance %in% c("Performance-1", "Performance-2")) %>%
  group_by(Gender, Performance) %>%
  summarise(mean = round(mean(value, na.rm=T),2), sd = round(sd(value, na.rm=T),2), N = sum(!is.na(value))) %>%
  ungroup() %>%
  add_row(mean = sum(.$mean*.$N)/sum(.$N), sd = sum(.$N-1)*.$sd/sum(.$N)) %>%
  pivot_wider(names_from = Gender, values_from = c(mean, sd, N))
the result I get looks like this:
Performance mean_Female mean_Male sd_Female sd_Male N_Female N_Male
Performance-1 <dbl [1]> <dbl [1]> <NULL> <dbl [1]> <dbl [1]> <NULL>
Performance-2 <dbl [1]> <dbl [1]> <NULL> <dbl [1]> <dbl [1]> <NULL>
When I remove pivot_wider (the last line) to see what is going on, this is what I get. It looks as though it added rows for both genders.
Gender Performance mean sd N
Female Performance-1 4.140000 0.7900000 428
Female Performance-2 4.000000 0.8700000 427
Male Performance-1 4.100000 0.7900000 896
Male Performance-2 3.910000 0.8600000 897
NA NA 4.025978 0.7888066 NA
NA NA 4.025978 0.8686858 NA
NA NA 4.025978 0.7888066 NA
NA NA 4.025978 0.8587009 NA
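A note on the four extra rows: in `sum(.$N-1)*.$sd/sum(.$N)` only `.$N-1` is inside `sum()`, so the expression multiplies a single number by the full `sd` column and returns one value per existing row. `add_row()` then appends that many rows and recycles the scalar `mean`. The sketch below reproduces this on a toy table; `toy`, `bad`, and `good` are hypothetical names and the values are made up for illustration.

```r
library(tibble)

# Toy summary table (hypothetical values, not the data from the question)
toy <- tibble(g = c("a", "b", "c"), sd = c(1, 2, 3), N = c(10, 20, 30))

# Misplaced parenthesis: sum(N - 1) is a scalar, but sd is still a length-3
# vector, so the expression has length 3 and add_row() appends three rows.
bad <- add_row(toy, sd = sum(toy$N - 1) * toy$sd / sum(toy$N))
nrow(bad)

# Summing the whole product collapses it to one number, so one row is added.
good <- add_row(toy, sd = sum((toy$N - 1) * toy$sd) / sum(toy$N - 1))
nrow(good)
```

With three rows in `toy`, `bad` ends up with six rows and `good` with four, which mirrors the four added rows in the output above.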
So I thought maybe I should do this after pivoting:
demo %>%
  pivot_longer(cols = analysis_vars, names_to = "Performance") %>%
  select(Performance, value, Gender) %>%
  filter(!is.na(Gender), Performance %in% c("Performance-1", "Performance-2")) %>%
  group_by(Gender, Performance) %>%
  summarise(mean = round(mean(value, na.rm=T),2), sd = round(sd(value, na.rm=T),2), N = sum(!is.na(value))) %>%
  ungroup() %>%
  pivot_wider(names_from = Gender, values_from = c(mean, sd, N)) %>%
  ungroup() %>%
  add_row(mean_Male = sum(.$mean_Male*.$N_Male)/sum(.$N_Male),
          mean_Female = sum(.$mean_Female*.$N_Female)/sum(.$N_Female),
          sd_Male = sum(.$N_Male-1)*.$sd_Male/sum(.$N_Male),
          sd_Female = sum(.$N_Female-1)*.$sd_Female/sum(.$N_Female))
But I get
Error in vec_rbind(old, new) : Internal error in `vec_assign()`: `value` should have been recycled to fit `x`.
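I'm not entirely sure what this means. Is there an easier way to compute the grand mean and standard deviation?
Update
I found the mistake above: I should have used .$N_Male and .$N_Female. That fixed the error, but the code still does not produce the result I want. I have corrected the code above.
Update 2
As shown in the table above: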
Performance mean_Female mean_Male sd_Female sd_Male N_Female N_Male
Performance-1 4.14 4.10 0.79 0.79 428 896
Performance-2 4.00 3.91 0.87 0.86 427 897
Let's compute the grand mean:
Female: ((4.14*428)+(4.00*427))/(428+427)
Male: ((4.10*896)+(3.91*897))/(896+897)
Then for the sd: sqrt(((N1-1)*S1^2+(N2-1)*S2^2+(N3-1)*S3^2)/(N1+N2+N3-3))
sd_Female: ((428-1)*0.79+(427-1)*0.87)/(428+427-2)
sd_Male: ((896-1)*0.79+(897-1)*0.86)/(896+897-2)
Performance mean_Female mean_Male sd_Female sd_Male N_Female N_Male
Performance-1 4.14 4.10 0.79 0.79 428 896
Performance-2 4.00 3.91 0.87 0.86 427 897
Grand Mean 4.07 4.00 0.83 0.83
I'm not yet sure what to do with N_Male and N_Female, so either way is fine with me: null or some calculation.
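One possible way to get the expected table is to build the grand row while the summary is still in long form and pivot afterwards. This is a minimal sketch, not a definitive answer: it starts from the rounded summary values shown in the question rather than the raw data, it uses the pooled-SD formula with squared standard deviations (the `sqrt(...)` version written out above), and filling the grand-row `N` with the sum of the group sizes is an assumption. The names `summ`, `grand`, and `result` are hypothetical.

```r
library(dplyr)
library(tidyr)

# Summary values copied from the question's table, in long form
summ <- tribble(
  ~Gender,  ~Performance,    ~mean, ~sd,  ~N,
  "Female", "Performance-1", 4.14,  0.79, 428,
  "Female", "Performance-2", 4.00,  0.87, 427,
  "Male",   "Performance-1", 4.10,  0.79, 896,
  "Male",   "Performance-2", 3.91,  0.86, 897
)

# One "Grand Mean" row per gender.
# N is overwritten last so the sd and mean expressions still see the per-row N.
grand <- summ %>%
  group_by(Gender) %>%
  summarise(
    Performance = "Grand Mean",
    sd   = sqrt(sum((N - 1) * sd^2) / sum(N - 1)),  # pooled SD
    mean = sum(mean * N) / sum(N),                  # N-weighted mean
    N    = sum(N)                                   # assumption: total N
  )

# Append the grand rows, round for display, then pivot to the wide layout
result <- bind_rows(summ, grand) %>%
  mutate(mean = round(mean, 2), sd = round(sd, 2)) %>%
  pivot_wider(names_from = Gender, values_from = c(mean, sd, N))

result
```

With the values above, the last row should come out as 4.07 and 4.00 for the means and 0.83 for both standard deviations, matching the expected output. Computing the row before `pivot_wider()` keeps the formula to a single grouped `summarise()` instead of repeating it once per gender-specific column.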
[Comments]:
-
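So after running pivot_wider, do you want to add a new row that holds the column means of the mean columns and the column means of the sd columns?
-
@RonakShah, sort of. I think that for the grand mean, each row's mean has to be weighted by that row's N, because the Ns differ, using the formulas Pooled mean = (N1*M1+N2*M2+N3*M3)/(N1+N2+N3) and Pooled SD = {(N1-1)*S1+(N2-1)*S2+(N3-1)*S3}/(N1+N2+N3-3). So it is a custom calculation. Is that possible?
-
Could you make this question reproducible by adding sample data and showing the expected output, so it is easier to understand what you are trying to do?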
-
The data is the table I added above. I will update the question with the expected output.
-
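I can't see where the reproducible data is in your post. By reproducible data I mean something we can copy and paste into our R session and use to verify our answers, probably produced with dput, in your case dput(demo). Here is a good guide on how to do that: *.com/questions/5963269
Tags: r dplyr pivot-table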