【Question title】: How to calculate grand mean and standard deviation for the grouped table?
【Posted】: 2020-05-15 23:18:23
【Question description】:

I have a table created with the following dplyr code:

Data

demo <- structure(list(`Performance-1` = c(4, 5, 3, 3, 5, 4, 4, 4, 4, 
4, 5, 4, 5, 5, 3, 5, 2, 3, 3, 4, 4, 5, 4, 3, 4), `Performance-2` = c(4, 
5, 3, 3, 5, 4, 4, 3, 3, 4, 5, 5, 5, 4, 3, 5, 2, 3, 3, 4, 4, 5, 
4, 3, 3), Gender = c("Male", "Female", "Male", "Male", "Male", 
"Female", "Male", "Female", "Male", "Male", NA, "Male", "Male", 
"Male", "Male", "Male", NA, "Female", NA, "Female", "Male", "Male", 
"Male", "Male", NA)), row.names = c(NA, -25L), class = c("tbl_df", 
"tbl", "data.frame"))

This is only a sample of the main data, which I cannot access. The results below may therefore differ.

analysis_vars <- c("Performance-1", "Performance-2")

demo %>% 
  pivot_longer(cols = analysis_vars,names_to = "Performance") %>% 
  select(Performance, value, Gender) %>%
  filter(!is.na(Gender), Performance %in% c("Performance-1", "Performance-2")) %>%
  group_by(Gender, Performance) %>% 
  summarise(mean = round(mean(value, na.rm=T),2), sd = round(sd(value, na.rm=T),2), N = sum(!is.na(value))) %>%
  pivot_wider(names_from = Gender, values_from = c(mean, sd, N)) 


Performance     mean_Female     mean_Male   sd_Female   sd_Male     N_Female    N_Male
Performance-1   4.14            4.10        0.79        0.79        428         896
Performance-2   4.00            3.91        0.87        0.86        427         897

I would like to add a grand mean and a grand standard deviation as a final row, but I can't figure out how.

When I try the following code:

demo %>% 
  pivot_longer(cols = analysis_vars,names_to = "Performance") %>% 
  select(Performance, value, Gender) %>%
  filter(!is.na(Gender), Performance %in% c("Performance-1", "Performance-2")) %>%
  group_by(Gender, Performance) %>% 
  summarise(mean = round(mean(value, na.rm=T),2), sd = round(sd(value, na.rm=T),2), N = sum(!is.na(value))) %>% ungroup() %>%
  add_row(mean = sum(.$mean*.$N)/sum(.$N), sd = sum(.$N-1)*.$sd/sum(.$N)) %>%
  pivot_wider(names_from = Gender, values_from = c(mean, sd, N)) 

The result I get looks like this:

Performance     mean_Female     mean_Male   sd_Female   sd_Male     N_Female    N_Male
Performance-1   <dbl [1]>       <dbl [1]>   <NULL>      <dbl [1]>   <dbl [1]>   <NULL>  
Performance-2   <dbl [1]>       <dbl [1]>   <NULL>      <dbl [1]>   <dbl [1]>   <NULL>  

When I remove pivot_wider (the last line) to see what is happening, this is what I see. It seems to add rows for both genders.

Gender  Performance     mean        sd          N
Female  Performance-1   4.140000    0.7900000   428
Female  Performance-2   4.000000    0.8700000   427
Male    Performance-1   4.100000    0.7900000   896
Male    Performance-2   3.910000    0.8600000   897
NA      NA              4.025978    0.7888066   NA
NA      NA              4.025978    0.8686858   NA
NA      NA              4.025978    0.7888066   NA
NA      NA              4.025978    0.8587009   NA
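
One likely reading of the four extra rows, assuming add_row() recycles its arguments the way tibble() does: `sum(.$N-1)*.$sd/sum(.$N)` multiplies a scalar by the whole sd column, so it yields one value per existing row, and add_row() appends that many rows. The toy table `s` below is hypothetical (two rows copied from the Female summaries). Keeping every term inside sum() collapses the expression to a single value.

```r
library(tibble)

# Hypothetical two-row summary table (Female rows from the output above)
s <- tibble(mean = c(4.14, 4.00), sd = c(0.79, 0.87), N = c(428, 427))

# sum(N - 1) * sd / sum(N) is a vector with one element per row,
# so add_row() appends one new row per element
bad <- add_row(
  s,
  mean = sum(s$mean * s$N) / sum(s$N),
  sd   = sum(s$N - 1) * s$sd / sum(s$N)
)
nrow(bad)   # 4: two original rows plus two appended rows

# With every term inside sum(), the result is a single pooled value,
# so exactly one row is appended
good <- add_row(
  s,
  mean = sum(s$mean * s$N) / sum(s$N),
  sd   = sqrt(sum((s$N - 1) * s$sd^2) / (sum(s$N) - nrow(s)))
)
nrow(good)  # 3
```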

So I thought maybe I should do this after pivoting:

demo %>% 
  pivot_longer(cols = analysis_vars,names_to = "Performance") %>% 
  select(Performance, value, Gender) %>%
  filter(!is.na(Gender), Performance %in% c("Performance-1", "Performance-2")) %>%
  group_by(Gender, Performance) %>% 
  summarise(mean = round(mean(value, na.rm=T),2), sd = round(sd(value, na.rm=T),2), N = sum(!is.na(value))) %>% ungroup() %>% 
  pivot_wider(names_from = Gender, values_from = c(mean, sd, N)) %>% ungroup() %>%
  add_row(mean_Male = sum(.$mean_Male*.$N_Male)/sum(.$N_Male), 
          mean_Female = sum(.$mean_Female*.$N_Female)/sum(.$N_Female),
          sd_Male = sum(.$N_Male-1)*.$sd_Male/sum(.$N_Male),
          sd_Female = sum(.$N_Female-1)*.$sd_Female/sum(.$N_Female)) 

But I get

Error in vec_rbind(old, new) : Internal error in `vec_assign()`: `value` should have been recycled to fit `x`.

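I'm not entirely sure what this means. Is there a simpler way to calculate the grand mean and standard deviation?

Update

I found the error above. I should have used .$N_Male and .$N_Female. That resolved the error, but it still doesn't produce the result I want. I have fixed the code above.

Update - 2

As shown in the table above: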

Performance     mean_Female     mean_Male   sd_Female   sd_Male     N_Female    N_Male
Performance-1   4.14            4.10        0.79        0.79        428         896
Performance-2   4.00            3.91        0.87        0.86        427         897

Let's calculate the grand mean:

Female: ((4.14*428)+(4.00*427))/(428+427)
Male: ((4.10*896)+(3.91*897))/(896+897)

Then for the sd: sqrt(((N1-1)*S1^2+(N2-1)*S2^2+(N3-1)*S3^2)/(N1+N2+N3-3))

sd_Female: sqrt(((428-1)*0.79^2+(427-1)*0.87^2)/(428+427-2))
sd_Male: sqrt(((896-1)*0.79^2+(897-1)*0.86^2)/(896+897-2))

Performance     mean_Female     mean_Male   sd_Female   sd_Male     N_Female    N_Male
Performance-1   4.14            4.10        0.79        0.79        428         896
Performance-2   4.00            3.91        0.87        0.86        427         897
Grand Mean      4.07            4.00        0.83        0.83        
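
The arithmetic above can be checked directly in R. This is a minimal sketch using only the rounded summaries from the table. The helper names `grand_mean` and `pooled_sd` are made up for illustration, and the pooled SD uses the square-root formula quoted above.

```r
# Per-row summaries copied from the table above
mean_f <- c(4.14, 4.00); sd_f <- c(0.79, 0.87); n_f <- c(428, 427)
mean_m <- c(4.10, 3.91); sd_m <- c(0.79, 0.86); n_m <- c(896, 897)

# N-weighted grand mean (hypothetical helper)
grand_mean <- function(m, n) sum(m * n) / sum(n)

# Pooled SD: square root of the (N - 1)-weighted average of the variances
# (hypothetical helper)
pooled_sd <- function(s, n) sqrt(sum((n - 1) * s^2) / (sum(n) - length(n)))

round(grand_mean(mean_f, n_f), 2)  # 4.07
round(grand_mean(mean_m, n_m), 2)  # 4
round(pooled_sd(sd_f, n_f), 2)     # 0.83
round(pooled_sd(sd_m, n_m), 2)     # 0.83
```

Note that this pools the within-row variances only. It is not the SD of the combined raw values, which would also include the spread between the two row means.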

I'm not yet sure what to do with N_Male and N_Female, so I don't mind either way: null or some calculation.

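【Discussion】:

  • So after doing pivot_wider, do you want to add a new row holding the column mean of the mean columns and the column sd of the sd columns?
  • @RonakShah, sort of. I think that for the grand mean, each row's mean has to be weighted by that row's N, because the n's differ. Using the formulas Pooled mean = (N1*M1+N2*M2+N3*M3)/(N1+N2+N3) and Pooled SD = {(N1-1)*S1+(N2-1)*S2+(N3-1)*S3}/(N1+N2+N3-3). So, a custom calculation. Is that possible?
  • Could you make this question reproducible by adding sample data and showing the expected output, so that it is easier to understand what you are trying to do?
  • The data is the table I added above. I will update the question with the expected output.
  • I can't see where the reproducible data is in your post. By reproducible data I mean something we can copy-paste into our R session and use to verify our answers, probably using dput, in your case dput(demo). Here is a good guide on how to do that: *.com/questions/5963269

Tags: r dplyr pivot-table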


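【Solution 1】:

As I mentioned in the comments, the calculation needs to happen before the data is reshaped to wide format. Here are two approaches; pick whichever suits you.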

library(dplyr)
library(tidyr)

demo %>% 
   pivot_longer(cols = starts_with('Performance'),names_to = "Performance") %>% 
   select(Performance, value, Gender) %>%
   filter(!is.na(Gender), Performance %in% c("Performance-1", "Performance-2")) %>%
   group_by(Gender, Performance) %>% 
   summarise(mean = round(mean(value, na.rm=T),2), 
             sd = round(sd(value, na.rm=T),2), 
             N = sum(!is.na(value))) %>%
   mutate(gm = sum(mean * N)/sum(N), 
          gsd = sum((N - 1) * sd)/sum(N - n())) %>%
   pivot_wider(names_from = Gender, values_from = c(mean, sd, N, gm, gsd)) 


# A tibble: 2 x 11
#  Performance   mean_Female mean_Male sd_Female sd_Male N_Female N_Male gm_Female gm_Male gsd_Female gsd_Male
#  <chr>               <dbl>     <dbl>     <dbl>   <dbl>    <int>  <int>     <dbl>   <dbl>      <dbl>    <dbl>
#1 Performance-1         4        4.06      0.71    0.77        5     16       3.9    4.03       1.03    0.852
#2 Performance-2         3.8      4         0.84    0.82        5     16       3.9    4.03       1.03    0.852

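Here we can see that the gm and gsd values sit in their own columns and are repeated on every row.


The second approach, which is closer to the expected output, works in two steps.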

demo %>% 
  pivot_longer(cols = starts_with('Performance'),names_to = "Performance") %>% 
  select(Performance, value, Gender) %>%
  filter(!is.na(Gender), Performance %in% c("Performance-1", "Performance-2")) %>%
  group_by(Gender, Performance) %>% 
  summarise(mean = round(mean(value, na.rm=T),2), 
            sd = round(sd(value, na.rm=T),2), 
            N = sum(!is.na(value))) -> p


p %>% 
   pivot_wider(names_from = Gender, values_from = c(mean, sd, N)) %>%
   bind_rows(p %>%
              summarise(mean = sum(mean * N)/sum(N), 
              sd = sum((N - 1) * sd)/sum(N - n()), 
              Performance = 'Total') %>%
              pivot_wider(names_from = Gender, values_from = c(mean, sd)))



# Performance   mean_Female mean_Male sd_Female sd_Male N_Female N_Male
#  <chr>               <dbl>     <dbl>     <dbl>   <dbl>    <int>  <int>
#1 Performance-1         4        4.06      0.71   0.77         5     16
#2 Performance-2         3.8      4         0.84   0.82         5     16
#3 Total                 3.9      4.03      1.03   0.852       NA     NA
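
In the code above, `sum(N - n())` subtracts `n()` from every row before summing, so with two rows per gender the denominator becomes N1 + N2 - 4. As a minimal sketch, the same two-step layout can be combined with the square-root formula from the question and a denominator of `sum(N) - n()`. The table `p` is rebuilt by hand from the printed output, the intermediate names `gm` and `gsd` are only there to avoid clashing with the input columns, and whether this pooled statistic is the one wanted is an assumption.

```r
library(dplyr)
library(tidyr)

# Per-group summaries rebuilt by hand from the printed output above
p <- tibble(
  Gender      = c("Female", "Female", "Male", "Male"),
  Performance = c("Performance-1", "Performance-2", "Performance-1", "Performance-2"),
  mean        = c(4, 3.8, 4.06, 4),
  sd          = c(0.71, 0.84, 0.77, 0.82),
  N           = c(5L, 5L, 16L, 16L)
)

# Step 1: one "Total" row per gender, then reshape to wide
total <- p %>%
  group_by(Gender) %>%
  summarise(gm  = sum(mean * N) / sum(N),
            gsd = sqrt(sum((N - 1) * sd^2) / (sum(N) - n()))) %>%
  rename(mean = gm, sd = gsd) %>%
  mutate(Performance = "Total") %>%
  pivot_wider(names_from = Gender, values_from = c(mean, sd))

# Step 2: append the total row to the wide per-group table
out <- p %>%
  pivot_wider(names_from = Gender, values_from = c(mean, sd, N)) %>%
  bind_rows(total)
out
```

The N columns are left as NA in the Total row, matching the layout shown above.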

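【Discussion】:

  • Thanks. This looks good. I reached a similar conclusion as well and posted my own answer. But I think your second solution is better.
【Solution 2】:

This could be one approach: it uses expss for the calculations and then converts the output to a data.frame, which I think achieves your goal.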


library (expss)
library (dplyr)

demo %>% 
tidyr::gather(key,value,-Gender) %>% #get long
tab_cells(value) %>% #variable used for calculations
tab_rows(key, total(label = "Grand mean")) %>% #total gets grand total
tab_cols(Gender) %>% #variable for cols
tab_stat_fun(Mean =mean,SD = sd,N = w_n, method =list) %>% #calculations
tab_pivot()%>% #makes a table
data.frame() %>% # convert to df
select(c(1,2,5,3,6,4,7)) -> out #order cols

#tidy up names
colnames(out) <-gsub("Gender[.]","",colnames(out))
colnames(out)[1] <- "Performance"
out

【Discussion】:

【Solution 3】:

After a lot of trial and error and thought, I found a solution that seems to work. I would still welcome a more elegant one:

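    # Assumption: `var` is not defined in the original post; it is taken here
    # to be the grouping column as a symbol.
    var <- rlang::sym("Gender")
    p2 <- demo %>% pivot_longer(cols = analysis_vars, names_to = "Performance") %>% 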
        select(Performance, value, !!var) %>%
        filter(!is.na(!!var), Performance %in% c("Performance-1", "Performance-2")) %>%
        group_by(!!var, Performance) %>% 
        summarise(mean = round(mean(value, na.rm=T),2), sd = round(sd(value, na.rm=T),2), N = sum(!is.na(value))) %>% 
        mutate(gm = round(sum(mean*N)/sum(N),2),
               gsd = round(sqrt(sum((N-1)*sd^2)/sum(N-5)),2)) %>%
        pivot_wider(names_from = !!var, values_from = c(mean, sd, N, gm, gsd))
    
    
      g <- p2 %>% select(matches("gm_|gsd_"))
    
      n <- g %>% rename_all(funs(str_replace(., "gm_", "mean_"))) %>% 
        rename_all(funs(str_replace(., "gsd_", "sd_"))) %>% 
        summarise_all(mean, na.rm=T) %>% 
        add_column(Performance = "Grand Mean/SD", .before = 1)
    
    p2 <- p2 %>% 
              bind_rows(n) %>%           
              select(-starts_with("gm_"), -starts_with("gsd_"))
    

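So far this is the only approach I have been able to come up with.

I need the result in this shape so that it can go into an Excel spreadsheet as a table.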
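
The renaming step above relies on funs(), which has been deprecated since dplyr 0.8.0. As a sketch, assuming dplyr 1.0 or later, the same renaming can be written with rename_with() and base sub(). The one-row table `g` below is a hypothetical stand-in for the `g` built above.

```r
library(dplyr)

# Hypothetical one-row stand-in for the `g` table built above
g <- tibble(gm_Female = 3.9, gm_Male = 4.03, gsd_Female = 0.78, gsd_Male = 0.8)

n <- g %>%
  rename_with(~ sub("^gm_", "mean_", .x)) %>%   # gm_*  -> mean_*
  rename_with(~ sub("^gsd_", "sd_", .x)) %>%    # gsd_* -> sd_*
  mutate(Performance = "Grand Mean/SD", .before = 1)

names(n)
```

Anchoring the patterns with `^` keeps the `gm_` replacement from touching any other column that merely contains that substring.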

【Discussion】:

【Solution 4】:

Now I understand better what you want. I still think it is wise to let an existing package do the work...

      library(tables)
      
      tabular( table = (Species + 1) ~ (n = 1) + Format(digits = 2) * (Sepal.Length + Sepal.Width + Petal.Width + Petal.Length) * (mean + sd), 
               data = iris )
      #>                                                                    
      #>                 Sepal.Length      Sepal.Width      Petal.Width     
      #>  Species    n   mean         sd   mean        sd   mean        sd  
      #>  setosa      50 5.01         0.35 3.43        0.38 0.25        0.11
      #>  versicolor  50 5.94         0.52 2.77        0.31 1.33        0.20
      #>  virginica   50 6.59         0.64 2.97        0.32 2.03        0.27
      #>  All        150 5.84         0.83 3.06        0.44 1.20        0.76
      #>                   
      

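【Discussion】:

【Solution 5】:

Here is another tidyverse approach that I usually choose for this kind of problem. It is based on creating a nested tibble together with a list of filter expressions. The last filter expression is 1 > 0, which keeps all the data for the "grand mean". For the problem at hand this approach may be too verbose, but when you have more filter conditions, especially when working with different subsets of the data, or when you have many or more complex summary statistics, it should be more flexible than both the add_row and the tabular approaches.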

        library(tidyverse)
        
        # your data
        demo <- structure(list(`Performance-1` = c(4, 5, 3, 3, 5, 4, 4, 4, 4,  4, 5, 4, 5, 5, 3, 5, 2, 3, 3, 4, 4, 5, 4, 3, 4),
                               `Performance-2` = c(4, 5, 3, 3, 5, 4, 4, 3, 3, 4, 5, 5, 5, 4, 3, 5, 2, 3, 3, 4, 4, 5, 4, 3, 3),
                                Gender = c("Male", "Female", "Male", "Male", "Male", "Female", "Male", "Female", "Male", "Male",
                                           NA, "Male", "Male", "Male", "Male", "Male", NA, "Female", NA, "Female", "Male", "Male",
                                           "Male", "Male", NA)),
                          row.names = c(NA, -25L),
                          class = c("tbl_df", "tbl", "data.frame"))
        
        analysis_vars <- c("Performance-1", "Performance-2")
        
        demo_dat <- demo %>% 
          pivot_longer(cols = analysis_vars,names_to = "Performance") %>% 
          select(Performance, value, Gender) %>%
          filter(!is.na(Gender), Performance %in% c("Performance-1", "Performance-2"))
        
        # From here new approach
        res <- tibble(filter_expr = list(Male = expr(Gender == "Male"),
                                         Female = expr(Gender == "Female"),
                                        `Grand Mean`= expr(1 > 0))) %>% 
                crossing(data = list(demo_dat)) %>% 
                 mutate(id = names(filter_expr),
                        data = map2(data,
                                    filter_expr,
                                    ~ .x %>% filter(eval(.y)) %>% 
                                      group_by(Performance) %>% 
                                      summarise(mean = round(mean(value, na.rm = T), 2), 
                                                sd = round(sd(value, na.rm = T), 2), 
                                                N = sum(!is.na(value))))) %>% 
              select(-filter_expr) %>% 
              unnest(cols = data) %>% 
          pivot_wider(names_from = "Performance", values_from = c(mean, sd, N)) 
        
        res
        #> # A tibble: 3 x 7
        #>   id    `mean_Performan… `mean_Performan… `sd_Performance… `sd_Performance…
        #>   <chr>            <dbl>            <dbl>            <dbl>            <dbl>
        #> 1 Male              4.06             4                0.77             0.82
        #> 2 Fema…             4                3.8              0.71             0.84
        #> 3 Gran…             4.05             3.95             0.74             0.8 
        #> # … with 2 more variables: `N_Performance-1` <int>, `N_Performance-2` <int>
        

Created on 2020-05-17 by the reprex package (v0.3.0)

【Discussion】:
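
For the two-gender case, a shorter variant of the same idea is to stack a relabelled copy of the long data under the original, so that "Grand Mean" becomes just another group. This swaps the filter-expression list for a plain bind_rows() and is only a sketch: it assumes dplyr 1.0 or later (for `.groups`), and the name `demo_long` is made up here.

```r
library(dplyr)
library(tidyr)

# Data from the question
demo <- tibble(
  `Performance-1` = c(4, 5, 3, 3, 5, 4, 4, 4, 4, 4, 5, 4, 5, 5, 3, 5, 2, 3, 3, 4, 4, 5, 4, 3, 4),
  `Performance-2` = c(4, 5, 3, 3, 5, 4, 4, 3, 3, 4, 5, 5, 5, 4, 3, 5, 2, 3, 3, 4, 4, 5, 4, 3, 3),
  Gender = c("Male", "Female", "Male", "Male", "Male", "Female", "Male", "Female", "Male", "Male",
             NA, "Male", "Male", "Male", "Male", "Male", NA, "Female", NA, "Female", "Male", "Male",
             "Male", "Male", NA)
)

# Long format, dropping rows without a gender
demo_long <- demo %>%
  pivot_longer(cols = starts_with("Performance"), names_to = "Performance") %>%
  filter(!is.na(Gender))

# Append a copy of every row labelled "Grand Mean", then summarise as usual
res <- demo_long %>%
  bind_rows(mutate(demo_long, Gender = "Grand Mean")) %>%
  group_by(Gender, Performance) %>%
  summarise(mean = round(mean(value), 2),
            sd   = round(sd(value), 2),
            N    = n(),
            .groups = "drop") %>%
  pivot_wider(names_from = Performance, values_from = c(mean, sd, N))
res
```

Because the grand row is computed from the raw values, its sd is the SD of the combined data rather than a pooled within-group SD.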