【Question title】: Using summarize(across(..., .fns = ...)) with a multi-variate function
【Posted】: 2021-12-25 20:23:09
【Question description】:

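My problem requires summarizing data across many columns, but each column has to be summarized by a multi-variate function of three other columns.

I have a data frame with hundreds of columns that hold different statistics about a data set. Here is a smaller data frame with a similar structure.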

df <- data.frame(a1_Avg = rnorm(10), 
                 a1_Std = runif(10), 
                 a2_Avg = rnorm(10), 
                 a2_Std = runif(10), 
                 Hour = c(1.0, 1.5, 2.0, 2.25, 2.5, 2.75, 3.0, 4.0, 4.5, 5.0),
                 Measurements = c(3, 3, 6, 6, 6, 6, 10, 7, 7, 2)) %>%

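The data needs to be condensed into rows that each summarize a one-hour block. Summarizing the averages is easy: I can simply average them, because the number of measurements is consistent within an hour.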

  group_by(Hour) %>%
  summarize(across(matches("a._Avg"), ~ mean(.x), .names = "combined_{col}"),

Combining the standard deviations is trickier, because each one needs information from three separate columns. Manually, I would do this:

            combined_a1_Std = sqrt((1/n())*sum(a1_Std^2 + (a1_Avg - combined_a1_Avg)^2)),
            combined_a2_Std = sqrt((1/n())*sum(a2_Std^2 + (a2_Avg - combined_a2_Avg)^2)))
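
Written out as a formula, the manual calculation above is the following. The symbols are introduced here for illustration only: n is the number of rows in the hour, and \mu_i and \sigma_i stand for the _Avg and _Std values of row i. It assumes, as stated above, that every row in the hour is based on the same number of measurements.

```latex
\bar{\mu} = \frac{1}{n}\sum_{i=1}^{n}\mu_i,
\qquad
\sigma_{\text{combined}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\sigma_i^{2} + (\mu_i - \bar{\mu})^{2}\right)}
```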

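But this is not feasible for hundreds of columns.

Is there a simple way to do this?

Here is the full code from above, along with the desired output: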

library(dplyr)

set.seed(1)
df <- data.frame(a1_Avg = rnorm(10), 
                 a1_Std = runif(10), 
                 a2_Avg = rnorm(10), 
                 a2_Std = runif(10), 
                 Hour = c(1.0, 1.5, 2.0, 2.25, 2.5, 2.75, 3.0, 4.0, 4.5, 5.0),
                 Measurements = c(3, 3, 6, 6, 6, 6, 10, 7, 7, 2)) %>%
  mutate(Hour = floor(Hour)) %>%
  group_by(Hour) %>%
  summarize(across(matches("a._Avg"), ~ mean(.x), .names = "combined_{col}"),
            combined_a1_Std = sqrt((1/n())*sum(a1_Std^2 + (a1_Avg - combined_a1_Avg)^2)),
            combined_a2_Std = sqrt((1/n())*sum(a2_Std^2 + (a2_Avg - combined_a2_Avg)^2)))

df

   Hour combined_a1_Avg combined_a2_Avg combined_a1_Std combined_a2_Std
  <dbl>           <dbl>           <dbl>           <dbl>           <dbl>
1     1         -0.221          -0.0306           0.859           0.859
2     2          0.0672          0.819            1.17            1.17 
3     3          0.487           0.782            0.116           0.116
4     4          0.657          -0.957            0.795           0.795
5     5         -0.305           0.620            0.583           0.583

【Question discussion】:

    Tags: r dplyr tidyverse purrr


    【Solution 1】:

    One option is to loop across one set of columns and then get() the other set by replacing a substring in the column name.
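
The name substitution this relies on can be looked at in isolation first. This is only a small sketch: avg_cols, std_cols and combined_cols are illustrative names, and the two column names are taken from the example data.

```r
library(stringr)

# Column names that across() would visit one at a time via cur_column()
avg_cols <- c("a1_Avg", "a2_Avg")

# Name of the matching standard-deviation column for each average column
std_cols <- str_replace(avg_cols, "Avg", "Std")

# Name of the combined mean created by the first across() call
combined_cols <- str_c("combined_", avg_cols)
```

Inside summarize(), get() then turns each of these strings back into the column of that name for the current group.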

    library(dplyr)
    library(stringr)
    # df is the raw data frame from the Data section below
    out2 <- df %>% 
       mutate(Hour = floor(Hour)) %>%
       group_by(Hour) %>%
       # combined mean of every *_Avg column
       summarize(across(matches("a\\d+_Avg"), ~ mean(.x),
                        .names = "combined_{col}"), 
                 # for each original *_Avg column, look up the matching *_Std
                 # column and the combined mean computed above by name
                 across(matches('^a\\d+_Avg$'),
                        ~ sqrt((1/n())*sum(get(str_replace(cur_column(), "Avg", "Std"))^2 +
                               (.x - get(str_c("combined_", cur_column())))^2)), 
                        .names = "combined_{str_replace(.col, 'Avg', 'Std')}"))
    

    Checking against the OP's manual approach:

    out1 <- df %>%
       mutate(Hour = floor(Hour)) %>%
       group_by(Hour) %>%
       summarize(across(matches("a._Avg"), ~ mean(.x), .names = "combined_{col}"),
                 combined_a1_Std = sqrt((1/n())*sum(a1_Std^2 + (a1_Avg - combined_a1_Avg)^2)),
                 combined_a2_Std = sqrt((1/n())*sum(a2_Std^2 + (a2_Avg - combined_a2_Avg)^2)))
    identical(out1, out2)
    [1] TRUE
    

    Data

    set.seed(1)
    df <- data.frame(a1_Avg = rnorm(10), 
                     a1_Std = runif(10), 
                     a2_Avg = rnorm(10), 
                     a2_Std = runif(10), 
                     Hour = c(1.0, 1.5, 2.0, 2.25, 2.5, 2.75, 3.0, 4.0, 4.5, 5.0),
                     Measurements = c(3, 3, 6, 6, 6, 6, 10, 7, 7, 2))
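
A different route, which is not part of the answer above, is to reshape the data so that every variable has one Avg and one Std column, summarize once, and reshape back. The sketch below assumes that all column names follow the prefix_Avg / prefix_Std pattern and that tidyr is available; var and out3 are illustrative names.

```r
library(dplyr)
library(tidyr)

set.seed(1)
df <- data.frame(a1_Avg = rnorm(10),
                 a1_Std = runif(10),
                 a2_Avg = rnorm(10),
                 a2_Std = runif(10),
                 Hour = c(1.0, 1.5, 2.0, 2.25, 2.5, 2.75, 3.0, 4.0, 4.5, 5.0),
                 Measurements = c(3, 3, 6, 6, 6, 6, 10, 7, 7, 2))

out3 <- df %>%
  mutate(Hour = floor(Hour)) %>%
  # a1_Avg, a1_Std, ... become rows keyed by var, with columns Avg and Std
  pivot_longer(matches("^a\\d+_(Avg|Std)$"),
               names_to = c("var", ".value"),
               names_sep = "_") %>%
  group_by(Hour, var) %>%
  # Std is computed first, while Avg still refers to the per-row averages
  summarize(Std = sqrt(mean(Std^2 + (Avg - mean(Avg))^2)),
            Avg = mean(Avg),
            .groups = "drop") %>%
  # back to one row per hour, with names such as combined_a1_Avg
  pivot_wider(names_from = var,
              values_from = c(Avg, Std),
              names_glue = "combined_{var}_{.value}")
```

Because mean() and (1/n)*sum() can differ in the last floating-point digits, all.equal() is a safer comparison against the manual result than identical().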
    

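    【Discussion】:

    • The first across should contain 'a\\d+.Avg' instead of 'a._Avg'
    • @jpdugo17 You are right. That was the OP's code. It does work here, though, because . matches any character and there is only a single digit
    • Thanks! This is what I was looking for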