How to apply a summary function to two different types of data: answers

【Question title】: How to apply summary function on two different types of data
【Posted】: 2025-12-22 21:40:10
【Question description】:

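I have a data frame with many variables. Some variables contain only 0 and 1, while the other columns can contain any value.
How can I summarise df so that columns containing only 0 and 1 use "sts_1=sum(sts_1*0.25,na.rm=T)" and the other columns use "non_sts_3=mean(non_sts_3,na.rm = T)", without specifying the column names by hand?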

df <- data.frame(year=c("2014","2014","2015","2015","2015"),
                 month_=c("Jan","Jan","Jan","Jan","Feb"),
                 sts_1=c(0,1,1,1,0),
                 sts_2=c(1,0,0,1,NA),
                 non_sts_1=c(0,3,7,31,10),
                 non_sts_2=c(1,4,NA,12,6),
                 non_sts_3 = c(12,14,18,1,9))

With dplyr we can do this by typing the column names manually, using the following code:

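library(dplyr)

# Group by year and month (note: this overwrites df with its grouped version)
df <- group_by(df, year, month_)

# Mean for the free-valued columns, weighted sum for the 0/1 columns
df_aggregation <- summarise(df,
                            non_sts_1 = mean(non_sts_1, na.rm = TRUE),
                            non_sts_2 = mean(non_sts_2, na.rm = TRUE),
                            non_sts_3 = mean(non_sts_3, na.rm = TRUE),
                            sts_1 = sum(sts_1 * 0.25, na.rm = TRUE),
                            sts_2 = sum(sts_2 * 0.25, na.rm = TRUE))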

Thanks in advance...

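【Question comments】:

Is it safe to assume that a value >1 is enough to tell the two kinds of column apart? Or do we have to look for any value other than 0 and 1?
Thanks r2evans. The other columns may also contain 0 and 1; the sum formula should apply to columns that contain only 0 and 1.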

Tags: r

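【Solution 1】:

@akrun's answer is straightforward. However, if you would rather avoid unnecessary computation, you can define a function that makes the distinction itself: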

library(dplyr)
mysumm <- function(x, na.rm = FALSE) {
  if (all(x %in% 0:1)) {
    sum(x * 0.25, na.rm = na.rm)
  } else {
    mean(x, na.rm = na.rm)
  }
}

df %>%
  group_by(year, month_) %>%
  summarise_if(is.numeric, mysumm, na.rm = TRUE)
# # A tibble: 3 x 7
# # Groups:   year [?]
#     year month_ sts_1 sts_2 non_sts_1 non_sts_2 non_sts_3
#   <fctr> <fctr> <dbl> <dbl>     <dbl>     <dbl>     <dbl>
# 1   2014    Jan  0.25  0.25       1.5       2.5      13.0
# 2   2015    Feb  0.00   NaN      10.0       6.0       9.0
# 3   2015    Jan  0.50  0.25      19.0      12.0       9.5
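
In the output above, sts_2 is NaN for 2015/Feb because `NA %in% 0:1` is FALSE, so a group that holds only NA falls through to the mean() branch. A minimal sketch of an NA-tolerant variant follows; the name `mysumm_na` is made up for illustration and is not part of the original answer.

```r
# Hypothetical NA-tolerant variant of mysumm (the name is an assumption).
# NA is treated as compatible with a 0/1 column, so all-NA groups are summed.
mysumm_na <- function(x, na.rm = FALSE) {
  if (all(x %in% c(0, 1, NA))) {
    sum(x * 0.25, na.rm = na.rm)
  } else {
    mean(x, na.rm = na.rm)
  }
}

mysumm_na(c(1, 0, NA), na.rm = TRUE)  # 0/1 values plus NA: sum branch
mysumm_na(NA_real_, na.rm = TRUE)     # all-NA group: sum branch instead of NaN
mysumm_na(c(0, 3), na.rm = TRUE)      # other values: mean branch
```

Because summarise_if calls the function once per group, the 0/1 test is made group by group. A free-valued column whose values in one particular group happen to be only 0 and 1 would be summed for that group, so classifying the columns on the whole data frame first may be safer.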

【Discussion】:

【Solution 2】:

We can use summarise_all and then drop the extra columns:

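# funs() has been deprecated since dplyr 0.8.0; a named list of lambdas
# produces the same <column>_<function> output names
df %>% 
  group_by(year, month_) %>% 
  summarise_all(list(mean = ~ mean(., na.rm = TRUE),
                     sum = ~ sum(. * 0.25, na.rm = TRUE))) %>%
  select(matches("month_|non_sts.*mean|\\bsts.*sum"))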
# A tibble: 3 x 7
# Groups:   year [2]
#    year month_ non_sts_1_mean non_sts_2_mean non_sts_3_mean sts_1_sum sts_2_sum
#    <fctr> <fctr>          <dbl>          <dbl>          <dbl>     <dbl>     <dbl>
#1   2014    Jan            1.5            2.5           13.0      0.25      0.25
#2   2015    Feb           10.0            6.0            9.0      0.00      0.00
#3   2015    Jan           19.0           12.0            9.5      0.50      0.25

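If we have several sets of functions to apply to different sets of columns, another approach is to apply each function to its own block of columns and then join the results: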

library(tidyverse)
flist <- list(function(x) mean(x, na.rm = TRUE), function(x) sum(x*0.25, na.rm = TRUE))
nm1 <- c("^non_sts", "^sts")
# .x is the column-name pattern, .y the function to apply to the matching columns
map2(nm1, flist, ~ df %>%
                     group_by(year, month_) %>% 
                     summarise_at(vars(matches(.x)), .y)) %>% 
  reduce(inner_join, by = c('year', 'month_'))
# A tibble: 3 x 7
# Groups:   year [?]
#     year month_ non_sts_1 non_sts_2 non_sts_3 sts_1 sts_2
#   <fctr> <fctr>     <dbl>     <dbl>     <dbl> <dbl> <dbl>
#1   2014    Jan       1.5       2.5      13.0  0.25  0.25
#2   2015    Feb      10.0       6.0       9.0  0.00  0.00
#3   2015    Jan      19.0      12.0       9.5  0.50  0.25

Note: this approach can be used flexibly for any set of columns.
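
With dplyr 1.0.0 or later, the same name-based split can probably be written in a single summarise() call using across(). This is a sketch under the assumption that the 0/1 columns all start with "sts_" and the others with "non_sts", as in the sample data; the result name `res` is only for illustration.

```r
library(dplyr)  # across() needs dplyr >= 1.0.0

# Sample data from the question
df <- data.frame(year = c("2014", "2014", "2015", "2015", "2015"),
                 month_ = c("Jan", "Jan", "Jan", "Jan", "Feb"),
                 sts_1 = c(0, 1, 1, 1, 0),
                 sts_2 = c(1, 0, 0, 1, NA),
                 non_sts_1 = c(0, 3, 7, 31, 10),
                 non_sts_2 = c(1, 4, NA, 12, 6),
                 non_sts_3 = c(12, 14, 18, 1, 9))

res <- df %>%
  group_by(year, month_) %>%
  summarise(
    # mean for the free-valued columns
    across(starts_with("non_sts"), ~ mean(.x, na.rm = TRUE)),
    # weighted sum for the 0/1 columns; "sts_" does not match "non_sts_*"
    across(starts_with("sts_"), ~ sum(.x * 0.25, na.rm = TRUE)),
    .groups = "drop"
  )
res
```

The two prefixes must not overlap, since starts_with() only looks at the beginning of each column name.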

If we want to adapt the approach so that the 0/1 columns are detected from their values instead of their names:

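# Assumes df is the original, ungrouped data frame.
# TRUE for each value column that contains only 0, 1 or NA
l1 <- df %>% 
         summarise_at(3:7, ~ all(. %in% c(0, 1, NA))) %>% 
         unlist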
nm1 <- split(names(df)[-(1:2)], l1)

Then apply the functions as above, but drop matches and pass the column names in nm1 directly.
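
Put together, a value-based version might look like the sketch below. It is written for dplyr 1.0.0 or later with across() and all_of() in place of summarise_at; the helper name `summarise_cols` and the variables `value_cols`, `is_binary` and `res` are assumptions introduced for illustration, not part of the original answer.

```r
library(dplyr)  # across() needs dplyr >= 1.0.0
library(purrr)  # reduce()

# Sample data from the question
df <- data.frame(year = c("2014", "2014", "2015", "2015", "2015"),
                 month_ = c("Jan", "Jan", "Jan", "Jan", "Feb"),
                 sts_1 = c(0, 1, 1, 1, 0),
                 sts_2 = c(1, 0, 0, 1, NA),
                 non_sts_1 = c(0, 3, 7, 31, 10),
                 non_sts_2 = c(1, 4, NA, 12, 6),
                 non_sts_3 = c(12, 14, 18, 1, 9))

# Classify each value column once, on the whole (ungrouped) data frame
value_cols <- setdiff(names(df), c("year", "month_"))
is_binary <- vapply(df[value_cols],
                    function(x) all(x %in% c(0, 1, NA)),
                    logical(1))

# Hypothetical helper: summarise the given columns with f, per year and month
summarise_cols <- function(cols, f) {
  df %>%
    group_by(year, month_) %>%
    summarise(across(all_of(cols), f), .groups = "drop")
}

res <- list(
  summarise_cols(value_cols[!is_binary], function(x) mean(x, na.rm = TRUE)),
  summarise_cols(value_cols[is_binary], function(x) sum(x * 0.25, na.rm = TRUE))
) %>%
  reduce(inner_join, by = c("year", "month_"))
res
```

Since the classification no longer depends on column names, renaming the columns should not change which formula each one receives.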

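【Discussion】:

Thanks for the help, akrun. What if my column names change? How can we apply the sum formula to the columns that contain only 0 and 1 and the mean formula to the other columns?
@sasir I thought your function was based on the column names, i.e. sts vs non_sts.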