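Summarize a Variable by All But Group: Answers

【Question title】: Summarize a Variable by All But Group
【Posted】: 2020-05-25 22:37:18
【Question】:

I have a data.frame and need to compute the mean per "anti-group" (i.e., for each Name below, the mean over all rows that do not belong to that Name).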

Name     Month  Rate1     Rate2
Aira       1      12        23
Aira       2      18        73
Aira       3      19        45
Ben        1      53        19
Ben        2      22        87
Ben        3      19        45
Cat        1      22        87
Cat        2      67        43
Cat        3      45        32

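The output I want looks like the following, where the values of Rate1 and Rate2 are the means of the column values not found in each group. Please disregard the values; I made them up for the example. I would prefer to do this with dplyr if possible.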

Name    Rate1   Rate2
Aira    38      52.2
Ben     30.5    50.5
Cat     23.8    48.7

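Any help is much appreciated! Thank you!

PS - Thanks to Ianthe, whose question and data I copied, changing the question slightly. (Mean per group in a data.frame)

【Comments】:

What have you tried? You tagged dplyr, so have you tried summarise_all, summarise_at, etc.?
Please make a genuine attempt so that we can help with the implementation and help you along the learning curve.
Well, if I wanted it per group, I could easily do the following: df %>% group_by(Name) %>% summarize(Rate1=mean(Rate1), Rate2=mean(Rate2)) But this computes the mean of the Rate columns by group. I want to compute the mean of the Rate columns for all but the group.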

Tags: r dplyr summarize

【Solution 1】:

Here is another idea, using base R:

res1 <- do.call(rbind, lapply(unique(df$Name), function(i) colMeans(df[!df$Name %in% i, -c(1:2)])))
res1

#        Rate1    Rate2
#[1,] 38.00000 52.16667
#[2,] 30.50000 50.50000
#[3,] 23.83333 48.66667

Or, to fill in Name,

cbind.data.frame(Name = unique(df$Name), res1)

#  Name    Rate1    Rate2
#1 Aira 38.00000 52.16667
#2  Ben 30.50000 50.50000
#3  Cat 23.83333 48.66667
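
As a small check of what the one-liner does for a single name, the sketch below rebuilds the question's data and computes the anti-group mean for "Aira" only. The name aira_anti is made up for illustration, and the Rate columns are selected by name rather than by position.

```r
# Example data from the question
df <- data.frame(
  Name  = rep(c("Aira", "Ben", "Cat"), each = 3),
  Month = rep(1:3, times = 3),
  Rate1 = c(12, 18, 19, 53, 22, 19, 22, 67, 45),
  Rate2 = c(23, 73, 45, 19, 87, 45, 87, 43, 32)
)

# Anti-group mean for one name: drop that name's rows, average the rest.
# aira_anti is an illustrative name, not part of the original answer.
aira_anti <- colMeans(df[df$Name != "Aira", c("Rate1", "Rate2")])
aira_anti
```

The lapply call above repeats this step once per unique name and stacks the results with rbind.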

【Discussion】:

【Solution 2】:

One option could be:

df %>%
 mutate_at(vars(Rate1, Rate2), list(sum = ~ sum(.))) %>%
 mutate(rows = n()) %>%
 group_by(Name) %>%
 summarise(Rate1 = first((Rate1_sum - sum(Rate1))/(rows-n())),
           Rate2 = first((Rate2_sum - sum(Rate2))/(rows-n())))

  Name  Rate1 Rate2
  <chr> <dbl> <dbl>
1 Aira   38    52.2
2 Ben    30.5  50.5
3 Cat    23.8  48.7

Or in a less tidy form:

df %>%
 group_by(Name) %>%
 summarise(Rate1 = first((sum(df$Rate1) - sum(Rate1))/(nrow(df)-n())),
           Rate2 = first((sum(df$Rate2) - sum(Rate2))/(nrow(df)-n())))

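【Discussion】:

Thank you! I think your solution is the cleanest and most general. Thanks again.
It could be a little problematic in the presence of NAs, but it can be adjusted :)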
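
One possible adjustment for the NA caveat, sketched here as an assumption rather than the answerer's own fix: sum with na.rm = TRUE and divide by counts of non-missing values instead of row counts. The missing value placed in Rate1 is hypothetical and only there to exercise the code.

```r
library(dplyr)

# Example data from the question, with one hypothetical NA added
df <- data.frame(
  Name  = rep(c("Aira", "Ben", "Cat"), each = 3),
  Month = rep(1:3, times = 3),
  Rate1 = c(12, NA, 19, 53, 22, 19, 22, 67, 45),
  Rate2 = c(23, 73, 45, 19, 87, 45, 87, 43, 32)
)

# (overall sum - group sum) / (overall non-NA count - group non-NA count)
out <- df %>%
  group_by(Name) %>%
  summarise(
    Rate1 = (sum(df$Rate1, na.rm = TRUE) - sum(Rate1, na.rm = TRUE)) /
      (sum(!is.na(df$Rate1)) - sum(!is.na(Rate1))),
    Rate2 = (sum(df$Rate2, na.rm = TRUE) - sum(Rate2, na.rm = TRUE)) /
      (sum(!is.na(df$Rate2)) - sum(!is.na(Rate2)))
  )
out
# Rate1 becomes 38, 33, 25: the NA sits in Aira's rows, so only the
# Ben and Cat anti-group means change.
```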

【Solution 3】:

library(tidyverse)

# example dataset
df = read.table(text = "
Name     Month  Rate1     Rate2
Aira       1      12        23
Aira       2      18        73
Aira       3      19        45
Ben        1      53        19
Ben        2      22        87
Ben        3      19        45
Cat        1      22        87
Cat        2      67        43
Cat        3      45        32
", header=T, stringsAsFactors=F)

# function that returns means of Rates after excluding a given name
AntiGroupMean = function(x) { df %>% filter(Name != x) %>% summarise_at(vars(matches("Rate")), mean) }

df %>%
  distinct(Name) %>%                         # for each name
  mutate(v = map(Name, AntiGroupMean)) %>%   # apply the function
  unnest(v)                                  # unnest results

# # A tibble: 3 x 3
#   Name  Rate1 Rate2
#   <chr> <dbl> <dbl>
# 1 Aira   38    52.2
# 2 Ben    30.5  50.5
# 3 Cat    23.8  48.7

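【Discussion】:

【Solution 4】:

You can compute this as the mean of the group means, weighted by the number of observations in each group, but with the given row's own group receiving a weight of 0.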

library(dplyr)

df %>% 
  group_by(Name) %>% 
  summarise(n = n(), Rate1 = mean(Rate1), Rate2 = mean(Rate2)) %>% 
  mutate_at(vars(starts_with('Rate')),  ~
    sapply(Name, function(x) weighted.mean(.x, n*(Name != x))))

# A tibble: 3 x 4
  Name      n Rate1 Rate2
  <chr> <int> <dbl> <dbl>
1 Aira      3  38    52.2
2 Ben       3  30.5  50.5
3 Cat       3  23.8  48.7
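
To see why the zero-weight trick gives the anti-group mean, here is a minimal base R sketch for Rate1 and the name "Aira" only. The variable names (means, sizes, w, aira_rate1) are made up for illustration.

```r
# Example data from the question (Rate1 only)
df <- data.frame(
  Name  = rep(c("Aira", "Ben", "Cat"), each = 3),
  Rate1 = c(12, 18, 19, 53, 22, 19, 22, 67, 45)
)

# Per-group means and group sizes as named vectors
means <- sapply(split(df$Rate1, df$Name), mean)
sizes <- sapply(split(df$Rate1, df$Name), length)

# Weight each group by its size, except the excluded group, which gets 0
w <- sizes * (names(means) != "Aira")

# (0 * 49/3 + 3 * 94/3 + 3 * 134/3) / 6 = 228 / 6 = 38
aira_rate1 <- weighted.mean(means, w)
aira_rate1
```

Multiplying each group mean by its size recovers the group sum, so the weighted mean equals the sum over the remaining rows divided by their count.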

【Discussion】:

【Solution 5】:

You could try:

library(dplyr)

df %>%
  mutate_at(
    vars(contains('Rate')),
    ~ sapply(1:n(), function(x) mean(.[Name %in% setdiff(unique(df$Name), Name[x])], na.rm = TRUE)
             )
  ) %>%
  distinct_at(vars(-Month))

Output:

  Name    Rate1    Rate2
1 Aira 38.00000 52.16667
2  Ben 30.50000 50.50000
3  Cat 23.83333 48.66667

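(Although one of the other solutions may be preferable, since running sapply over the rows will be very slow on larger datasets)

【Discussion】:

【Solution 6】:

We can use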

library(dplyr)
library(purrr)
map_dfr(unique(df1$Name), ~ 
   anti_join(df1, tibble(Name = .x)) %>% 
   summarise_at(vars(starts_with('Rate')), mean) %>%
   mutate(Name = .x)) %>%
   select(Name, everything())
#    Name    Rate1    Rate2
#1 Aira 38.00000 52.16667
#2  Ben 30.50000 50.50000
#3  Cat 23.83333 48.66667

Data

df1 <- structure(list(Name = c("Aira", "Aira", "Aira", "Ben", "Ben", 
"Ben", "Cat", "Cat", "Cat"), Month = c(1L, 2L, 3L, 1L, 2L, 3L, 
1L, 2L, 3L), Rate1 = c(12L, 18L, 19L, 53L, 22L, 19L, 22L, 67L, 
45L), Rate2 = c(23L, 73L, 45L, 19L, 87L, 45L, 87L, 43L, 32L)), 
 class = "data.frame", row.names = c(NA, 
-9L))

【Discussion】:

Sorry, this code creates a tibble that does not match my desired solution.