Using dplyr's summarise on a single column, but with multiple parameter values

【Question title】: Using dplyr's summarise on single column, but with multiple parameter values
【Posted】: 2019-02-04 23:24:43
【Question description】:

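Apologies for the not-so-clear title (help with it is welcome); hopefully the example below clarifies things. I have the following data frame of basketball shot results (1 row == 1 basketball shot):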

> dput(zed)
structure(list(shooterTeamAlias = c("DUKE", "DUKE", "BC", "DUKE", 
"DUKE", "DUKE", "DUKE", "DUKE", "DUKE", "BC", "BC", "BC", "DUKE", 
"BC", "BC", "DUKE", "DUKE", "DUKE", "BC", "DUKE"), distanceCategory = c("sht2", 
"sht2", "sht3", "atr2", "mid2", "sht2", "lng3", "sht3", "atr2", 
"sht3", "sht3", "sht2", "mid2", "sht3", "sht3", "sht3", "atr2", 
"atr2", "sht2", "mid2"), eventType = c("twopointmiss", "twopointmade", 
"threepointmade", "twopointmade", "twopointmiss", "twopointmade", 
"threepointmiss", "threepointmiss", "twopointmade", "threepointmiss", 
"threepointmade", "twopointmiss", "twopointmade", "threepointmiss", 
"threepointmade", "threepointmiss", "twopointmade", "twopointmade", 
"twopointmade", "twopointmade")), row.names = c(NA, 20L), class = "data.frame")

> zed
   shooterTeamAlias distanceCategory      eventType
1              DUKE             sht2   twopointmiss
2              DUKE             sht2   twopointmade
3                BC             sht3 threepointmade
4              DUKE             atr2   twopointmade
5              DUKE             mid2   twopointmiss
6              DUKE             sht2   twopointmade
7              DUKE             lng3 threepointmiss
8              DUKE             sht3 threepointmiss
9              DUKE             atr2   twopointmade
10               BC             sht3 threepointmiss
11               BC             sht3 threepointmade
12               BC             sht2   twopointmiss
13             DUKE             mid2   twopointmade
14               BC             sht3 threepointmiss
15               BC             sht3 threepointmade
16             DUKE             sht3 threepointmiss
17             DUKE             atr2   twopointmade
18             DUKE             atr2   twopointmade
19               BC             sht2   twopointmade
20             DUKE             mid2   twopointmade

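This data frame is currently in tidy format, and I need to group_by team and then make it wide ("fat"). The full data has 6 distanceCategories: atr2, sht2, mid2, lng2, sht3, lng3 (the example above only has 5), plus 2 categories that are functions of the other 6: all2 is atr2, sht2, lng2, mid2 combined, and all3 is sht3, lng3 combined. For each of these 8 categories, I want columns for makes, attempts, pct, and attempt frequency. I use the eventType column to determine whether the shot was made. I am currently doing it like this: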

fat.data <- {zed %>%
    dplyr::group_by(shooterTeamAlias) %>%
    dplyr::summarise(

      shotsCount = n(),
      # Shooting By Distance Stats
      atr2Made = sum(distanceCategory == "atr2" & eventType == "twopointmade"),
      atr2Att = sum(distanceCategory == "atr2" & eventType %in% c("twopointmiss", "twopointmade")),
      atr2AttFreq = atr2Att / shotsCount,
      atr2Pct = ifelse(atr2Att > 0, atr2Made / atr2Att, 0),

      sht2Made = sum(distanceCategory == "sht2" & eventType == "twopointmade"),
      sht2Att = sum(distanceCategory == "sht2" & eventType %in% c("twopointmiss", "twopointmade")),
      sht2AttFreq = sht2Att / shotsCount, 
      sht2Pct = ifelse(sht2Att > 0, sht2Made / sht2Att, 0),

      mid2Made = sum(distanceCategory == "mid2" & eventType == "twopointmade"),
      mid2Att = sum(distanceCategory == "mid2" & eventType %in% c("twopointmiss", "twopointmade")),
      mid2AttFreq = mid2Att / shotsCount,
      mid2Pct = ifelse(mid2Att > 0, mid2Made / mid2Att, 0),

      lng2Made = sum(distanceCategory == "lng2" & eventType == "twopointmade"),
      lng2Att = sum(distanceCategory == "lng2" & eventType %in% c("twopointmiss", "twopointmade")),
      lng2AttFreq = lng2Att / shotsCount,
      lng2Pct = ifelse(lng2Att > 0, lng2Made / lng2Att, 0),

      all2Made = sum(atr2Made, sht2Made, mid2Made, lng2Made),
      all2Att = sum(atr2Att, sht2Att, mid2Att, lng2Att),
      all2AttFreq = all2Att / shotsCount,
      all2Pct = ifelse(all2Att > 0, all2Made / all2Att, 0),

      sht3Made = sum(distanceCategory == "sht3" & eventType == "threepointmade"),
      sht3Att = sum(distanceCategory == "sht3" & eventType %in% c("threepointmiss", "threepointmade")),
      sht3AttFreq = sht3Att / shotsCount,
      sht3Pct = ifelse(sht3Att > 0, sht3Made / sht3Att, 0),

      lng3Made = sum(distanceCategory == "lng3" & eventType == "threepointmade"),
      lng3Att = sum(distanceCategory == "lng3" & eventType %in% c("threepointmiss", "threepointmade")),
      lng3AttFreq = lng3Att / shotsCount,
      lng3Pct = ifelse(lng3Att > 0, lng3Made / lng3Att, 0),

      all3Made = sum(sht3Made, lng3Made),
      all3Att = sum(sht3Att, lng3Att),
      all3AttFreq = all3Att / shotsCount,
      all3Pct = ifelse(all3Att > 0, all3Made / all3Att, 0))}

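...For the 6 categories that appear in the data (everything except all2 and all3), the 4 columns are all computed the same way. As you can see, all2 and all3 are computed somewhat differently.

Setting aside the all2 and all3 categories for now, is there a better way to compute makes, attempts, pct, and attempt frequency for the 6 categories in the data? With 8 categories * 4 column types == 32 columns here it is not too bad, but I have another similar instance with 21 categories * 4 column types, and I have to do this several times in my code.

I am not sure whether dplyr::group_by plus dplyr::summarise is my best option (obviously it is what I currently use), or whether there is a better way to approach this. Improving this code, and possibly speeding it up, is crucial for my project. Any help is appreciated, and even if this gets answered within the next 2 days I will try to remember to put a bounty on this post.

EDIT!!!: I just realized it may be easier to group by distanceCategory first, compute the 4 stats for each distanceCategory, and then reshape that data frame into the fat format, which would give what I am currently computing. Roughly as follows: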

zed %>% 
  dplyr::group_by(shooterTeamAlias, distanceCategory) %>%
  dplyr::summarise(
    attempts = ...,
    makes = ...,
    pct = ...,
    attfreq = ...
  ) %>%
  tidyr::spread(...)

Thanks!!

【Question comments】:

Tags: r group-by dplyr tidyr data-manipulation

【Solution 1】:

It looks like this can be simplified by grouping by distanceCategory and then applying the same logic to each group:

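library(tidyverse)
zed %>%
  group_by(shooterTeamAlias, distanceCategory) %>%
  summarize(att = n(),   # n() counts how many rows in this group
            made = sum(eventType %>% str_detect("made")),   # matches both "twopointmade" and "threepointmade"
            pct = if_else(att > 0, made / att, 0)) %>%
  mutate(freq = att / sum(att))   # result is still grouped by team, so this is the share of the team's attempts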

# A tibble: 7 x 6
# Groups:   shooterTeamAlias [2]
  shooterTeamAlias distanceCategory   att  made   pct   freq
  <chr>            <chr>            <int> <int> <dbl>  <dbl>
1 BC               sht2                 2     1 0.5   0.286 
2 BC               sht3                 5     3 0.6   0.714 
3 DUKE             atr2                 4     4 1     0.308 
4 DUKE             lng3                 1     0 0     0.0769
5 DUKE             mid2                 3     2 0.667 0.231 
6 DUKE             sht2                 3     2 0.667 0.231 
7 DUKE             sht3                 2     0 0     0.154
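
The answer above leaves out the all2 and all3 categories from the question. One possible way to get them with the same logic is to relabel each shot before grouping. The following is only a sketch: it assumes every distanceCategory ends in "2" or "3" (true for the six categories listed in the question), and the name `all_stats` is made up for illustration. The sample data is rebuilt inline so the block runs on its own.

```r
library(dplyr)
library(stringr)

# Sample data from the question (the same 20 shots as `zed`).
zed <- data.frame(
  shooterTeamAlias = c("DUKE", "DUKE", "BC", "DUKE", "DUKE", "DUKE", "DUKE",
                       "DUKE", "DUKE", "BC", "BC", "BC", "DUKE", "BC", "BC",
                       "DUKE", "DUKE", "DUKE", "BC", "DUKE"),
  distanceCategory = c("sht2", "sht2", "sht3", "atr2", "mid2", "sht2", "lng3",
                       "sht3", "atr2", "sht3", "sht3", "sht2", "mid2", "sht3",
                       "sht3", "sht3", "atr2", "atr2", "sht2", "mid2"),
  eventType = c("twopointmiss", "twopointmade", "threepointmade",
                "twopointmade", "twopointmiss", "twopointmade",
                "threepointmiss", "threepointmiss", "twopointmade",
                "threepointmiss", "threepointmade", "twopointmiss",
                "twopointmade", "threepointmiss", "threepointmade",
                "threepointmiss", "twopointmade", "twopointmade",
                "twopointmade", "twopointmade"),
  stringsAsFactors = FALSE
)

# Assumption: the last character of the category ("2" or "3") identifies
# the shot type, so "atr2" becomes "all2", "sht3" becomes "all3", and so on.
all_stats <- zed %>%
  mutate(distanceCategory = paste0("all", str_sub(distanceCategory, -1))) %>%
  group_by(shooterTeamAlias, distanceCategory) %>%
  summarise(att = n(),                                  # attempts in this group
            made = sum(str_detect(eventType, "made")),  # made shots in this group
            pct = if_else(att > 0, made / att, 0),
            .groups = "drop_last") %>%                  # keep grouping by team only
  mutate(freq = att / sum(att)) %>%                     # share of the team's attempts
  ungroup()

print(all_stats)
```

Since `all_stats` has the same columns as the per-category summary, the two could be stacked with `dplyr::bind_rows()` before reshaping. The `.groups` argument requires dplyr 1.0.0 or later.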

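If you want the wide format, you can first gather the statistics from the per-category summary, unite the distance category with the statistic name, and then spread:

[same group_by/summarize/mutate code as above] %>%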
  gather(stat, value, -distanceCategory, -shooterTeamAlias) %>%
  unite(stat, distanceCategory, stat) %>%
  spread(stat, value)

# A tibble: 2 x 21
# Groups:   shooterTeamAlias [2]
  shooterTeamAlias atr2_att atr2_freq atr2_made atr2_pct lng3_att lng3_freq lng3_made lng3_pct mid2_att mid2_freq mid2_made mid2_pct sht2_att sht2_freq sht2_made sht2_pct sht3_att sht3_freq sht3_made sht3_pct
  <chr>               <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl>
1 BC                     NA    NA            NA       NA       NA   NA             NA       NA       NA    NA            NA   NA            2     0.286         1    0.5          5     0.714         3      0.6
2 DUKE                    4     0.308         4        1        1    0.0769         0        0        3     0.231         2    0.667        3     0.231         2    0.667        2     0.154         0      0
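
In the wide output above, BC gets NA for the categories it never attempted, whereas the code in the question yields 0 in those cells. As a hedged alternative to gather/unite/spread, newer tidyr versions provide `pivot_wider()`, which can build the combined column names and fill the missing cells in one step. This sketch assumes tidyr 1.1.0 or later and dplyr 1.0.0 or later; `long_stats` and `wide_stats` are illustrative names, and the sample data is rebuilt inline so the block runs on its own.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Sample data from the question (the same 20 shots as `zed`).
zed <- data.frame(
  shooterTeamAlias = c("DUKE", "DUKE", "BC", "DUKE", "DUKE", "DUKE", "DUKE",
                       "DUKE", "DUKE", "BC", "BC", "BC", "DUKE", "BC", "BC",
                       "DUKE", "DUKE", "DUKE", "BC", "DUKE"),
  distanceCategory = c("sht2", "sht2", "sht3", "atr2", "mid2", "sht2", "lng3",
                       "sht3", "atr2", "sht3", "sht3", "sht2", "mid2", "sht3",
                       "sht3", "sht3", "atr2", "atr2", "sht2", "mid2"),
  eventType = c("twopointmiss", "twopointmade", "threepointmade",
                "twopointmade", "twopointmiss", "twopointmade",
                "threepointmiss", "threepointmiss", "twopointmade",
                "threepointmiss", "threepointmade", "twopointmiss",
                "twopointmade", "threepointmiss", "threepointmade",
                "threepointmiss", "twopointmade", "twopointmade",
                "twopointmade", "twopointmade"),
  stringsAsFactors = FALSE
)

# Long-format summary: one row per team and distance category.
long_stats <- zed %>%
  group_by(shooterTeamAlias, distanceCategory) %>%
  summarise(att = n(),
            made = sum(str_detect(eventType, "made")),
            pct = if_else(att > 0, made / att, 0),
            .groups = "drop_last") %>%
  mutate(freq = att / sum(att)) %>%
  ungroup()

# Wide format: one row per team, one column per category/statistic pair.
# names_glue produces names such as "atr2_att"; values_fill = 0 replaces the
# NA cells for categories a team never attempted.
wide_stats <- long_stats %>%
  pivot_wider(names_from = distanceCategory,
              values_from = c(att, made, pct, freq),
              names_glue = "{distanceCategory}_{.value}",
              values_fill = 0)

print(wide_stats)
```

Filling with 0 mirrors the `ifelse(..., 0)` convention in the question. If "no attempts" should stay distinguishable from "0 percent", dropping `values_fill` keeps the NA values.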

【Comments】: