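【Posted】: 2019-02-04 23:24:43
【Question】:
Apologies for the unclear title (I could use help with it). Hopefully the example below clarifies things. I have the following dataframe of basketball shot results (1 row == 1 basketball shot):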
> dput(zed)
structure(list(shooterTeamAlias = c("DUKE", "DUKE", "BC", "DUKE",
"DUKE", "DUKE", "DUKE", "DUKE", "DUKE", "BC", "BC", "BC", "DUKE",
"BC", "BC", "DUKE", "DUKE", "DUKE", "BC", "DUKE"), distanceCategory = c("sht2",
"sht2", "sht3", "atr2", "mid2", "sht2", "lng3", "sht3", "atr2",
"sht3", "sht3", "sht2", "mid2", "sht3", "sht3", "sht3", "atr2",
"atr2", "sht2", "mid2"), eventType = c("twopointmiss", "twopointmade",
"threepointmade", "twopointmade", "twopointmiss", "twopointmade",
"threepointmiss", "threepointmiss", "twopointmade", "threepointmiss",
"threepointmade", "twopointmiss", "twopointmade", "threepointmiss",
"threepointmade", "threepointmiss", "twopointmade", "twopointmade",
"twopointmade", "twopointmade")), row.names = c(NA, 20L), class = "data.frame")
> zed
   shooterTeamAlias distanceCategory      eventType
1              DUKE             sht2   twopointmiss
2              DUKE             sht2   twopointmade
3                BC             sht3 threepointmade
4              DUKE             atr2   twopointmade
5              DUKE             mid2   twopointmiss
6              DUKE             sht2   twopointmade
7              DUKE             lng3 threepointmiss
8              DUKE             sht3 threepointmiss
9              DUKE             atr2   twopointmade
10               BC             sht3 threepointmiss
11               BC             sht3 threepointmade
12               BC             sht2   twopointmiss
13             DUKE             mid2   twopointmade
14               BC             sht3 threepointmiss
15               BC             sht3 threepointmade
16             DUKE             sht3 threepointmiss
17             DUKE             atr2   twopointmade
18             DUKE             atr2   twopointmade
19               BC             sht2   twopointmade
20             DUKE             mid2   twopointmade
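This dataframe is currently in a tidy format, and I need to group_by team and then make it wide ("fat"). The full data has 6 distanceCategories: atr2, sht2, mid2, lng2, sht3, lng3 (the example above only has 5), plus 2 categories that are functions of the other 6: all2 is atr2, sht2, lng2, mid2 combined, and all3 is sht3, lng3 combined. Then, for each of these 8 categories, I want a column for makes, attempts, pct, and attempt frequency. I use the eventType column to determine whether the shot was made. I am currently doing it like this: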
fat.data <- zed %>%
  dplyr::group_by(shooterTeamAlias) %>%
  dplyr::summarise(
    shotsCount = n(),
    # Shooting By Distance Stats
    atr2Made = sum(distanceCategory == "atr2" & eventType == "twopointmade"),
    atr2Att = sum(distanceCategory == "atr2" & eventType %in% c("twopointmiss", "twopointmade")),
    atr2AttFreq = atr2Att / shotsCount,
    atr2Pct = ifelse(atr2Att > 0, atr2Made / atr2Att, 0),
    sht2Made = sum(distanceCategory == "sht2" & eventType == "twopointmade"),
    sht2Att = sum(distanceCategory == "sht2" & eventType %in% c("twopointmiss", "twopointmade")),
    sht2AttFreq = sht2Att / shotsCount,
    sht2Pct = ifelse(sht2Att > 0, sht2Made / sht2Att, 0),
    mid2Made = sum(distanceCategory == "mid2" & eventType == "twopointmade"),
    mid2Att = sum(distanceCategory == "mid2" & eventType %in% c("twopointmiss", "twopointmade")),
    mid2AttFreq = mid2Att / shotsCount,
    mid2Pct = ifelse(mid2Att > 0, mid2Made / mid2Att, 0),
    lng2Made = sum(distanceCategory == "lng2" & eventType == "twopointmade"),
    lng2Att = sum(distanceCategory == "lng2" & eventType %in% c("twopointmiss", "twopointmade")),
    lng2AttFreq = lng2Att / shotsCount,
    lng2Pct = ifelse(lng2Att > 0, lng2Made / lng2Att, 0),
    all2Made = sum(atr2Made, sht2Made, mid2Made, lng2Made),
    all2Att = sum(atr2Att, sht2Att, mid2Att, lng2Att),
    all2AttFreq = all2Att / shotsCount,
    all2Pct = ifelse(all2Att > 0, all2Made / all2Att, 0),
    sht3Made = sum(distanceCategory == "sht3" & eventType == "threepointmade"),
    sht3Att = sum(distanceCategory == "sht3" & eventType %in% c("threepointmiss", "threepointmade")),
    sht3AttFreq = sht3Att / shotsCount,
    sht3Pct = ifelse(sht3Att > 0, sht3Made / sht3Att, 0),
    lng3Made = sum(distanceCategory == "lng3" & eventType == "threepointmade"),
    lng3Att = sum(distanceCategory == "lng3" & eventType %in% c("threepointmiss", "threepointmade")),
    lng3AttFreq = lng3Att / shotsCount,
    lng3Pct = ifelse(lng3Att > 0, lng3Made / lng3Att, 0),
    all3Made = sum(sht3Made, lng3Made),
    all3Att = sum(sht3Att, lng3Att),
    all3AttFreq = all3Att / shotsCount,
    all3Pct = ifelse(all3Att > 0, all3Made / all3Att, 0))
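...for the 6 categories that appear in the data (all of them except all2 and all3), the 4 columns are computed in exactly the same way. As you can see, all2 and all3 are computed a bit differently.
Setting the all2 and all3 categories aside for now, is there a better way to compute makes, attempts, percentage, and attempt frequency for the 6 categories in the data? For the 8 categories * 4 column types == 32 columns here it's not too bad, but I have another similar case with 21 categories * 4 column types, and I have to do this multiple times in my code.
I'm not sure whether dplyr::group_by plus dplyr::summarise is my best option (obviously it's what I'm using currently), or whether there is a better way to approach this. Improving this code, and possibly speeding it up, is crucial for my project, so any help is appreciated. Even if this gets answered within the next 2 days, I'll try to remember to put a bounty on this post.
EDIT!!!: I just realized that it may be easier to first group by distanceCategory as well, compute the 4 stats for each distanceCategory, and then reshape that dataframe into this fat format... which is what I'm currently working out. Roughly something like: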
zed %>%
  dplyr::group_by(shooterTeamAlias, distanceCategory) %>%
  dplyr::summarise(
    attempts = ...,
    makes = ...,
    pct = ...,
    attfreq = ...
  ) %>%
  tidyr::spread(...)
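A minimal sketch of how the outline above might be filled in. It rests on several assumptions that are not in my original code: dplyr >= 1.0.0 and tidyr >= 1.1.0 are available, so tidyr::pivot_wider() with its names_glue argument stands in for tidyr::spread(); every row in a category is a made or missed attempt of that category, so attempts are simply row counts; and the names all_cats, long_stats and fat_stats are purely illustrative. The all2 and all3 categories are left out.

```r
library(dplyr)
library(tidyr)

# The same 20 shots as in the dput() output above.
zed <- data.frame(
  shooterTeamAlias = c("DUKE", "DUKE", "BC", "DUKE", "DUKE", "DUKE", "DUKE",
                       "DUKE", "DUKE", "BC", "BC", "BC", "DUKE", "BC", "BC",
                       "DUKE", "DUKE", "DUKE", "BC", "DUKE"),
  distanceCategory = c("sht2", "sht2", "sht3", "atr2", "mid2", "sht2", "lng3",
                       "sht3", "atr2", "sht3", "sht3", "sht2", "mid2", "sht3",
                       "sht3", "sht3", "atr2", "atr2", "sht2", "mid2"),
  eventType = c("twopointmiss", "twopointmade", "threepointmade",
                "twopointmade", "twopointmiss", "twopointmade",
                "threepointmiss", "threepointmiss", "twopointmade",
                "threepointmiss", "threepointmade", "twopointmiss",
                "twopointmade", "threepointmiss", "threepointmade",
                "threepointmiss", "twopointmade", "twopointmade",
                "twopointmade", "twopointmade"),
  stringsAsFactors = FALSE
)

# Illustrative: the full set of base categories, so that categories absent
# from a team's data (lng2 in this sample) still get zero-filled columns.
all_cats <- c("atr2", "sht2", "mid2", "lng2", "sht3", "lng3")

long_stats <- zed %>%
  # Assumption: a shot counts as made when eventType is one of these two values.
  mutate(made = eventType %in% c("twopointmade", "threepointmade")) %>%
  # Total shots per team, needed later for the attempt frequency.
  group_by(shooterTeamAlias) %>%
  mutate(shotsCount = n()) %>%
  # One row per team and category with makes and attempts.
  group_by(shooterTeamAlias, shotsCount, distanceCategory) %>%
  summarise(Made = sum(made), Att = n(), .groups = "drop") %>%
  # Add zero rows for team/category combinations that never occur.
  complete(nesting(shooterTeamAlias, shotsCount),
           distanceCategory = all_cats,
           fill = list(Made = 0L, Att = 0L)) %>%
  mutate(AttFreq = Att / shotsCount,
         Pct = ifelse(Att > 0, Made / Att, 0))

# Reshape to one row per team, with columns such as atr2Made or sht3Pct.
fat_stats <- long_stats %>%
  pivot_wider(
    id_cols = c(shooterTeamAlias, shotsCount),
    names_from = distanceCategory,
    values_from = c(Made, Att, AttFreq, Pct),
    names_glue = "{distanceCategory}{.value}"
  )
```

With this shape, going from 6 to 21 categories would only mean changing all_cats rather than adding 4 more lines per category, though I have not checked whether it is actually faster than the single summarise() call.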
Thanks!!
【Discussion】:
Tags: r group-by dplyr tidyr data-manipulation