【Question title】: Using dplyr to count multiple group-by variables
【Posted】: 2019-04-26 14:03:01
【Question】:

I have a dataset containing several categorical variables:

library(dplyr)

# tibble() replaces the deprecated data_frame()
data <- tibble(
  HomeTeam = c("Team1", "Team2", "Team3", "Team4", "Team2", "Team2", "Team4",
               "Team3", "Team2", "Team1", "Team3", "Team2"),
  AwayTeam = c("Team2", "Team1", "Team4", "Team3", "Team1", "Team4", "Team1",
               "Team2", "Team3", "Team3", "Team4", "Team1"),
  HomeScore = c(10, 5, 12, 18, 17, 19, 23, 17, 34, 19, 8, 3),
  AwayScore = c(4, 16, 9, 19, 16, 4, 8, 21, 6, 5, 9, 17),
  Venue = c("Ground1", "Ground2", "Ground3", "Ground3", "Ground1", "Ground2",
            "Ground1", "Ground3", "Ground2", "Ground3", "Ground4", "Ground2"))

I basically want to summarise "HomeTeam" and "AwayTeam" by count into a new table, like this:

 HomeTeam NumberOfGamesHome NumberOfGamesAway
 <chr>                <int>             <int>
 1 Team1                    2                 4
 2 Team2                    5                 2
 3 Team3                    3                 3
 4 Team4                    2                 3

My current approach needs two separate group-by steps, followed by a join of the resulting tables:

HomeTeamCount <- data %>%
  group_by(HomeTeam) %>%
  summarise(NumberOfGamesHome = n())

AwayTeamCount <- data %>%
  group_by(AwayTeam) %>%
  summarise(NumberOfGamesAway = n())

Desired <- left_join(HomeTeamCount, AwayTeamCount,
                     by = c("HomeTeam" = "AwayTeam"))

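My real dataset has a large number of categorical variables, so repeating the approach above for each of them seems laborious and inefficient.

Is there a way to use dplyr to group by multiple categorical variables and produce the desired output? Or perhaps data.table?

I have looked at several other questions, such as here and here, but have not been able to find an answer.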

【Question comments】:

    Tags: r group-by count dplyr


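    【Solution 1】:

    Here is one possibility: use gather to reshape the data from wide to long, then group by team and sum up the number of home and away games.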

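    library(tidyverse)
    # gather() without a column selection reshapes every column, so this
    # version assumes data contains only HomeTeam and AwayTeam
    data %>%
        gather(key, Team) %>%
        group_by(Team) %>%
        summarise(
            NumberOfGamesHome = sum(key == "HomeTeam"),
            NumberOfGamesAway = sum(key == "AwayTeam"))
    ## A tibble: 4 x 3
    #  Team  NumberOfGamesHome NumberOfGamesAway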
    #  <chr>             <int>             <int>
    #1 Team1                 2                 4
    #2 Team2                 5                 2
    #3 Team3                 3                 3
    #4 Team4                 2                 3
    

    Update

    To handle the updated example data, which has additional columns, you can restrict gather to the two team columns:

    data %>%
        gather(key, Team, HomeTeam, AwayTeam) %>%
        group_by(Team) %>%
        summarise(
            NumberOfGamesHome = sum(key == "HomeTeam"),
            NumberOfGamesAway = sum(key == "AwayTeam"))
    ## A tibble: 4 x 3
    #  Team  NumberOfGamesHome NumberOfGamesAway
    #  <chr>             <int>             <int>
    #1 Team1                 2                 4
    #2 Team2                 5                 2
    #3 Team3                 3                 3
    #4 Team4                 2                 3
    
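    The tidyr documentation marks gather as superseded by pivot_longer, so the same idea can also be sketched with the newer verbs. This is only an illustration, not part of the original answer: it assumes tidyr 1.0.0 or later, rebuilds the question's example data, and the result name counts is made up for the example. It reshapes the two team columns to long form, counts each Team/key pair, and spreads the counts back out to one column per original variable.

```r
library(dplyr)
library(tidyr)

# Example data copied from the question
data <- tibble(
  HomeTeam = c("Team1", "Team2", "Team3", "Team4", "Team2", "Team2", "Team4",
               "Team3", "Team2", "Team1", "Team3", "Team2"),
  AwayTeam = c("Team2", "Team1", "Team4", "Team3", "Team1", "Team4", "Team1",
               "Team2", "Team3", "Team3", "Team4", "Team1"),
  HomeScore = c(10, 5, 12, 18, 17, 19, 23, 17, 34, 19, 8, 3),
  AwayScore = c(4, 16, 9, 19, 16, 4, 8, 21, 6, 5, 9, 17),
  Venue = c("Ground1", "Ground2", "Ground3", "Ground3", "Ground1", "Ground2",
            "Ground1", "Ground3", "Ground2", "Ground3", "Ground4", "Ground2"))

# "counts" is a hypothetical name chosen for this sketch
counts <- data %>%
  # Reshape only the two team columns; the other columns are left untouched
  pivot_longer(cols = c(HomeTeam, AwayTeam),
               names_to = "key", values_to = "Team") %>%
  # One row per (Team, key) pair with its number of occurrences in column n
  count(Team, key) %>%
  # One count column per original variable; absent pairs become 0
  pivot_wider(names_from = key, values_from = n, values_fill = 0L) %>%
  # Rename and order the columns as in the desired output
  select(Team, NumberOfGamesHome = HomeTeam, NumberOfGamesAway = AwayTeam)

print(counts)
```

    With more categorical columns, listing them in cols should give one count column per variable without writing a separate group_by and join for each.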

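    【Comments】:

    • Thanks for the reply. This definitely works for the data I provided. In my actual data I have other numeric and character variables, and with this solution they get gathered into the key "Team" column. Can I exclude these variables from the "group_by" function?
    • Could you update your question to show some sample data and your expected output? You can exclude columns during gather, but the exact syntax depends on your specific data and expected output.
    • Thanks, I have just edited my data to reflect the source dataset more accurately.
    • @perkot Thanks, I have updated my answer; please take a look.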