【Question title】: Using dplyr to summarize and keep the same variable name
【Posted】: 2018-06-29 16:25:30
【Question】:

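I found that data.table and dplyr give different results when I try to do the same thing. I would like to use dplyr syntax but have it compute the way data.table does. The use case is adding subtotals to a table. To do that, I need to aggregate each variable and keep the same variable name (for the transformed version). data.table lets me aggregate a variable and keep the same name; if I then use that variable in another aggregation in the same call, it still uses the untransformed version. dplyr, however, uses the transformed version.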

The summarize documentation says:

# Note that with data frames, newly created summaries immediately
# overwrite existing variables
mtcars %>%
  group_by(cyl) %>%
  summarise(disp = mean(disp), sd = sd(disp))

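This is basically the problem I am running into, but I wonder whether there is a good workaround. One thing I found is to give the transformed variable a different name and rename it at the end, but that doesn't look very clean to me. If there is a good way to do subtotals, that would be great to know too. I looked around this site and didn't see this exact situation discussed. Any help would be greatly appreciated!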
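To make the documentation note concrete, here is a minimal sketch using the same mtcars example. The names `overwritten` and `reordered` are only illustrative. It assumes nothing beyond dplyr itself: in the first call `sd(disp)` sees the already-summarised `disp` (a single value per group), while in the second call the order is swapped so `sd()` still sees the original column.

```r
library(dplyr)

# disp is overwritten first, so sd() receives one value per group and returns NA
overwritten <- mtcars %>%
  group_by(cyl) %>%
  summarise(disp = mean(disp), sd = sd(disp))

# computing sd first means it still sees the original disp column
reordered <- mtcars %>%
  group_by(cyl) %>%
  summarise(sd = sd(disp), disp = mean(disp))
```

Swapping the order only helps when the dependency runs in one direction; if two summaries each need the other's original column, a temporary name is still required.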

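Here is a simple example, once with the data.table result and once with dplyr. I want to take this simple table and append a subtotal row that is the weighted average of the column of interest (Total).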

library(data.table)
library(dplyr)

dt <- data.table(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000))
dt[, Count_Dist := Count/sum(Count)]
dt[, .(Count_Dist = sum(Count_Dist), Weighted_Total = sum(Count_Dist*Total))]

dt <- rbind(dt[, .(Group, Count_Dist, Total)],
      dt[, .(Group = "All", Count_Dist = sum(Count_Dist), Total = sum(Count_Dist*Total))])
setnames(dt, "Total", "Weighted_Avg_Total")

dt

df <- data.frame(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000))

df %>%
  mutate(Count_Dist = Count/sum(Count)) %>%
  summarize(Count_Dist = sum(Count_Dist),
            Weighted_Total = sum(Count_Dist*Total))

df %>% 
  mutate(Count_Dist = Count/sum(Count)) %>%
  select(Group, Count_Dist, Total) %>% 
  rbind(df %>%
          mutate(Count_Dist = Count/sum(Count)) %>%
          summarize(Group = "All",
                    Count_Dist = sum(Count_Dist),
                    Total = sum(Count_Dist*Total))) %>% 
  rename(Weighted_Avg_Total = Total)
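The gap between the two results can be reproduced by hand in base R. This is a sketch, not part of the original post, and the variable names `intended` and `overwritten` are illustrative. The first value uses the per-row weights, which is what the data.table call computes; the second uses the already-summed weight, which is what the dplyr call computes after `Count_Dist` has been overwritten.

```r
Count <- c(1000, 1500, 1200, 2000, 5000)
Total <- c(50, 300, 600, 400, 1000)

# per-row weights, as created by the mutate / := step
Count_Dist <- Count / sum(Count)

# weighted average using the original per-row weights
intended <- sum(Count_Dist * Total)

# same expression once Count_Dist has been replaced by its sum (about 1):
# this collapses to the plain sum of Total
overwritten <- sum(sum(Count_Dist) * Total)
```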

Thanks again for your help!

【Comments】:

    Tags: r variables dplyr data.table summarize


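    【Solution 1】:

    One possible solution is to skip the mutate step: use transmute in place of the first mutate/select step, and in the second part compute the required variables directly from the original variables, without creating an intermediate variable for the summarize step: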

    df %>% 
      transmute(Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total) %>% 
      bind_rows(df %>%
                  summarize(Group = "All",
                            Count_Dist = sum(Count/sum(Count)),
                            Weighted_Avg_Total = sum((Count/sum(Count))*Total)))
    

    which gives:

      Group Count_Dist Weighted_Avg_Total
    1     A 0.09345794            50.0000
    2     B 0.14018692           300.0000
    3     C 0.11214953           600.0000
    4     D 0.18691589           400.0000
    5     E 0.46728972          1000.0000
    6   All 1.00000000           656.0748
    

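    Another possible solution is to change the order in which the new variables are computed in dplyr, and then use select to restore the column order you originally wanted: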

    df %>% 
      mutate(Count_Dist = Count/sum(Count)) %>%
      select(Group, Count_Dist, Weighted_Avg_Total = Total) %>% 
      bind_rows(df %>%
                  mutate(Count_Dist = Count/sum(Count)) %>%
                  summarize(Group = "All",
                            Weighted_Avg_Total = sum(Count_Dist*Total),
                            Count_Dist = sum(Count_Dist)) %>% 
                  select(Group, Count_Dist, Weighted_Avg_Total))
    

    If you also want to include the Count column, you can do this (as per my comment below):

    df %>% 
      transmute(Group = Group, Count_Dist = Count/sum(Count), Weighted_Avg_Total = Total, Count) %>% 
      bind_rows(df %>%
                  summarize(Group = "All",
                            Count_Dist = sum(Count/sum(Count)),
                            Weighted_Avg_Total = sum((Count/sum(Count))*Total),
                            Count = sum(Count)))
    

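    【Comments】:

    • Thanks for your help! Do you know a way I could also keep the Count variable? That way the result would have Group, Count, Count_Dist and Weighted_Avg_Total, along with the "All" group.
    • @Hutch3232 Just add Count = Count to the transmute and Count = sum(Count) to the summarise inside bind_rows. In both cases it is easiest to add them last, which avoids the problem described in the question.
    • Makes sense, thanks again! I also just realized that (unlike rbind) bind_rows does not need the two data.frames to have the same column order. So I put the order I want in the first transmute, and bind_rows forces the next data.frame into that order. I posted our solution in the original post. Thanks!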
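The point about bind_rows matching columns by name can be illustrated with a small sketch. The data frames `a` and `b` are hypothetical toy inputs, not taken from the thread; the only assumption is dplyr's documented behaviour that bind_rows aligns columns by name and keeps the column order of the first input.

```r
library(dplyr)

# first data frame sets the column order of the result
a <- data.frame(Group = "A", Count = 1, Total = 10)

# same columns, deliberately in a different order
b <- data.frame(Total = 20, Group = "All", Count = 2)

# columns are matched by name, not by position
combined <- bind_rows(a, b)
```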
    【Solution 2】:

    Another approach is to use mutate twice, so that Weighted_Total is computed as its own column, and then take the sum of that column in summarize:

    df %>%
      mutate(Count_Dist = Count/sum(Count)) %>%
      mutate(Weighted_Total = Count_Dist*Total) %>%
      summarize(Count_Dist = sum(Count_Dist),
                Weighted_Total = sum(Weighted_Total))
    Result:
      Count_Dist Weighted_Total
    1          1     656.074766
    

    And:

        df %>% 
          mutate(Count_Dist = Count/sum(Count)) %>%
          select(Group, Count_Dist, Total) %>% 
          rbind(df %>%
                  mutate(Count_Dist = Count/sum(Count)) %>%
                  mutate(Weighted_Total = Count_Dist*Total) %>%
                  summarize(Group = "All",
                            Count_Dist = sum(Count_Dist),
                            Total = sum(Weighted_Total))) %>% 
          rename(Weighted_Avg_Total = Total)
    
    Result:
    
          Group   Count_Dist Weighted_Avg_Total
        1     A 0.0934579439          50.000000
        2     B 0.1401869159         300.000000
        3     C 0.1121495327         600.000000
        4     D 0.1869158879         400.000000
        5     E 0.4672897196        1000.000000
        6   All 1.0000000000         656.074766
    

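    【Comments】:

    • Thanks for your help! It seems the two mutates are not strictly necessary, since as far as I can tell the following code produces the same result: df %>% mutate(Count_Dist = Count/sum(Count), Weighted_Total = Count_Dist*Total) %>% summarize(Count_Dist = sum(Count_Dist), Weighted_Total = sum(Weighted_Total))
    • @Hutch3232 Exactly! Changing the mutate that way will give you the desired output.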
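The equivalence mentioned in the comment can be checked with a short sketch, using the data frame from the question. The names `two_steps` and `one_step` are illustrative. It relies on mutate evaluating its arguments in order, so `Weighted_Total` can refer to the `Count_Dist` created earlier in the same call.

```r
library(dplyr)

df <- data.frame(Group = LETTERS[1:5],
                 Count = c(1000, 1500, 1200, 2000, 5000),
                 Total = c(50, 300, 600, 400, 1000))

# two separate mutate calls, as in the answer
two_steps <- df %>%
  mutate(Count_Dist = Count / sum(Count)) %>%
  mutate(Weighted_Total = Count_Dist * Total) %>%
  summarize(Count_Dist = sum(Count_Dist),
            Weighted_Total = sum(Weighted_Total))

# a single mutate call, as suggested in the comment
one_step <- df %>%
  mutate(Count_Dist = Count / sum(Count),
         Weighted_Total = Count_Dist * Total) %>%
  summarize(Count_Dist = sum(Count_Dist),
            Weighted_Total = sum(Weighted_Total))
```

Either way, the summarize step is safe because `Weighted_Total` no longer depends on `Count_Dist` by the time `Count_Dist` is overwritten.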