Variable results with dplyr summarise, depending on output variable naming

【Question title】: Variable results with dplyr summarise, depending on output variable naming
【Posted】: 2016-02-11 20:11:27
【Question】:

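I am using the dplyr package (dplyr 0.4.3; R 3.2.3) to compute basic summaries of grouped data (summarise), but I am getting inconsistent results (NaN for 'sd' and an incorrect count for 'n'). Changing the name of the output variable has variable effects (examples below).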

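Summary of results so far:

The plyr package was not loaded; I am aware that it can cause problems with dplyr if it is loaded first.
The same results are obtained with or without NA data (not shown).
The issue can be worked around by using camelCase variable naming (not shown), or by using output variable names with no non-alphanumeric separator.
Valid results can still be obtained with some combinations of "." or "_" in the output column names.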

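Question: Although this issue can be worked around, am I violating a basic variable naming rule, or is there a programmatic issue that needs to be addressed? I have seen other questions about variable behaviour of summarise, but none exactly like this.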

Thanks, Matt

Example data:

library(dplyr)
df<-data_frame(id=c(1,1,1,2,2,2,3,3,3),
       time=rep(1:3, 3),
       glucose=c(90,150, 200,
                 100,150,200,
                 80,100,150))

Example: sd gives NaN and an inaccurate n

df %>% group_by(time) %>%
  summarise(glucose=mean(glucose, na.rm=TRUE),
        glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time  glucose glucose.sd     n
  (int)    (dbl)      (dbl) (int)
1     1  90.0000        NaN     1
2     2 133.3333        NaN     1
3     3 183.3333        NaN     1

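I wondered whether the problem was the "." in the name, or the use of an output name that already exists as a column in the data frame. Removing the existing df column name from the output fixes the problem: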

df %>% group_by(time) %>%
  summarise(avg=mean(glucose, na.rm=TRUE),
        stdv=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time      avg     stdv     n
  (int)    (dbl)    (dbl) (int)
1     1  90.0000 10.00000     3
2     2 133.3333 28.86751     3
3     3 183.3333 28.86751     3

Removing the "glucose" summary also fixes it, even when "glucose.sd" is kept.

Example: results are OK after removing "glucose"

df %>% group_by(time) %>%
  summarise(glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time glucose.sd     n
  (int)      (dbl) (int)
1     1   10.00000     3
2     2   28.86751     3
3     3   28.86751     3

If I name the first summary "glucose.mean" instead, it works correctly:

df %>% group_by(time) %>%
  summarise(glucose.mean=mean(glucose, na.rm=TRUE),
            glucose.sd=sd(glucose, na.rm=TRUE),
            n=sum(!is.na(glucose)))

   time glucose.mean glucose.sd     n
  (int)        (dbl)      (dbl) (int)
1     1      90.0000   10.00000     3
2     2     133.3333   28.86751     3
3     3     183.3333   28.86751     3

The same error occurs with a variable name that has no ".", so the problem is not just the use of "." in the name:

df %>% group_by(time) %>%
  summarise(glucose=mean(glucose, na.rm=TRUE),
        glucose_sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time  glucose glucose_sd     n
  (int)    (dbl)      (dbl) (int)
1     1  90.0000        NaN     1
2     2 133.3333        NaN     1
3     3 183.3333        NaN     1

Renaming "glucose" to "glucose_mean" works:

df %>% group_by(time) %>%
  summarise(glucose_mean=mean(glucose, na.rm=TRUE),
        glucose_sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

   time glucose_mean glucose_sd     n
  (int)        (dbl)      (dbl) (int)
1     1      90.0000   10.00000     3
2     2     133.3333   28.86751     3
3     3     183.3333   28.86751     3

【Question comments】:

Tags: r dplyr

【Solution 1】:

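The transformations you specify in summarise() are performed in the order they appear, which means that if you change the values of a variable, the new values are what subsequent columns see (this differs from the base function transform()). When you do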

df %>% group_by(time) %>%
  summarise(glucose=mean(glucose, na.rm=TRUE),
        glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))

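the glucose=mean(glucose, na.rm=TRUE) part changes the value of the glucose variable, so when the glucose.sd=sd(glucose, na.rm=TRUE) part is evaluated, sd() does not see the original glucose values; it sees the new value, which is the mean of the original values. That single value per group is also why n comes out as 1. If you reorder the columns, it works.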

df %>% group_by(time) %>%
  summarise(glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)), 
        glucose=mean(glucose, na.rm=TRUE))
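
The overwriting described above can be reproduced by hand in base R, without dplyr. The sketch below walks through the time == 1 group only; the variable names (x, sd_after, n_after, sd_raw, n_raw) are illustrative and not from the original post. One caveat: base R's sd() returns NA for a single value, whereas the dplyr 0.4.3 output in the question shows NaN. That difference presumably comes from dplyr's own implementation, but the mechanism is the same.

```r
# Raw glucose values for the time == 1 group in the example data.
x <- c(90, 100, 80)

# Step 1: the first summary expression replaces the column with its mean.
glucose <- mean(x, na.rm = TRUE)

# Step 2: later expressions now see that single value, not the raw data.
sd_after <- sd(glucose, na.rm = TRUE)  # sd of a single value is undefined
n_after  <- sum(!is.na(glucose))       # only one non-missing value is left

# For comparison, the same summaries computed on the raw values.
sd_raw <- sd(x, na.rm = TRUE)
n_raw  <- sum(!is.na(x))

print(c(sd_after = sd_after, n_after = n_after))
print(c(sd_raw = sd_raw, n_raw = n_raw))
```

The second pair of values matches the correct rows in the question (sd of 10 and n of 3 for time 1), while the first pair reproduces the broken pattern of a missing sd and a count of 1.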

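If you are wondering why this is the default behaviour, it is because it is often convenient to create a column and then use its values in a later transformation. For example, with mutate():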

df %>% group_by(time) %>%
  mutate(glucose_sq = glucose^2,
        glucose_sq_plus2 = glucose_sq+2)
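
For contrast, here is a minimal sketch of how the base function transform() mentioned at the start of this answer behaves. It evaluates every expression against the original columns, so a reassigned column does not affect later expressions, and a column created in the same call is not visible to them. The data frame d and the column names glucose_x2, glucose_sq and glucose_sq_plus2 are made up for illustration.

```r
# Base R only; 'd' is a small illustrative data frame.
d <- data.frame(glucose = c(90, 100, 80))

# 'glucose' is reassigned first, but glucose_x2 is still computed
# from the original values (90, 100, 80), not from the rescaled ones.
res <- transform(d, glucose = glucose / 10, glucose_x2 = glucose * 2)
print(res)

# A column created in the same call cannot be referenced later in that call:
# glucose_sq does not exist yet when glucose_sq_plus2 is evaluated, so this
# fails (assuming no object named glucose_sq exists in the calling environment).
err <- tryCatch(
  transform(d, glucose_sq = glucose^2, glucose_sq_plus2 = glucose_sq + 2),
  error = function(e) conditionMessage(e)
)
print(err)
```

With transform(), chaining dependent columns takes two separate calls. dplyr's sequential evaluation avoids that, at the cost of the name-reuse pitfall described in the question.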

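【Comments】:

Thank you very much. That makes sense. So the real problem I created was using the same name for an output variable and an input variable. It has nothing to do with the separator.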