Weird group_by + mutate + which.max behavior: answers

【Question title】: Weird group_by + mutate + which.max behavior
【Posted】: 2016-08-29 01:15:29
【Question description】:

I ran into unexpected behavior in dplyr:

library(dplyr)

df <- structure(list(date = c("2016-05-02", "2016-05-02", "2016-05-02", 
      "2016-05-02", "2016-05-02", "2016-05-02", "2016-05-02", "2016-05-02", 
      "2016-05-02", "2016-05-02", "2016-05-02", "2016-05-02", "2016-05-02", 
      "2016-05-02", "2016-05-02", "2016-05-02", "2016-05-02", "2016-05-02", 
      "2016-05-02", "2016-05-02", "2016-05-02", "2016-05-02", "2016-05-02", 
      "2016-05-02", "2016-05-02", "2016-05-02", "2016-05-02", "2016-05-02", 
      "2016-05-02", "2016-05-02", "2016-05-02", "2016-05-02", "2016-05-02", 
      "2016-05-02", "2016-05-02", "2016-05-02", "2016-05-02", "2016-05-02", 
      "2016-05-02", "2016-05-02", "2016-05-02", "2016-05-02", "2016-05-02", 
      "2016-05-02", "2016-05-02", "2016-05-02", "2016-05-02", "2016-05-02", 
      "2016-05-03", "2016-05-03", "2016-05-03", "2016-05-03", "2016-05-03", 
      "2016-05-03", "2016-05-03", "2016-05-03", "2016-05-03", "2016-05-03", 
      "2016-05-03", "2016-05-03", "2016-05-03", "2016-05-03", "2016-05-03", 
      "2016-05-03", "2016-05-03", "2016-05-03", "2016-05-03", "2016-05-03", 
      "2016-05-03", "2016-05-03", "2016-05-03", "2016-05-03", "2016-05-03", 
      "2016-05-03", "2016-05-03", "2016-05-03", "2016-05-03", "2016-05-03", 
      "2016-05-03", "2016-05-03", "2016-05-03", "2016-05-03", "2016-05-03", 
      "2016-05-03", "2016-05-03", "2016-05-03", "2016-05-03", "2016-05-03", 
      "2016-05-03", "2016-05-03", "2016-05-03", "2016-05-03", "2016-05-03", 
      "2016-05-03", "2016-05-03", "2016-05-03", "2016-05-04", "2016-05-04", 
      "2016-05-04", "2016-05-04", "2016-05-04", "2016-05-04", "2016-05-04", 
      "2016-05-04", "2016-05-04", "2016-05-04", "2016-05-04", "2016-05-04", 
      "2016-05-04", "2016-05-04", "2016-05-04", "2016-05-04", "2016-05-04", 
      "2016-05-04", "2016-05-04", "2016-05-04", "2016-05-04", "2016-05-04", 
      "2016-05-04", "2016-05-04", "2016-05-04", "2016-05-04", "2016-05-04", 
      "2016-05-04", "2016-05-04", "2016-05-04", "2016-05-04", "2016-05-04", 
      "2016-05-04", "2016-05-04", "2016-05-04", "2016-05-04", "2016-05-04", 
      "2016-05-04", "2016-05-04", "2016-05-04", "2016-05-04", "2016-05-04", 
      "2016-05-04", "2016-05-04", "2016-05-04", "2016-05-04", "2016-05-04", 
      "2016-05-04", "2016-05-05", "2016-05-05", "2016-05-05", "2016-05-05", 
      "2016-05-05", "2016-05-05", "2016-05-05", "2016-05-05", "2016-05-05", 
      "2016-05-05", "2016-05-05", "2016-05-05", "2016-05-05", "2016-05-05", 
      "2016-05-05", "2016-05-05", "2016-05-05", "2016-05-05", "2016-05-05", 
      "2016-05-05", "2016-05-05", "2016-05-05", "2016-05-05", "2016-05-05", 
      "2016-05-05", "2016-05-05", "2016-05-05", "2016-05-05", "2016-05-05", 
      "2016-05-05", "2016-05-05", "2016-05-05", "2016-05-05", "2016-05-05", 
      "2016-05-05", "2016-05-05", "2016-05-05", "2016-05-05", "2016-05-05", 
      "2016-05-05", "2016-05-05", "2016-05-05", "2016-05-05", "2016-05-05", 
      "2016-05-05", "2016-05-05", "2016-05-05", "2016-05-05", "2016-05-06", 
      "2016-05-06", "2016-05-06", "2016-05-06", "2016-05-06", "2016-05-06", 
      "2016-05-06", "2016-05-06", "2016-05-06", "2016-05-06", "2016-05-06", 
      "2016-05-06", "2016-05-06", "2016-05-06", "2016-05-06", "2016-05-06", 
      "2016-05-06", "2016-05-06", "2016-05-06", "2016-05-06", "2016-05-06", 
      "2016-05-06", "2016-05-06", "2016-05-06", "2016-05-06", "2016-05-06", 
      "2016-05-06", "2016-05-06", "2016-05-06", "2016-05-06", "2016-05-06", 
      "2016-05-06", "2016-05-06", "2016-05-06", "2016-05-06", "2016-05-06", 
      "2016-05-06", "2016-05-06", "2016-05-06", "2016-05-06", "2016-05-06", 
      "2016-05-06", "2016-05-06", "2016-05-06", "2016-05-06", "2016-05-06", 
      "2016-05-06", "2016-05-06"), abc = c(NA, NA, NA, NA, NA, NA, 
         NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
         NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
         NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 20, 20, 16, 
         14, 9, 8, 6, 5, 5, 6, 7, 13, 24, 52, 65, 68, 66, 65, 58, 47, 
         21, 6, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
         1, 1, 0, 0, 0, 0, 0, 10, 19, 19, 15, 11, 8, 8, 5, 4, 4, 4, 5, 
         9, 17, 31, 43, 49, 52, 52, 47, 32, 21, 6, 2, 1, 1, 1, 1, 1, 1, 
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 5, 14, 
         14, 14, 15, 18, 18, 14, 14, 14, 15, 19, 29, 46, 58, 62, 69, 71, 
         67, 56, 40, 25, 8, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
         2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 10, 18, 18, 14, 12, 9, 7, 5, 
         4, 5, 5, 7, 9, 17, 30, 36, 49, 52, 54, 54, 42, 32, 15, 5, 1)), 
     class = "data.frame", row.names = c(NA, -240L), .Names = c("date", "abc"))


df %>%
  group_by(date) %>%
  mutate(peak_max_index = as.numeric(which.max(as.numeric(abc))))

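I expect peak_max_index to be 41 for all rows where date is 2016-05-04. Strangely, peak_max_index is NA instead. Even stranger, if you drop all rows where date is 2016-05-03 before running the dplyr command, the result is exactly right. Is this a bug?

【Question discussion】: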

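Have you tried df %>% group_by(date) %>% mutate(peak_max_index = as.numeric(which.max(as.numeric(abc)))) %>% filter(date == '2016-05-04')? That shows the first part is doing the right thing. What does packageVersion('dplyr') show?
That gives me the same result. The package version is 0.4.3.
So which part of the result is weird? I saved the result of your command to df, subset it with df[df$date == '2016-05-04', ], and still get 41 for all rows.
By the way, there are some bugs in dplyr 0.4.3 (unrelated to this question), so I use the development version 0.4.3.9001.
For the same subset, I get NA for all rows. I may have to try the development version.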
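
One way to cross-check the expected per-group index without going through dplyr at all is to split the column by date in base R. The sketch below is only an illustration: `toy` is a hypothetical small data frame standing in for the original df, with one all-NA group like 2016-05-02 in the question.

```r
# Hypothetical toy data (not the original df): the first group is all NA.
toy <- data.frame(
  date = rep(c("2016-05-02", "2016-05-03", "2016-05-04"), times = c(3, 4, 3)),
  abc  = c(NA, NA, NA, 0, 5, 9, 9, 1, 4, 4),
  stringsAsFactors = FALSE
)

# Split abc by date and take which.max within each group.
idx <- lapply(split(toy$abc, toy$date), which.max)

# The all-NA group yields a zero-length result rather than NA.
print(idx)
```

If the base R result for a date disagrees with the grouped mutate, that points at the grouping step rather than at which.max itself.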

Tags: r dplyr

【Solution 1】:

You are evaluating NAs inside which.max(). Just remove the NAs with !is.na().

df %>%
    group_by(date) %>%
    mutate(peak_max_index = max(df$abc[!is.na(df$abc)]))
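
For context on the suggestion above, here is a small sketch of how max and which.max treat missing values. The vector `x` is made up for illustration. As documented, which.max discards NA values on its own, so filtering them out first is not needed and changes the positions that come back.

```r
# Made-up vector with NAs around the maximum.
x <- c(NA, 3, 7, NA, 7)

# which.max skips NAs but reports the position in the original vector.
idx_full <- which.max(x)

# Filtering first shifts positions: the index now refers to c(3, 7, 7).
idx_filtered <- which.max(x[!is.na(x)])

# max returns the value, not its position.
peak <- max(x, na.rm = TRUE)
```

Here idx_full is 3, idx_filtered is 2, and peak is 7, so the three calls answer different questions.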

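【Discussion】:

I'm not. There are no NAs in abc in the group that produces NA, and I'm looking for which.max, not max.
Then you need to use aggregate(. ~ date, df, FUN = max, na.action = NULL). You can then match the maxima back to df where the dates are equal, or use merge(df, df.agg, by = "date").
That still won't give me the desired result, which is the index of the first of the maxima.
Then summarise and merge with dplyr should do it: df %>% group_by(date) %>% summarise(as.numeric(which.max(as.numeric(abc)))) %>% merge(df, by = "date")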
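
The summarise-then-merge idea from the last comment can be sketched as follows. This is a minimal sketch under stated assumptions, not a confirmed fix for the behavior in dplyr 0.4.3: `toy` is hypothetical stand-in data, and `first_max` is a hypothetical helper that returns NA for an all-NA group, because which.max returns a zero-length vector there and a summary needs exactly one value per group.

```r
library(dplyr)

# Hypothetical toy data (not the original df): the first group is all NA.
toy <- data.frame(
  date = rep(c("2016-05-02", "2016-05-03", "2016-05-04"), times = c(3, 4, 3)),
  abc  = c(NA, NA, NA, 0, 5, 9, 9, 1, 4, 4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: index of the first maximum, or NA if every value is NA.
first_max <- function(x) {
  i <- which.max(x)
  if (length(i) == 0L) NA_integer_ else i
}

# One row per date holding the within-group index of the first maximum.
peaks <- toy %>%
  group_by(date) %>%
  summarise(peak_max_index = first_max(abc))

# Attach the per-date index back to every original row.
result <- merge(toy, peaks, by = "date")
```

Naming the summary column explicitly (peak_max_index) keeps the merged result readable, and the guard keeps every group's result the same length and type.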