dplyr: keep empty levels of a factor, but not empty levels of a combination of factors that don't appear in the data

【Question title】: dplyr: keep empty levels of factor but not empty levels of a combination of factors that don't appear in data
【Posted】: 2019-10-13 18:35:32
【Question】:

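When grouping and summarising with dplyr, what is the right way to keep the empty levels of each grouping factor without also keeping empty combinations of several grouping factors?

For example, consider data recorded at different times across multiple sites. I might filter and then compute something for each year within each site. If the filter removes a year entirely, I want the summary's default value for an empty vector. Site "a" has 10 years and site "b" has 1 year, so I always want 11 rows in the summary.

If I use .drop = TRUE in group_by, I lose the years that the filter emptied: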

library(dplyr)
library(zoo)
library(lubridate)

set.seed(1)

df <- data.frame(site = factor(c(rep("a", 120), rep("b", 12))),
                 date = c(seq.Date(as.Date("2000/1/1"), by = "month", length.out = 120), seq.Date(as.Date("2000/1/1"), by = "month", length.out = 12)),
                 value = rnorm(132, 50, 10))
df$year <- factor(lubridate::year(df$date))

df %>% 
  filter(value > 65) %>%
  group_by(site, year, .drop = TRUE) %>%
  summarise(f = first(date))
#> # A tibble: 6 x 3
#> # Groups:   site [1]
#>   site  year  f         
#>   <fct> <fct> <date>    
#> 1 a     2000  2000-04-01
#> 2 a     2004  2004-08-01
#> 3 a     2005  2005-01-01
#> 4 a     2007  2007-11-01
#> 5 a     2008  2008-10-01
#> 6 a     2009  2009-02-01

With .drop = FALSE, I get all the extra years for site "b", none of which appear in the original data:

df %>% 
  filter(value > 65) %>%
  group_by(site, year, .drop = FALSE) %>%
  summarise(f = first(date))
#> # A tibble: 20 x 3
#> # Groups:   site [2]
#>    site  year  f         
#>    <fct> <fct> <date>    
#>  1 a     2000  2000-04-01
#>  2 a     2001  NA        
#>  3 a     2002  NA        
#>  4 a     2003  NA        
#>  5 a     2004  2004-08-01
#>  6 a     2005  2005-01-01
#>  7 a     2006  NA        
#>  8 a     2007  2007-11-01
#>  9 a     2008  2008-10-01
#> 10 a     2009  2009-02-01
#> 11 b     2000  NA        
#> 12 b     2001  NA        
#> 13 b     2002  NA        
#> 14 b     2003  NA        
#> 15 b     2004  NA        
#> 16 b     2005  NA        
#> 17 b     2006  NA        
#> 18 b     2007  NA        
#> 19 b     2008  NA        
#> 20 b     2009  NA
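
The 20 rows follow from how .drop = FALSE behaves: it keeps every combination of the levels of the grouping factors, so the number of groups is the product of the level counts rather than the number of combinations observed. A minimal check under that reading is sketched below. It rebuilds the example data, using base R format() in place of lubridate::year() (a substitution made here only so that dplyr is the sole dependency); the names n_crossed, n_observed and n_groups are illustrative.

```r
library(dplyr)

set.seed(1)
df <- data.frame(
  site = factor(c(rep("a", 120), rep("b", 12))),
  date = c(seq.Date(as.Date("2000/1/1"), by = "month", length.out = 120),
           seq.Date(as.Date("2000/1/1"), by = "month", length.out = 12)),
  value = rnorm(132, 50, 10)
)
# Assumption: format(date, "%Y") stands in for lubridate::year(date)
df$year <- factor(format(df$date, "%Y"))

# Full cross product of factor levels: 2 sites x 10 years
n_crossed <- nlevels(df$site) * nlevels(df$year)

# Combinations that actually occur in the unfiltered data
n_observed <- nrow(distinct(df, site, year))

# Number of groups produced when empty combinations are kept
n_groups <- nrow(count(df, site, year, .drop = FALSE))
```

Here n_crossed and n_groups are both 20, while n_observed is 11, which is the row count the summary is supposed to have.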

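The best I can come up with is to compute counts, join them on, filter, and then drop the count variable, but that is cumbersome. I know .drop was only recently added to dplyr, and it works well for a single factor, but is there a clean way to do this for multiple factors?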

df %>% 
  filter(value > 65) %>%
  group_by(site, year, .drop = FALSE) %>%
  summarise(f = first(date)) %>%
  left_join(df %>% count(site, year, .drop = FALSE), by = c("site", "year")) %>%
  filter(n > 0) %>%
  select(-n)
#> # A tibble: 11 x 3
#> # Groups:   site [2]
#>    site  year  f         
#>    <fct> <fct> <date>    
#>  1 a     2000  2000-04-01
#>  2 a     2001  NA        
#>  3 a     2002  NA        
#>  4 a     2003  NA        
#>  5 a     2004  2004-08-01
#>  6 a     2005  2005-01-01
#>  7 a     2006  NA        
#>  8 a     2007  2007-11-01
#>  9 a     2008  2008-10-01
#> 10 a     2009  2009-02-01
#> 11 b     2000  NA
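
A somewhat shorter variant of the same idea, offered as a sketch rather than a definitive answer: take the site/year combinations observed in the unfiltered data with distinct(), then left-join the filtered summary onto them. The names observed, filtered_summary and result are introduced here for illustration, and format() again stands in for lubridate::year() so that only dplyr is needed.

```r
library(dplyr)

set.seed(1)
df <- data.frame(
  site = factor(c(rep("a", 120), rep("b", 12))),
  date = c(seq.Date(as.Date("2000/1/1"), by = "month", length.out = 120),
           seq.Date(as.Date("2000/1/1"), by = "month", length.out = 12)),
  value = rnorm(132, 50, 10)
)
# Assumption: format(date, "%Y") stands in for lubridate::year(date)
df$year <- factor(format(df$date, "%Y"))

# Every site/year combination present before filtering
observed <- distinct(df, site, year)

# Summary over the filtered rows only; groups emptied by the filter vanish
filtered_summary <- df %>%
  filter(value > 65) %>%
  group_by(site, year) %>%
  summarise(f = first(date)) %>%
  ungroup()

# Re-attach the emptied groups; their f is filled with NA by the join
result <- observed %>%
  left_join(filtered_summary, by = c("site", "year")) %>%
  arrange(site, year)
```

One caveat with this design: the join fills emptied groups with NA, which matches first() on an empty vector but would not reproduce a summary such as n(), which returns 0 for an empty group under .drop = FALSE.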

【Comments】:

Why keep only b 2000 and not any other level such as b 2001?
In the original data, site b only has data for 2000 (12 rows).

Tags: r dplyr

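【Solution 1】:

Not sure whether this is what you are after.

If, instead of filtering the rows out, you replace the date with NA wherever value < 65, you can group and summarise as usual.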



df %>% 
  mutate(date = replace(date, value < 65, NA)) %>%
  group_by(site, year) %>%
  summarise(f = first(date[!is.na(date)]))

# A tibble: 11 x 3
# Groups:   site [2]
   site  year  f         
   <fct> <fct> <date>    
 1 a     2000  NA        
 2 a     2001  NA        
 3 a     2002  2002-03-01
 4 a     2003  NA        
 5 a     2004  NA        
 6 a     2005  NA        
 7 a     2006  2006-02-01
 8 a     2007  NA        
 9 a     2008  2008-07-01
10 a     2009  2009-02-01
11 b     2000  2000-08-01

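【Comments】:

Thanks. This can be simplified to df %>% group_by(site, year) %>% summarise(f = first(date[value > 65])). It would be nice if this could be done entirely in dplyr, without having to use [, which is slow on large data.
Or simply do df %>% group_by(site, year) %>% summarise(fn = first(date[value > 65]))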
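
The one-liner from the comments can be run as below. The sketch also shows one possible variation, date[match(TRUE, value > 65)], which looks up the position of the first matching row and indexes a single element instead of building the whole subset. That variation is an illustration added here, not something proposed in the thread, and whether it is meaningfully faster on large data has not been measured. The names by_subset and by_match are hypothetical, and format() stands in for lubridate::year().

```r
library(dplyr)

set.seed(1)
df <- data.frame(
  site = factor(c(rep("a", 120), rep("b", 12))),
  date = c(seq.Date(as.Date("2000/1/1"), by = "month", length.out = 120),
           seq.Date(as.Date("2000/1/1"), by = "month", length.out = 12)),
  value = rnorm(132, 50, 10)
)
# Assumption: format(date, "%Y") stands in for lubridate::year(date)
df$year <- factor(format(df$date, "%Y"))

# Variant from the comments: subset inside summarise, no filter() step,
# so every observed site/year group survives
by_subset <- df %>%
  group_by(site, year) %>%
  summarise(f = first(date[value > 65])) %>%
  ungroup()

# Illustrative variation: match() gives the index of the first TRUE,
# or NA when no row qualifies, and indexing a Date with NA yields NA
by_match <- df %>%
  group_by(site, year) %>%
  summarise(f = date[match(TRUE, value > 65)]) %>%
  ungroup()
```

Both results have one row per site/year combination observed in the data, with NA where no value in the group exceeds 65.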