【Question title】: How do I get back to the non-aggregated data after filtering out groups in dplyr within R?
【Posted】: 2019-01-24 18:52:23
【Question】:

Suppose I have data like this:

id <- c(1, 1, 2, 2, 3, 3)
date <- as.Date(c('2018-01-02', '2018-01-03', '2017-07-01', '2018-01-02', '2017-08-02', '2017-08-03'))
df <- data.frame(id, date)

id date
1  2018-01-02
1  2018-01-03
2  2017-07-01
2  2018-01-02
3  2017-08-02
3  2017-08-03

I want to drop every id that has no date earlier than 2018-01-01. This is the table I am trying to get:

id date
2  2017-07-01
2  2018-01-02
3  2017-08-02
3  2017-08-03

I can filter out the groups I don't want with this:

library(dplyr)
df %>%
  group_by(id) %>%
  summarise(min_date = min(date)) %>%
  filter(min_date <= as.Date('2018-01-01'))

But this gives me the aggregated result:

 id min_date    
  2 2017-07-01
  3 2017-08-02

What I really want is the original, non-aggregated data with the rows for id 1 removed.

I am using sparklyr with dplyr.

【Comments】:

Try it without summarise. Grouping the data lets you filter on the per-group minimum directly, e.g. filter(min(date) <= as.Date("2018-01-01"))
That seems to work. Thanks a lot.
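
The suggestion in the comment above can be sketched roughly as follows. This is a minimal example on the local data frame from the question, not on a Spark table; the names `cutoff` and `kept` are introduced here for illustration only.

```r
library(dplyr)

# Data from the question
df <- data.frame(
  id   = c(1, 1, 2, 2, 3, 3),
  date = as.Date(c('2018-01-02', '2018-01-03', '2017-07-01',
                   '2018-01-02', '2017-08-02', '2017-08-03'))
)

cutoff <- as.Date('2018-01-01')  # illustrative name, not from the original post

# Group, then filter on a per-group aggregate instead of summarising.
# min(date) is evaluated once per id, so a whole group is either kept or dropped.
kept <- df %>%
  group_by(id) %>%
  filter(min(date) <= cutoff) %>%
  ungroup()

print(kept)
```

Because no summarise step is involved, the original rows and columns come through unchanged; ungroup() only removes the grouping attribute from the result.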

Tags: r group-by dplyr sparklyr

【Solution 1】:

You can use group_by %>% filter with a filter condition that is aggregated per group:

df %>% group_by(id) %>% filter(any(date < '2018-01-01'))
# Note: any(date < '2018-01-01') returns a single logical value per group,
# which determines whether all rows of that group are kept or dropped.

# A tibble: 4 x 2
# Groups:   id [2]
#     id date      
#  <dbl> <date>    
#1     2 2017-07-01
#2     2 2018-01-02
#3     3 2017-08-02
#4     3 2017-08-03

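【Discussion】:

This basically works. The only thing I changed is that I used min instead of any. I guess the version of sparklyr I am using does not recognize any. Thanks for your help.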
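
If a backend cannot translate a per-group filter at all, one possible fallback is to keep the summarise step from the question and join its result back to the original rows with semi_join. This approach is not mentioned in the thread, so treat it as an assumption; the sketch below runs on the local data frame only, and its behaviour on a Spark table has not been checked here. The names `cutoff`, `keep_ids`, `via_join`, and `via_filter` are illustrative.

```r
library(dplyr)

# Data from the question
df <- data.frame(
  id   = c(1, 1, 2, 2, 3, 3),
  date = as.Date(c('2018-01-02', '2018-01-03', '2017-07-01',
                   '2018-01-02', '2017-08-02', '2017-08-03'))
)

cutoff <- as.Date('2018-01-01')  # illustrative name

# Step 1: the aggregated table from the question, one row per id to keep
keep_ids <- df %>%
  group_by(id) %>%
  summarise(min_date = min(date)) %>%
  filter(min_date <= cutoff)

# Step 2: keep only the original rows whose id appears in keep_ids
via_join <- df %>%
  semi_join(keep_ids, by = "id")

# For comparison: the grouped filter with min, as used in the discussion
via_filter <- df %>%
  group_by(id) %>%
  filter(min(date) <= cutoff) %>%
  ungroup()

print(via_join)
```

On this data both routes keep the same four rows. semi_join only filters `df` and never adds columns from `keep_ids`, so the output has the same shape as the original table.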