Answers: intermediate/advanced filtering in dplyr across groups

【Question title】: Intermediate/advanced filtering in dplyr across groups
【Posted】: 2021-01-15 08:42:44
【Question description】:

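I offered to help a friend with a problem, but quickly realized it was beyond my abilities. I want to filter out the records of one group that fall on or after the first record of another group.

I am grouping by 'species', 'year', and 'sex', and want to remove all records where 'sex' is 'f' that occur after the first 'observation_doy' of an 'm' record. In this example, the records I want to remove are shown in bold.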

library(tidyverse)
library(janitor)

Original:

    species               generations  year  observation_doy sex
1   Linnaea borealis       partially bimodal    2009    165 f
2   Linnaea borealis       partially bimodal    2010    150 f
3   Linnaea borealis       partially bimodal    2010    155 f
4   Linnaea borealis       partially bimodal    2010    160 m
**5 Linnaea borealis       partially bimodal    2010    160 f**
6   helianthus deserticola  partially bimodal   2009    174 f
7   helianthus deserticola  partially bimodal   2009    174 f
8   helianthus deserticola  partially bimodal   2009    180 m
**9 helianthus deserticola  partially bimodal   2009    180 f**
10  helianthus deserticola  partially bimodal   2009    184 m
11  helianthus deserticola  partially bimodal   2010    174 f
12  helianthus deserticola  partially bimodal   2010    174 f
**13    helianthus deserticola  partially bimodal   2010    180 f**
14  helianthus deserticola  partially bimodal   2010    180 m
15  helianthus deserticola  partially bimodal   2010    184 m
16  helianthus deserticola  partially bimodal   2011    174 f
17  helianthus deserticola  partially bimodal   2011    174 f
18  helianthus deserticola  partially bimodal   2011    180 f
19  helianthus deserticola  partially bimodal   2011    180 m
**20    helianthus deserticola  partially bimodal   2011    184 f
21  helianthus deserticola  partially bimodal   2011    184 f**
22  helianthus bolanderi    partially bimodal   2009    174 f
23  helianthus bolanderi    partially bimodal   2009    174 f
24  helianthus bolanderi    partially bimodal   2009    180 m
**25    helianthus bolanderi    partially bimodal   2009    180 f**
26  helianthus bolanderi    partially bimodal   2009    184 m
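
For anyone who wants to run the code in this thread, the table above can be typed in as a data frame. This is only a transcription of the 26 example rows (bold markers dropped); the name `my_data` is an assumption borrowed from the answer below, and `year` and `observation_doy` are entered as plain numbers.

```r
library(tibble)

# Transcription of the 26 example rows; `my_data` is an assumed name
my_data <- tribble(
  ~species,                 ~generations,        ~year, ~observation_doy, ~sex,
  "Linnaea borealis",       "partially bimodal", 2009, 165, "f",
  "Linnaea borealis",       "partially bimodal", 2010, 150, "f",
  "Linnaea borealis",       "partially bimodal", 2010, 155, "f",
  "Linnaea borealis",       "partially bimodal", 2010, 160, "m",
  "Linnaea borealis",       "partially bimodal", 2010, 160, "f",
  "helianthus deserticola", "partially bimodal", 2009, 174, "f",
  "helianthus deserticola", "partially bimodal", 2009, 174, "f",
  "helianthus deserticola", "partially bimodal", 2009, 180, "m",
  "helianthus deserticola", "partially bimodal", 2009, 180, "f",
  "helianthus deserticola", "partially bimodal", 2009, 184, "m",
  "helianthus deserticola", "partially bimodal", 2010, 174, "f",
  "helianthus deserticola", "partially bimodal", 2010, 174, "f",
  "helianthus deserticola", "partially bimodal", 2010, 180, "f",
  "helianthus deserticola", "partially bimodal", 2010, 180, "m",
  "helianthus deserticola", "partially bimodal", 2010, 184, "m",
  "helianthus deserticola", "partially bimodal", 2011, 174, "f",
  "helianthus deserticola", "partially bimodal", 2011, 174, "f",
  "helianthus deserticola", "partially bimodal", 2011, 180, "f",
  "helianthus deserticola", "partially bimodal", 2011, 180, "m",
  "helianthus deserticola", "partially bimodal", 2011, 184, "f",
  "helianthus deserticola", "partially bimodal", 2011, 184, "f",
  "helianthus bolanderi",   "partially bimodal", 2009, 174, "f",
  "helianthus bolanderi",   "partially bimodal", 2009, 174, "f",
  "helianthus bolanderi",   "partially bimodal", 2009, 180, "m",
  "helianthus bolanderi",   "partially bimodal", 2009, 180, "f",
  "helianthus bolanderi",   "partially bimodal", 2009, 184, "m"
)
```

Note that the 2009 Linnaea borealis group contains no male records at all, which matters for any approach that compares against a per-group minimum.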

Desired result:

    
    species               generations  year  observation_doy sex
1   Linnaea borealis       partially bimodal    2009    165 f
2   Linnaea borealis       partially bimodal    2010    150 f
3   Linnaea borealis       partially bimodal    2010    155 f
4   Linnaea borealis       partially bimodal    2010    160 m
5   helianthus deserticola  partially bimodal   2009    174 f
6   helianthus deserticola  partially bimodal   2009    174 f
7   helianthus deserticola  partially bimodal   2009    180 m
8   helianthus deserticola  partially bimodal   2009    180 f
9   helianthus deserticola  partially bimodal   2009    184 m
10  helianthus deserticola  partially bimodal   2010    174 f
11  helianthus deserticola  partially bimodal   2010    174 f
12  helianthus deserticola  partially bimodal   2010    180 m
13  helianthus deserticola  partially bimodal   2010    184 m
14  helianthus deserticola  partially bimodal   2011    174 f
15  helianthus deserticola  partially bimodal   2011    174 f
16  helianthus deserticola  partially bimodal   2011    180 m
17  helianthus bolanderi    partially bimodal   2009    174 f
18  helianthus bolanderi    partially bimodal   2009    174 f
19  helianthus bolanderi    partially bimodal   2009    180 m
20  helianthus bolanderi    partially bimodal   2009    184 m

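The original dataset has about 10k records in 10 columns, so it is very manageable. But I don't seem to have a good way to approach this problem. Below are a few things I tried, along with why I suspect they failed; again, I suspect none of these are valid approaches.

Finding the first male emergence date with a group by, a filter, and a slice is easy enough. I suspect I could then build a numeric vector from it and use %notin% (a negated %in%) to remove the female records after that date. But I don't know how to restrict %notin% to that subset without filtering down the dataset. This approach also seems clumsy, and it would create several intermediate objects.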

first_male_emergence <- df %>% dplyr::filter(generations == 'partially bimodal') %>% 
  dplyr::group_by(species, year, sex) %>% 
  dplyr::filter(sex == 'm') %>% 
  dplyr::slice_min(sampling_doy, n=1)

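I also tried creating a "two-sided" filter, which does not seem to fit the dplyr philosophy. I think the problem with this approach is getting dplyr to recognize the primary filtering criterion, which in this case is "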

clean_df <- raw_df %>%  
  group_by(species, year, sex) %>%
  dplyr::filter(sex == 'f' & sampling_doy <= print(filter(sex == 'm',(slice_min(sampling_doy, n = 1)))))

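Note that I wrapped the right-hand expression in 'print', in an attempt to return the numeric value for

Finally, I tried case_when, but with so many operators the function also has trouble working out where the LHS and RHS are.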

clean_df <- raw_df %>% 
  group_by(species, year, sex) %>% 
  mutate(sampling_doy_1 = case_when(
    (sex == 'f' & sampling_doy <=
       filter(sex == 'm',(slice_max(sampling_doy, n = 1)) ~ sampling_doy, 
    (sex == 'f' & sampling_doy >= 
       filter(sex == 'm',(slice_max(sampling_doy, n = 1)) ~ NA,
  ))))))

I also tried a variant:

clean_df <- raw_df %>% 
  dplyr::filter(generations == 'partially bivoltine') %>% 
  group_by(species, year, sex) %>% 
  mutate(sampling_doy_1 = case_when(
  sex == 'f' & sampling_doy < sex == 'm', sampling_doy ~ sampling_doy,
  sex == 'f' & sampling_doy > sex == 'm', sampling_doy ~ NA,
))

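I also considered using arrange within the group by and then slicing all female records before the first male record. However, that does not seem like a good approach either.

So my first question is: can anyone solve this? It seems to me that using case_when, with the left-hand expression turned into a function called inside case_when, would be the best way to go. At the same time, judging from the other examples I have found, I feel this is not at all how case_when is meant to be used. I sometimes put some very simple math in there, but usually simple math that behaves the same way across the column of interest.

My second question is: is there a generally recommended approach for this kind of problem, or do I have to fall back on writing functions to do things like this?

Apologies for the length of this post, but any help is greatly appreciated. I have also tagged data.table, since it seems likely to have a nice, simple solution.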

【Question discussion】:

Tags: r dplyr datatable data-wrangling

【Solution 1】:

This can be done with a join.

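# First day of year on which a male was observed, per species and year
male_first_obs <- my_data %>%
  group_by(species, year) %>%
  filter(sex == "m") %>%
  summarize(male_first_obs_doy = min(observation_doy), .groups = "drop")

# Attach that day to every row, then drop females observed after it.
# Groups with no males get NA from the join; the is.na() check keeps them.
my_data %>%
  left_join(male_first_obs, by = c("species", "year")) %>%
  group_by(species, year) %>%
  filter(is.na(male_first_obs_doy) |
           !(sex == "f" & observation_doy > male_first_obs_doy)) %>%
  select(-male_first_obs_doy)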
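
One thing to watch with the join approach is what happens to a species/year group that has no male records, such as the 2009 Linnaea borealis row in the question. The sketch below uses a made-up four-row data set (the names `toy`, `first_male`, `unguarded`, and `guarded` are illustrative, not from the thread) to show that `left_join()` leaves `NA` for such a group, and that `filter()` drops rows whose condition evaluates to `NA` unless that case is handled explicitly.

```r
library(dplyr)

# Hypothetical mini data set: group A has a male, group B has none
toy <- tibble::tribble(
  ~species, ~year, ~observation_doy, ~sex,
  "A", 2010, 150, "f",
  "A", 2010, 160, "m",
  "A", 2010, 170, "f",
  "B", 2009, 165, "f"
)

# First male day per group; group B produces no row here
first_male <- toy %>%
  filter(sex == "m") %>%
  group_by(species, year) %>%
  summarize(male_first_obs_doy = min(observation_doy), .groups = "drop")

joined <- toy %>%
  left_join(first_male, by = c("species", "year"))

# Without a guard, the condition is NA for group B and filter() drops the row
unguarded <- joined %>%
  filter(!(sex == "f" & observation_doy > male_first_obs_doy))

# With the guard, groups that have no males are left untouched
guarded <- joined %>%
  filter(is.na(male_first_obs_doy) |
           !(sex == "f" & observation_doy > male_first_obs_doy))
```

Here `unguarded` keeps only the first two rows of group A, while `guarded` also keeps the lone female in group B, which is what the desired result in the question shows for the 2009 Linnaea borealis record.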

If you don't care about dbplyr compatibility and the like, you can probably make it more concise, for example:

my_data %>%
  group_by(species, year) %>%
  # min() of an empty vector is Inf (with a warning), so groups without males keep every row
  mutate(male_first_obs_doy = min(observation_doy[which(sex == "m")])) %>%
  filter(!(sex == "f" & observation_doy > male_first_obs_doy)) %>%
  select(-male_first_obs_doy)
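
Since the question is also tagged data.table, here is a rough data.table equivalent of the grouped `mutate()` version. It is only a sketch on a made-up four-row table (`dt`, `first_male_doy`, and `result` are illustrative names), and it assumes `observation_doy` is stored as a double so that the `Inf` fallback for groups without males has a matching type.

```r
library(data.table)

# Hypothetical mini data set, not the asker's data
dt <- data.table(
  species = c("A", "A", "A", "B"),
  year = c(2010, 2010, 2010, 2009),
  observation_doy = c(150, 160, 170, 165),
  sex = c("f", "m", "f", "f")
)

# First male day per species/year; Inf when a group has no males
dt[, first_male_doy := if (any(sex == "m")) min(observation_doy[sex == "m"]) else Inf,
   by = .(species, year)]

# Keep everything except females observed after that day,
# then remove the helper column from the result
result <- dt[!(sex == "f" & observation_doy > first_male_doy)]
result[, first_male_doy := NULL]
```

The explicit `if (any(sex == "m"))` check avoids the warning that `min()` raises on an empty vector.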

【Discussion】:
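
One detail worth checking against the example: row 5 in the question (Linnaea borealis, 2010, day 160, 'f') is bolded for removal even though the first male that year was also recorded on day 160. A strict `>` comparison keeps such same-day female records. Whether they should be dropped is not settled in the thread, so the sketch below makes it a switch. The function name `drop_late_females` and its `include_same_day` argument are hypothetical, as is the toy data.

```r
library(dplyr)

# Hypothetical helper, not part of dplyr
drop_late_females <- function(data, include_same_day = FALSE) {
  data %>%
    group_by(species, year) %>%
    # Inf when the group has no males, so nothing is dropped in that group
    mutate(first_male_doy = if (any(sex == "m")) {
      min(observation_doy[sex == "m"])
    } else {
      Inf
    }) %>%
    ungroup() %>%
    # Drop females after the first male day, and optionally those on the same day
    filter(!(sex == "f" &
               (observation_doy > first_male_doy |
                  (include_same_day & observation_doy == first_male_doy)))) %>%
    select(-first_male_doy)
}

# Made-up data: a same-day female in group A, and a group B with no males
toy <- tibble::tribble(
  ~species, ~year, ~observation_doy, ~sex,
  "A", 2010, 150, "f",
  "A", 2010, 160, "m",
  "A", 2010, 160, "f",
  "A", 2010, 170, "f",
  "B", 2009, 165, "f"
)

strict <- drop_late_females(toy)
inclusive <- drop_late_females(toy, include_same_day = TRUE)
```

With the default, only the day-170 female is removed; with `include_same_day = TRUE`, the day-160 female goes as well. Group B is untouched in both cases.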