Lagging in R when a condition is matched: answers

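【Question title】: Lagging in R when a condition is matched
【Posted】: 2020-01-27 03:03:05
【Question description】:

I have a data frame with only the date of a medical exam and whether an infection was present (yes/no), and I want to add a third column giving the date of the last infection. If the patient has had no previous infection, the new last_infection column should be NA. If they have had one, it should show the date of their most recent earlier visit at which they tested "yes" for infection.

I would like the output to look like this: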

date      infection   last_infection
01-01-18  no          NA
06-01-18  no          NA
07-01-18  yes         NA
09-01-18  no          07-01-18
01-01-19  no          07-01-18
02-01-19  yes         07-01-18
03-01-19  yes         02-01-19
04-01-19  no          03-01-19
05-01-19  no          03-01-19

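How can I do this in R? Can a function like lag() check a condition, or should I be doing something else entirely?

【Question comments】:

Tags: r function lag

【Solution 1】:

I would suggest something like this. If you use fill from the tidyr package, there is no reason to use cumsum or grouping.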

library(tidyverse)

df %>% 
  mutate(
    last_infection = if_else(lag(infection) == "yes", lag(date), NA_character_)
  ) %>% 
  fill(last_infection)
#> # A tibble: 9 x 3
#>   date     infection last_infection
#>   <chr>    <chr>     <chr>         
#> 1 01-01-18 no        <NA>          
#> 2 06-01-18 no        <NA>          
#> 3 07-01-18 yes       <NA>          
#> 4 09-01-18 no        07-01-18      
#> 5 01-01-19 no        07-01-18      
#> 6 02-01-19 yes       07-01-18      
#> 7 03-01-19 yes       02-01-19      
#> 8 04-01-19 no        03-01-19      
#> 9 05-01-19 no        03-01-19

Created on 2020-01-25 by the reprex package (v0.3.0)
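
The snippet above assumes a data frame `df` with `date` and `infection` columns, which the answer does not define. The following is a minimal self-contained sketch of the same idea, assuming the sample data from the question and loading only dplyr and tidyr rather than the whole tidyverse. The name `result` is illustrative and not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Sample data from the question; dates are kept as character strings (assumption).
df <- data.frame(
  date = c("01-01-18", "06-01-18", "07-01-18", "09-01-18", "01-01-19",
           "02-01-19", "03-01-19", "04-01-19", "05-01-19"),
  infection = c("no", "no", "yes", "no", "no", "yes", "yes", "no", "no"),
  stringsAsFactors = FALSE
)

result <- df %>%
  mutate(
    # Record the previous visit's date only when that visit tested "yes";
    # the first row has no previous visit, so it stays NA.
    last_infection = if_else(lag(infection) == "yes", lag(date), NA_character_)
  ) %>%
  # Carry the most recently recorded date down over the NA rows that follow.
  fill(last_infection)

print(result)
```

Because fill() fills downward by default, the rows before the first "yes" remain NA, which is what the question asks for.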

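【Comments】:

【Solution 2】:

We can create a grouping variable from a logical vector built on "infection" and use it to lag the column. Here we load only dplyr, and no other packages.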

library(dplyr)
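df1 %>%
   # grp increases by 1 at every "yes" row
   group_by(grp = cumsum(infection == "yes")) %>%
   # first date in each group; for grp >= 1 this is the date of the "yes" visit
   mutate(new = first(date)) %>%
   ungroup() %>%
   # shift down one row, then set every row up to and including the first "yes" to NA
   mutate(new = replace(lag(new), seq_len(match(1, grp)), NA)) %>%
   select(-grp)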
# A tibble: 9 x 4
#  date     infection last_infection new     
#  <chr>    <chr>     <chr>          <chr>   
#1 01-01-18 no        <NA>           <NA>    
#2 06-01-18 no        <NA>           <NA>    
#3 07-01-18 yes       <NA>           <NA>    
#4 09-01-18 no        07-01-18       07-01-18
#5 01-01-19 no        07-01-18       07-01-18
#6 02-01-19 yes       07-01-18       07-01-18
#7 03-01-19 yes       02-01-19       02-01-19
#8 04-01-19 no        03-01-19       03-01-19
#9 05-01-19 no        03-01-19       03-01-19

Data

df1 <- structure(list(date = c("01-01-18", "06-01-18", "07-01-18", "09-01-18", 
"01-01-19", "02-01-19", "03-01-19", "04-01-19", "05-01-19"), 
    infection = c("no", "no", "yes", "no", "no", "yes", "yes", 
    "no", "no"), last_infection = c(NA, NA, NA, "07-01-18", "07-01-18", 
    "07-01-18", "02-01-19", "03-01-19", "03-01-19")),
    class = "data.frame", row.names = c(NA, 
-9L))

【Comments】:

This is great, thanks @akrun! Can you explain how the first(date) part works?
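@kss What happens is that the value of the "grp" column increases by 1 whenever there is a "yes" in the "infection" column. So when we group_by, the first observation of "date" for "grp" 1 is in the third row, and everything above it is grp 0 (because "infection" is all "no" there). That is why I used first. Later, we use replace to set the leading values to NA, up to and including the row where the first "yes" appears.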
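Got it, that is very helpful. Appreciate it.
The downvote here is not clear. I showed a way that does not use more external packages. I remember another downvote on a different question here, given because another answer post showed some corner cases. If the same person is downvoting, I will report it, because the downvote here is unnecessary. Say someone else comes up with a one-liner tomorrow; would we downvote the other solutions?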
@kss In that case I would do df1 %>% group_by(grp = cumsum(infection == "yes")) %>% mutate(new = if(any(grp > 0)) first(date) else NA) %>% ungroup
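
The grouping logic explained in the comments can be inspected step by step. The following is a rough sketch, assuming the sample data from the question; the names `steps`, `group_first`, and `lagged` are illustrative and do not appear in the original answer.

```r
library(dplyr)

# Sample data from the question (without the expected last_infection column).
df1 <- data.frame(
  date = c("01-01-18", "06-01-18", "07-01-18", "09-01-18", "01-01-19",
           "02-01-19", "03-01-19", "04-01-19", "05-01-19"),
  infection = c("no", "no", "yes", "no", "no", "yes", "yes", "no", "no"),
  stringsAsFactors = FALSE
)

steps <- df1 %>%
  # Running count of "yes" rows: 0 0 1 1 1 2 3 3 3
  mutate(grp = cumsum(infection == "yes")) %>%
  group_by(grp) %>%
  # First date within each group; for grp >= 1 this is the "yes" visit itself.
  mutate(group_first = first(date)) %>%
  ungroup() %>%
  # Shift down by one row so each visit sees the value from the row before it.
  mutate(lagged = lag(group_first))

print(steps)

# Position of the first "yes" row; rows 1 to this index are blanked in Solution 2.
first_yes <- match(1, steps$grp)
print(first_yes)
```

Rows 2 and 3 of `lagged` still carry the date from group 0, which was never an infection date. That is why Solution 2 replaces every row up to and including the first "yes" with NA.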