【Question title】: Check if any dates in a group are within specific time intervals for that group in R
【Posted】: 2019-04-22 19:28:17
【Question】:

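I would like to create a new variable that indicates whether visit_date falls within any of the date ranges listed for that id.

I have used the code below to compare row by row, but I would like to extend it so that every row for an id is compared against all of the interval rows listed for that id.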

df <- df %>%
  group_by(id) %>%
  mutate(between_any = ifelse(visit_date >= start & visit_date <= end, 1, 0))  # ifelse() needs both a yes and a no value

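I have also tried creating an interval variable before the mutate and using crossing(visit_date, interval), but I could not get crossing to work with date objects.

Here is some sample data: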

df <- data.frame(id = c("a","a","a","a","a","b","b","b"),
                 visit_date = c("2001-08-22","2001-09-21","2001-10-30","2001-11-10","2001-12-20","2002-12-22", "2003-04-30","2003-05-10"),
                 start = c(NA,"2001-09-21",NA,"2001-11-10",NA,"2002-12-22", "2003-04-30",NA),
                 end = c(NA, "2001-11-01",NA,"2001-11-10",NA,"2002-12-22","2003-06-01",NA))

> df
id visit_date    start        end
a 2001-08-22       <NA>       <NA>
a 2001-09-21 2001-09-21 2001-11-01
a 2001-10-30       <NA>       <NA>
a 2001-11-10 2001-11-10 2001-11-10
a 2001-12-20       <NA>       <NA>
b 2002-12-22 2002-12-22 2002-12-22
b 2003-04-30 2003-04-30 2003-06-01
b 2003-05-10       <NA>       <NA>

My desired output is as follows:

id visit_date      start       end   between_any
a 2001-08-22       <NA>       <NA>      0
a 2001-09-21 2001-09-21 2001-11-01      1
a 2001-10-30       <NA>       <NA>      1
a 2001-11-10 2001-11-10 2001-11-10      1
a 2001-12-20       <NA>       <NA>      0
b 2002-12-22 2002-12-22 2002-12-22      1
b 2003-04-30 2003-04-30 2003-06-01      1
b 2003-05-10       <NA>       <NA>      1

Thanks in advance!

【Comments】:

    Tags: r dplyr lubridate


    【Solution 1】:

    The inrange function in the data.table package does exactly this...

    library(data.table)
    library(dplyr)  # needed for %>%, group_by() and mutate()
    
    df <- df %>%
      group_by(id) %>%
      mutate(between_any = as.numeric((inrange(visit_date, start, end))))
    
    #> df
    #  id visit_date      start        end between_any
    #1  a 2001-08-22       <NA>       <NA>           0
    #2  a 2001-09-21 2001-09-21 2001-11-01           1
    #3  a 2001-10-30       <NA>       <NA>           1
    #4  a 2001-11-10 2001-11-10 2001-11-10           1
    #5  a 2001-12-20       <NA>       <NA>           0
    #6  b 2002-12-22 2002-12-22 2002-12-22           1
    #7  b 2003-04-30 2003-04-30 2003-06-01           1
    #8  b 2003-05-10       <NA>       <NA>           1
    

    In data.table form...

    dt <- setDT(df)      
    dt[, between_any := inrange(visit_date, start, end), 
         by = id]
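
For readers who want the logic spelled out, here is a minimal base-R sketch of the same per-group check, using no packages. It is an illustration, not data.table's implementation: `in_any_interval()` is a hypothetical helper written for this example, and the sketch assumes the date columns have first been converted with `as.Date()`.

```r
# Sample data from the question, with the date columns stored as Date.
df <- data.frame(
  id = c("a", "a", "a", "a", "a", "b", "b", "b"),
  visit_date = as.Date(c("2001-08-22", "2001-09-21", "2001-10-30", "2001-11-10",
                         "2001-12-20", "2002-12-22", "2003-04-30", "2003-05-10")),
  start = as.Date(c(NA, "2001-09-21", NA, "2001-11-10", NA,
                    "2002-12-22", "2003-04-30", NA)),
  end = as.Date(c(NA, "2001-11-01", NA, "2001-11-10", NA,
                  "2002-12-22", "2003-06-01", NA))
)

# Hypothetical helper (not part of any package): TRUE where x[i] lies inside
# at least one closed interval [lower[j], upper[j]].
in_any_interval <- function(x, lower, upper) {
  keep <- !is.na(lower) & !is.na(upper)  # skip rows that define no interval
  lower <- lower[keep]
  upper <- upper[keep]
  vapply(seq_along(x),
         function(i) any(x[i] >= lower & x[i] <= upper),
         logical(1))
}

# Apply the helper separately to the rows of each id.
df$between_any <- NA_real_
for (rows in split(seq_len(nrow(df)), df$id)) {
  df$between_any[rows] <- as.numeric(
    in_any_interval(df$visit_date[rows], df$start[rows], df$end[rows])
  )
}

print(df)
```

On the sample data, `between_any` comes out as 0, 1, 1, 1, 0, 1, 1, 1, which matches the desired output in the question. The helper compares every visit against every interval in the group, so it is quadratic per group; that is fine for small groups but may be slow for very large ones.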
    

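    【Comments】:

    • This is perfect! I had not come across inrange in any of my research. Thank you so much.
    【Solution 2】:

    My answer is not as "pretty" as I would like, but it gets you where you want to go.

    First I convert your date columns to the Date class: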

    library(lubridate)
    library(dplyr)
    library(tibble)
    library(tidyr)
    library(purrr)
    
    df <- data.frame(id = c("a","a","a","a","a","b","b","b"),
                     visit_date = c("2001-08-22","2001-09-21","2001-10-30","2001-11-10","2001-12-20","2002-12-22", "2003-04-30","2003-05-10"),
                     start = c(NA,"2001-09-21",NA,"2001-11-10",NA,"2002-12-22", "2003-04-30",NA),
                     end = c(NA, "2001-11-01",NA,"2001-11-10",NA,"2002-12-22","2003-06-01",NA)) %>%
      mutate_at(-1,as.Date)
    
    > df
      id visit_date      start        end
    1  a 2001-08-22       <NA>       <NA>
    2  a 2001-09-21 2001-09-21 2001-11-01
    3  a 2001-10-30       <NA>       <NA>
    4  a 2001-11-10 2001-11-10 2001-11-10
    5  a 2001-12-20       <NA>       <NA>
    6  b 2002-12-22 2002-12-22 2002-12-22
    7  b 2003-04-30 2003-04-30 2003-06-01
    8  b 2003-05-10       <NA>       <NA>
    

    Next I build the list of intervals for each group:

    df_intervals <- df %>% 
      mutate_at(-1,as.Date) %>%
      filter(!is.na(start),
             !is.na(end)) %>%
      mutate(interval = start %--% end) %>%
      select(id,interval) %>%
      group_by(id)
    
    > df_intervals
    # A tibble: 4 x 2
    # Groups:   id [2]
      id    interval                      
      <fct> <S4: Interval>                
    1 a     2001-09-21 UTC--2001-11-01 UTC
    2 a     2001-11-10 UTC--2001-11-10 UTC
    3 b     2002-12-22 UTC--2002-12-22 UTC
    4 b     2003-04-30 UTC--2003-06-01 UTC
    

    Finally, I join the interval data back to the original data by id and check whether visit_date falls within any of the intervals:

    df_output <- df %>% as.tbl() %>%
      inner_join(df_intervals) %>%
      mutate(between_any = map2_lgl(visit_date,interval,~ .x >= int_start(.y) & .x <= int_end(.y))) %>%
      group_by(id,visit_date,start,end) %>%
      summarise(between_any = as.numeric(any(between_any)))
    
    > df_output
    # A tibble: 8 x 5
    # Groups:   id, visit_date, start [8]
      id    visit_date start      end        between_any
      <fct> <date>     <date>     <date>           <dbl>
    1 a     2001-08-22 NA         NA                   0
    2 a     2001-09-21 2001-09-21 2001-11-01           1
    3 a     2001-10-30 NA         NA                   1
    4 a     2001-11-10 2001-11-10 2001-11-10           1
    5 a     2001-12-20 NA         NA                   0
    6 b     2002-12-22 2002-12-22 2002-12-22           1
    7 b     2003-04-30 2003-04-30 2003-06-01           1
    8 b     2003-05-10 NA         NA                   1
    

    【Comments】:

    • Thanks for all your help, @Wil!
    【Solution 3】:

    Another possibility could be:

    df %>% 
     rowid_to_column() %>%
     full_join(df %>%
                filter(!is.na(start) & !is.na(end)) %>%
                mutate(interval = interval(ymd(start), ymd(end))) %>%
                select(id, interval), by = c("id" = "id")) %>%
     group_by(rowid, id) %>%
     summarise(between_any = max(ymd(visit_date) %within% interval * 1)) %>%
     left_join(df %>%
                rowid_to_column(), by = c("rowid" = "rowid",
                                          "id" = "id")) %>%
     ungroup() %>%
     select(-rowid)
      id    between_any visit_date start      end       
      <fct>       <dbl> <fct>      <fct>      <fct>     
    1 a               0 2001-08-22 <NA>       <NA>      
    2 a               1 2001-09-21 2001-09-21 2001-11-01
    3 a               1 2001-10-30 <NA>       <NA>      
    4 a               1 2001-11-10 2001-11-10 2001-11-10
    5 a               0 2001-12-20 <NA>       <NA>      
    6 b               1 2002-12-22 2002-12-22 2002-12-22
    7 b               1 2003-04-30 2003-04-30 2003-06-01
    8 b               1 2003-05-10 <NA>       <NA> 
    

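    Here, it first creates the interval variable and performs a full join on "id". Second, it checks whether "visit_date" falls within any of the intervals for each "id" and "rowid". Finally, it joins the results back to the original data.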
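
The join-then-collapse steps described above can also be sketched in base R, which may make each stage easier to inspect. This is an illustrative rewrite under stated assumptions, not the answer's code: the `row` column and the names `visits`, `intervals`, and `pairs` are introduced here for the example, and the date columns are assumed to be `Date` objects.

```r
# Sample data from the question, with the date columns stored as Date.
df <- data.frame(
  id = c("a", "a", "a", "a", "a", "b", "b", "b"),
  visit_date = as.Date(c("2001-08-22", "2001-09-21", "2001-10-30", "2001-11-10",
                         "2001-12-20", "2002-12-22", "2003-04-30", "2003-05-10")),
  start = as.Date(c(NA, "2001-09-21", NA, "2001-11-10", NA,
                    "2002-12-22", "2003-04-30", NA)),
  end = as.Date(c(NA, "2001-11-01", NA, "2001-11-10", NA,
                  "2002-12-22", "2003-06-01", NA))
)

# 1. Tag each visit with its row number so results can be mapped back later.
visits <- data.frame(row = seq_len(nrow(df)),
                     id = df$id,
                     visit_date = df$visit_date)

# 2. Keep only the rows that actually define an interval.
intervals <- df[!is.na(df$start) & !is.na(df$end), c("id", "start", "end")]

# 3. Join on id: each visit is paired with every interval of the same id.
pairs <- merge(visits, intervals, by = "id")
pairs$hit <- pairs$visit_date >= pairs$start & pairs$visit_date <= pairs$end

# 4. A row is flagged if at least one of its pairs is a hit. Visits whose id
#    has no interval at all never appear in `pairs`, so they end up as 0.
df$between_any <- as.numeric(visits$row %in% pairs$row[pairs$hit])

print(df)
```

Carrying an explicit row identifier through the join is what allows the result to be collapsed back to one value per original row, even when the same `visit_date` appears more than once within an id.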

    【Comments】:

    • @Wil Thanks for noticing, I have updated the post :)