【问题标题】:Identifying if time intervals overlap with multiple patient IDs, in r识别时间间隔是否与多个患者 ID 重叠,在 r
【发布时间】:2020-02-04 23:16:14
【问题描述】:

我有来自多个诊所的入院和出院数据,需要确定同一诊所的 2 名以上具有相同病原体 DNA 菌株的患者在入院时是否存在重叠。在 R 工作。

只有 5 个变量:研究 ID、日期、日期、DNA 菌株和诊所。每行是一次单独的访问,这意味着“ID”变量可以重复。我需要确定有多少“重叠”,这意味着具有相同病原体 DNA 的患者在同一诊所就诊时间。

这是一个(看似)简单的问题,并且在本网站上进行了很多讨论。当每个 ID 都是唯一的时,我能够识别重叠。我的具体问题是如何处理重复的 ID。在下面的代码中,我提供了示例数据和在每个 ID 唯一时有效的代码(您可以通过在代码的第一行将 seq(1:20)),each=2) 更改为 seq(1:40)),each=1) 来快速测试)。如果我使用重复的 ID 运行此代码,它会将所有访问标记为重叠 = 1,对于具有任何重叠的 ID,无论它是否是实际重叠的访问。

我知道重叠的日期间隔是很多讨论的话题,所以请相信我已经彻底查看了我能找到的所有内容,并且几乎除了最后一部分之外。非常感谢任何建议。

谢谢!

#Set globals
set.seed(8126) 
library(lubridate); library(data.table)

#Example Data
have<-data.frame(rep(paste0("k",seq(1:20)),each=2),sample(seq(as.Date('2015/01/01'), as.Date('2020/01/01'), by="day"), 40))
names(have)<-c("id","datein")
have$dateout<-have$datein+40 #arbitrarily add 40 days to admission date
have$dnastrain<-as.vector(replicate(20,rep(sample(c("Type 1","Type 2","Type 3"),1),each=2)))
have$clinic<-sample(c("A","B","C","D"),40, replace=TRUE)


#Code that works if each ID is unique
setDT(have); setkey(have,datein, dateout) #Convert to DT and set date in/out as keys
overlaps<-unique(foverlaps(have, have)[id!=i.id & dnastrain==i.dnastrain & clinic==i.clinic, id]) #find overlaps

have[,`:=`(overlap=0)][id %in% overlaps, overlap:=1][order(datein)] #Identify overlaps

【问题讨论】:

    标签: r


    【解决方案1】:

    这是一个 dplyr 方法。

    
    have$rownum <- 1:nrow(have)
    
     crossover <- have %>% 
       inner_join(have, by = c("clinic", "dnastrain"), suffix=(c(".a",".b"))) %>% 
       filter(datein.a<=dateout.b & 
              dateout.a>=datein.b &
              id.a!=id.b & 
              rownum.a < rownum.b
              )
    
    
    crossover
      id.a   datein.a  dateout.a dnastrain clinic rownum.a id.b   datein.b  dateout.b rownum.b
    1   k1 2017-11-02 2017-12-12    Type 3      B        1  k18 2017-10-03 2017-11-12       35
    2  k10 2016-02-15 2016-03-26    Type 3      A       19  k13 2016-03-22 2016-05-01       26
    3  k17 2017-08-06 2017-09-15    Type 2      C       33  k20 2017-09-02 2017-10-12       39
    

    【讨论】:

      【解决方案2】:

      我想出了一个使用purrr 的方法来做这件事有点不同:

      
      participant_in <- have$datein
      participant_out <- have$dateout
      
      # For each participant, create a vector of days at which they were in a clinic
      days <- map2(participant_in, participant_out, ~ seq(from = ymd(.x), to = ymd(.y), by = 'days'))
      ids <- map2(have$id, days, ~ rep(.x, length(.y)))
      
      days <- days %>% reduce(c)
      ids <- ids %>% unlist()
      
      participant_tib <- tibble(id = ids, day = days) %>%
        left_join(have, by = 'id') %>%
        select(-datein, -dateout) 
      
      participant_tib %>%
        group_by(day, dnastrain, clinic) %>%
        count() %>%
        arrange(desc(n))
      
      

      使用上面的 ^,您可以检查同一诊所在哪几天有几名患有相同菌株的参与者。从那里,您可以filter 在 n > 1 的日子里,并检查参与者的 ID。

      【讨论】:

        【解决方案3】:

        您可以使用联接代替%in%

        setDT(have, key=c("datein","dateout"))
        overlaps <- unique(foverlaps(have, have)[clinic==i.clinic, dnastrain==i.dnastrain & id!=i.id])
        
        cols <- copy(names(have))
        have[, overlap := 0][
            overlaps, on=cols, overlap := 1][
                order(datein)] 
        

        输出:

             id     datein    dateout dnastrain clinic overlap
         1:  k6 2015-01-01 2015-02-10    Type 1      D       0
         2:  k6 2015-01-20 2015-03-01    Type 1      D       0
         3:  k9 2015-04-09 2015-05-19    Type 2      B       1
         4: k11 2015-04-12 2015-05-22    Type 2      B       1
         5: k10 2015-04-19 2015-05-29    Type 3      C       0
         6: k10 2015-07-03 2015-08-12    Type 3      A       0
         7:  k4 2015-08-06 2015-09-15    Type 1      C       0
         8: k16 2015-08-26 2015-10-05    Type 2      A       0
         9: k13 2016-05-20 2016-06-29    Type 3      D       1
        10: k19 2016-05-26 2016-07-05    Type 3      D       1
        11: k15 2016-06-23 2016-08-02    Type 3      A       0
        12:  k7 2016-06-29 2016-08-08    Type 3      B       0
        13: k18 2016-07-18 2016-08-27    Type 2      B       0
        14: k13 2016-08-11 2016-09-20    Type 3      A       0
        15:  k1 2016-09-24 2016-11-03    Type 3      C       0
        16: k18 2016-09-29 2016-11-08    Type 2      B       0
        17: k12 2017-03-13 2017-04-22    Type 3      B       0
        18:  k7 2017-03-24 2017-05-03    Type 3      C       0
        19: k14 2017-05-28 2017-07-07    Type 3      B       0
        20:  k3 2017-06-05 2017-07-15    Type 2      B       0
        21: k17 2017-07-06 2017-08-15    Type 3      A       0
        22: k17 2017-09-08 2017-10-18    Type 3      B       1
        23:  k2 2017-09-30 2017-11-09    Type 3      B       1
        24: k15 2017-10-07 2017-11-16    Type 3      D       0
        25:  k8 2018-01-01 2018-02-10    Type 1      C       0
        26:  k5 2018-07-07 2018-08-16    Type 2      B       0
        27: k20 2018-07-18 2018-08-27    Type 1      A       0
        28:  k5 2018-08-30 2018-10-09    Type 2      B       0
        29: k16 2018-09-22 2018-11-01    Type 2      D       0
        30: k14 2018-09-24 2018-11-03    Type 3      B       0
        31:  k2 2018-11-02 2018-12-12    Type 3      A       1
        32: k12 2018-11-13 2018-12-23    Type 3      A       1
        33: k19 2018-11-21 2018-12-31    Type 3      D       0
        34:  k3 2018-11-30 2019-01-09    Type 2      A       0
        35:  k1 2018-12-14 2019-01-23    Type 3      A       1
        36:  k9 2019-03-13 2019-04-22    Type 2      D       0
        37:  k8 2019-07-18 2019-08-27    Type 1      C       1
        38: k20 2019-08-10 2019-09-19    Type 1      C       1
        39: k11 2019-08-30 2019-10-09    Type 2      B       0
        40:  k4 2019-09-04 2019-10-14    Type 1      B       0
             id     datein    dateout dnastrain clinic overlap
        

        数据:

        set.seed(8126) 
        library(data.table)
        
        have<-data.frame(rep(paste0("k",seq(1:20)),each=2),sample(seq(as.Date('2015/01/01'), as.Date('2020/01/01'), by="day"), 40))
        names(have)<-c("id","datein")
        have$dateout<-have$datein+40 #arbitrarily add 40 days to admission date
        have$dnastrain<-as.vector(replicate(20,rep(sample(c("Type 1","Type 2","Type 3"),1),each=2)))
        have$clinic<-sample(c("A","B","C","D"),40, replace=TRUE)
        

        【讨论】:

          【解决方案4】:

          这是我今天使用dplyrpurrr 学到的解决方案

          library(dplyr)
          library(purrr)
          library(lubridate)
          
          
          have %>% 
            mutate(interval=interval(datein,dateout)) %>%
            group_by(dnastrain,clinic) %>% 
            mutate(overlap = purrr::map_int(row_number(),
            ~+any(datein[.x] %within% interval[-.x] | dateout[.x] %within% interval[-.x])))
          

          【讨论】:

            猜你喜欢
            • 1970-01-01
            • 2014-11-22
            • 1970-01-01
            • 1970-01-01
            • 2016-06-02
            • 1970-01-01
            • 1970-01-01
            • 1970-01-01
            • 1970-01-01
            相关资源
            最近更新 更多