【Question title】: Speed up sapply to find if time difference exceeds threshold
【Posted】: 2014-06-04 01:01:34
【Question】:

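I need sapply to return a logical vector indicating whether a time difference exceeds a given threshold (in my case, a number of days set by a for loop).

Sample data (the dates have already been converted with as.Date):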

#DF called "held"
ID  Result  Start_Date
123 0   12/5/2013
123 0   12/12/2013
123 0   12/31/2013
123 0   4/22/2014
123 1   4/23/2014
654 0   9/3/2013
654 0   9/17/2013
98  0   10/18/2013
98  0   10/19/2013
98  2   12/20/2013
555 0   2/1/2014
555 0   3/2/2014
555 0   3/3/2014
66  1   1/12/2013

Code:

#empty vectors to be populated for plotting
a <- c()
b <- c()
for (n in 1:60){
#all rows where ID is not duplicated and Result is either 1 or 2 are FALSE
#all ID's where the difference between the min and max Start_Date (across multiple rows) exceeds the threshold are TRUE
  held$CHNS <-((!(!(held$ID %in% held$ID[duplicated(held$ID) | duplicated(held$ID, fromLast = TRUE)])&(held$Result %in% c(1,2)))) & (sapply(held$ID,function(x) max(held$Start_Date[held$ID == x]) - min(held$Start_Date[held$ID == x]) > n)))
#find percentage of Results 1 and 2 in entire CHNS population
  m <- length(held$Result[held$Result %in% c(1,2) & held$CHNS == TRUE])/nrow(held[held$CHNS == TRUE,])
#assign vector elements
  a[n] <- n
  b[n] <- m
}

The current code appears to be accurate, but it is extremely slow. Any tips on how to improve it? Should I even be using sapply? Thanks!

【Comments】:

    Tags: r for-loop sapply


    【Answer 1】:

    This vectorizes nicely, as follows.

    held <- read.table(text=
      'ID  Result  Start_Date
      123 0   12/5/2013
      123 0   12/12/2013
      123 0   12/31/2013
      123 0   4/22/2014
      123 1   4/23/2014
      654 0   9/3/2013
      654 0   9/17/2013
      98  0   10/18/2013
      98  0   10/19/2013
      98  2   12/20/2013
      555 0   2/1/2014
      555 0   3/2/2014
      555 0   3/3/2014
      66  1   1/12/2013', header=TRUE)
    
    held$Start_Date <- as.Date(held$Start_Date, '%m/%d/%Y')
    
    # Add a column giving the number of days spanned for the ID
    held$date.diff <- with(held, {
      ndays <- tapply(Start_Date, ID, function(x) diff(range(x)))
      ndays[match(ID, names(ndays))]
    })
    
    sapply(1:60, function(n) {
      with(held, {
        # An ID counts as unduplicated only if it occurs exactly once, as in the question
        rule1 <- !(ID %in% ID[duplicated(ID)]) & Result %in% 1:2
        rule2 <- date.diff  > n
        outcome <- !rule1 & rule2
        sum(outcome & Result %in% 1:2) / sum(outcome)
      })
    })
    
    #  [1] 0.1538462 0.1538462 0.1538462 0.1538462 0.1538462 0.1538462 0.1538462
    #  [8] 0.1538462 0.1538462 0.1538462 0.1538462 0.1538462 0.1538462 0.1818182
    # [15] 0.1818182 0.1818182 0.1818182 0.1818182 0.1818182 0.1818182 0.1818182
    # [22] 0.1818182 0.1818182 0.1818182 0.1818182 0.1818182 0.1818182 0.1818182
    # [29] 0.1818182 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000
    # [36] 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000
    # [43] 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000
    # [50] 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000 0.2500000
    # [57] 0.2500000 0.2500000 0.2500000 0.2500000
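
The same idea can be pushed one step further in base R. The following is a minimal sketch, not the answerer's code: it assumes the sample data from the question, uses `ave()` to attach each ID's date span to every row, and uses `outer()` to compare that column against all 60 thresholds at once. The names `hit`, `keep`, and `exceeds` are illustrative, and the first rule follows the question's wording (an ID counts as unduplicated only if it occurs exactly once).

```r
# Sample data from the question, rebuilt so the block runs on its own
held <- data.frame(
  ID = c(rep(123, 5), rep(654, 2), rep(98, 3), rep(555, 3), 66),
  Result = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 1),
  Start_Date = as.Date(
    c("12/5/2013", "12/12/2013", "12/31/2013", "4/22/2014", "4/23/2014",
      "9/3/2013", "9/17/2013", "10/18/2013", "10/19/2013", "12/20/2013",
      "2/1/2014", "3/2/2014", "3/3/2014", "1/12/2013"),
    format = "%m/%d/%Y")
)

# ave() applies the function within each ID and returns one value per row
held$date.diff <- ave(as.numeric(held$Start_Date), held$ID,
                      FUN = function(x) diff(range(x)))

hit  <- held$Result %in% 1:2                          # rows with Result 1 or 2
dup  <- held$ID %in% held$ID[duplicated(held$ID)]     # IDs occurring more than once
keep <- !(!dup & hit)                                 # rows not excluded by the first rule

# nrow(held) x 60 logical matrix: does each row's span exceed each threshold?
exceeds <- outer(held$date.diff, 1:60, ">")

# keep and hit have one element per row, so they recycle down each column
chns <- exceeds & keep
b <- colSums(chns & hit) / colSums(chns)
```

Here `b[n]` is the share of rows with Result 1 or 2 among the rows flagged for threshold `n`. If no row is flagged for some threshold, the division yields `NaN`, just as in the loop version.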
    

    Quick benchmark:

    microbenchmark(jbaums(), userNaN())
    
    # Unit: milliseconds
    #         expr        min         lq     median         uq        max neval
    #     jbaums()   1.994695   2.110046   2.164258   2.223137   3.685502   100
    #    userNaN() 110.448790 112.985603 114.911328 117.714080 489.052823   100
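
The definitions of `jbaums()` and `userNaN()` are not shown above. As a rough sketch of how such a comparison might be set up with base R only, the wrappers below (`loop_version()` and `vectorised_version()`, both hypothetical names) reimplement the two approaches, check that they agree, and time them with `system.time()` instead of microbenchmark. The sample data from the question is assumed.

```r
held <- data.frame(
  ID = c(rep(123, 5), rep(654, 2), rep(98, 3), rep(555, 3), 66),
  Result = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 1),
  Start_Date = as.Date(
    c("12/5/2013", "12/12/2013", "12/31/2013", "4/22/2014", "4/23/2014",
      "9/3/2013", "9/17/2013", "10/18/2013", "10/19/2013", "12/20/2013",
      "2/1/2014", "3/2/2014", "3/3/2014", "1/12/2013"),
    format = "%m/%d/%Y")
)

# Hypothetical wrapper: recomputes each row's span for every threshold,
# as the question's loop does
loop_version <- function(held, thresholds = 1:60) {
  hit <- held$Result %in% c(1, 2)
  dup <- held$ID %in% held$ID[duplicated(held$ID)]
  sapply(thresholds, function(n) {
    span_ok <- sapply(held$ID, function(x) {
      d <- held$Start_Date[held$ID == x]
      as.numeric(max(d) - min(d)) > n
    })
    chns <- !(!dup & hit) & span_ok
    sum(chns & hit) / sum(chns)
  })
}

# Hypothetical wrapper: computes the span once, then only compares
vectorised_version <- function(held, thresholds = 1:60) {
  hit  <- held$Result %in% c(1, 2)
  dup  <- held$ID %in% held$ID[duplicated(held$ID)]
  span <- ave(as.numeric(held$Start_Date), held$ID,
              FUN = function(x) max(x) - min(x))
  keep <- !(!dup & hit)
  sapply(thresholds, function(n) {
    chns <- keep & span > n
    sum(chns & hit) / sum(chns)
  })
}

slow <- loop_version(held)
fast <- vectorised_version(held)

# Both versions should return the same 60 values before timings are compared
stopifnot(isTRUE(all.equal(slow, fast)))

print(system.time(for (i in 1:20) loop_version(held)))
print(system.time(for (i in 1:20) vectorised_version(held)))
```

Checking equality first matters because a faster version is only useful if it returns the same numbers. Timings will vary by machine and data size.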
    

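    【Comments】:

    • Excellent. Accepted, because I would rather do this in base R. Thanks!

    【Answer 2】:

    For one thing, I would first find the difference for each ID outside the loop. Then, if needed, just loop over 1:60 for the difference check. I would also use dplyr to compute the differences, which should simplify the code considerably and probably make it faster. Using your example: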

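    library(dplyr)

    # Compute each ID's date span (in days) once, outside the loop
    by_id <- group_by(held, ID)
    Diff <- summarise(by_id, Difference = as.numeric(max(Start_Date) - min(Start_Date)))

    a <- 1:60
    b <- vector('numeric', 60)

    for (n in 1:60) {
      # Fraction of IDs whose span exceeds n days
      b[n] <- mean(Diff$Difference > n)
    }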
    

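    This should give you a vector b holding, for each level of n, the fraction of IDs in the population whose difference is greater than n.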
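
For readers limited to base R, the same per-ID calculation can be sketched without dplyr. This is an illustrative equivalent, not part of the answer: `span` is a hypothetical name, and the sample data from the question is assumed. Note that it measures the fraction of IDs whose span exceeds `n`, which is not the same quantity as the row-level ratio in the question.

```r
held <- data.frame(
  ID = c(rep(123, 5), rep(654, 2), rep(98, 3), rep(555, 3), 66),
  Start_Date = as.Date(
    c("12/5/2013", "12/12/2013", "12/31/2013", "4/22/2014", "4/23/2014",
      "9/3/2013", "9/17/2013", "10/18/2013", "10/19/2013", "12/20/2013",
      "2/1/2014", "3/2/2014", "3/3/2014", "1/12/2013"),
    format = "%m/%d/%Y")
)

# One span (in days) per ID, computed once
span <- tapply(as.numeric(held$Start_Date), held$ID,
               function(x) max(x) - min(x))

# Fraction of IDs whose span exceeds each threshold n
a <- 1:60
b <- sapply(a, function(n) mean(span > n))
```

With five IDs in the sample, each element of `b` is a multiple of 0.2.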

    【Comments】:

    • +1, but I'm currently restricted to base R, so I accepted the earlier answer.