【Question title】: Count consecutive days by group
【Posted】: 2020-05-18 20:49:12
【Question】:

I would like to add a field that counts consecutive days within each group (identified by the id field). I start with this:

dt <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), date = c("1/01/2000", "2/01/2000", "2/01/2000", 
"5/01/2000", "6/01/2000", "7/01/2000", "8/01/2000", "13/01/2000", "14/01/2000", 
"18/01/2000", "19/01/2000", "21/01/2000", "25/01/2000", "26/01/2000", 
"30/01/2000", "31/01/2000")), .Names = c("id", "date"), 
row.names = c(NA, -16L), class = "data.frame")

and would like to obtain the following, preferably using data.table:

id date       cons
1  1/01/2000  0
1  2/01/2000  1
1  2/01/2000  1
1  5/01/2000  0
1  6/01/2000  1
1  7/01/2000  2
1  8/01/2000  3
2 13/01/2000  0
2 14/01/2000  1
2 18/01/2000  0
2 19/01/2000  1
2 21/01/2000  0
2 25/01/2000  0
2 26/01/2000  1
2 30/01/2000  0
2 31/01/2000  1

【Question comments】:

  • Can you explain why cons is 1 in row 3?

Tags: r date data.table


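【Solution 1】:

Here is one way using dplyr: start a new run whenever the gap to the previous date is more than one day, then count the days elapsed since the first date of each run. Rows are assumed to be sorted by date within each id.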

library(dplyr)

dt %>%
  mutate(date = as.Date(date, "%d/%m/%Y")) %>%
  group_by(id) %>%
  # .add = TRUE keeps the id grouping; it replaces the deprecated add argument (dplyr >= 1.0.0)
  group_by(grp = cumsum(c(TRUE, diff(date) > 1)), .add = TRUE) %>%
  mutate(cons = as.integer(date - first(date))) %>%
  ungroup() %>%
  select(-grp)

#      id date        cons
#   <int> <date>     <int>
# 1     1 2000-01-01     0
# 2     1 2000-01-02     1
# 3     1 2000-01-02     1
# 4     1 2000-01-05     0
# 5     1 2000-01-06     1
# 6     1 2000-01-07     2
# 7     1 2000-01-08     3
# 8     2 2000-01-13     0
# 9     2 2000-01-14     1
#10     2 2000-01-18     0
#11     2 2000-01-19     1
#12     2 2000-01-21     0
#13     2 2000-01-25     0
#14     2 2000-01-26     1
#15     2 2000-01-30     0
#16     2 2000-01-31     1

Since you tagged the question data.table, the same logic translates to data.table:

library(data.table)

setDT(dt)
dt[, date := as.Date(date, "%d/%m/%Y")]
dt[, cons := as.integer(date - first(date)), .(id, cumsum(c(TRUE, diff(date) > 1)))]
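
To see why the `cumsum(c(TRUE, diff(date) > 1))` grouping works, and why the duplicated date in row 3 gets cons = 1, here is a minimal base R sketch. The five dates and the variable names (`gaps`, `new_run`, `run_id`) are illustrative assumptions, not part of the original answer.

```r
# Illustrative dates only, already sorted; the third one is a duplicate.
dates <- as.Date(c("2000-01-01", "2000-01-02", "2000-01-02",
                   "2000-01-05", "2000-01-06"))

# Gap in days between each date and the one before it.
gaps <- as.integer(diff(dates))

# TRUE marks the start of a new run: the first row, or any gap above one day.
# A gap of 0 (duplicate date) does not start a new run.
new_run <- c(TRUE, gaps > 1)

# The cumulative sum of the flags gives every run its own identifier.
run_id <- cumsum(new_run)

# Days elapsed since the first date of the run each row belongs to.
days <- as.numeric(dates)
cons <- as.integer(days - ave(days, run_id, FUN = min))

print(data.frame(dates, run_id, cons))
```

A duplicated date stays in the same run and sits the same number of days from the run's first date as its twin, so both rows receive the same cons value.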

【Comments】:

    【Solution 2】:

    I may be overcomplicating things, but this should be a faster option if your dataset is large:

    setDT(dt)[, date := as.Date(date, format="%d/%m/%Y")]
    
    #identify consecutive dates
    dt[, c("cons", "d", "rr") := .(0L, 
        d <- c(FALSE, diff(date) == 1L), 
        rowid(rleid(id, d)))]
    
    #update rows with consecutive dates
    idx <- dt[(d), which=TRUE]
    set(dt, idx, "cons", dt[idx, rr])
    
    #handle identical dates
    ix <- dt[id==shift(id) & c(FALSE, diff(date)==0L), which=TRUE]
    set(dt, ix, "cons", dt[ix - 1L, cons])
    

    Output:

        id       date cons     d rr
     1:  1 2000-01-01    0 FALSE  1
     2:  1 2000-01-02    1  TRUE  1
     3:  1 2000-01-02    1 FALSE  1
     4:  1 2000-01-05    0 FALSE  2
     5:  1 2000-01-06    1  TRUE  1
     6:  1 2000-01-07    2  TRUE  2
     7:  1 2000-01-08    3  TRUE  3
     8:  2 2000-01-13    0 FALSE  1
     9:  2 2000-01-14    1  TRUE  1
    10:  2 2000-01-18    0 FALSE  1
    11:  2 2000-01-19    1  TRUE  1
    12:  2 2000-01-21    0 FALSE  1
    13:  2 2000-01-25    0 FALSE  2
    14:  2 2000-01-26    1  TRUE  1
    15:  2 2000-01-30    0 FALSE  1
    16:  2 2000-01-31    1  TRUE  1
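
The `rr` column above comes from `rowid(rleid(id, d))`. As a rough sketch of what those two data.table helpers do, the snippet below applies them to a made-up logical vector; the vector `d` and the names `run` and `pos` are assumptions chosen for illustration.

```r
library(data.table)

# Hypothetical flags: TRUE where a date is exactly one day after the previous row.
d <- c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)

# rleid() gives each run of identical consecutive values its own id.
run <- rleid(d)

# rowid() numbers the rows within each id, i.e. the position inside each run.
pos <- rowid(run)

print(data.table(d, run, pos))
```

Inside a run of TRUE values the position equals the number of consecutive days so far, which is why the answer copies `rr` into `cons` only for rows where `d` is TRUE and leaves the others at 0.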
    

    【Comments】:
