【Question title】: Aggregate dates into date intervals / periods in R
【Posted】: 2020-04-16 03:46:32
【Question description】:

I have the following sample data:

require(tibble)
sample_data <- tibble(
                      emp_name = c("john", "john", "john", "john","john","john", "john"), 
                      task = c("carpenter", "carpenter","carpenter", "painter", "painter", "carpenter", "carpenter"),
                      date_stamp = c("2019-01-01","2019-01-02", "2019-01-03", "2019-01-07", "2019-01-08", "2019-01-30", "2019-02-02")
                      )

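I need to aggregate these rows into date intervals.

The rule is: if there is no missing date between a date_stamp and the next date_stamp listed for the same attributes, the two rows should be aggregated into one interval. Otherwise, date_stamp_from and date_stamp_to should both equal date_stamp.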

desired_result <- tibble(
                  emp_name = c("john", "john","john", "john"),
                  task = c("carpenter","painter", "carpenter", "carpenter"),
                  date_stamp_from = c("2019-01-01","2019-01-07", "2019-01-30", "2019-02-02"),
                  date_stamp_to = c("2019-01-03","2019-01-08", "2019-01-30", "2019-02-02"),
                  count_dates = c(3,2,1,1)
)

What is the most efficient way to solve this? The original data set has roughly 10,000 records.

【Question discussion】:

    Tags: r aggregate-functions intervals


    【Solution 1】:

    We can use diff and cumsum to create the groups, then take the first and last date_stamp and the number of rows in each group.

    library(dplyr)
    
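    sample_data %>%
         # convert the character dates so that diff() returns gaps in days
         mutate(date_stamp = as.Date(date_stamp)) %>%
         # start a new group at the first row and after every gap of more than one day
         group_by(gr = cumsum(c(TRUE, diff(date_stamp) > 1))) %>%
         mutate(date_stamp_from = first(date_stamp), 
                date_stamp_to = last(date_stamp), 
                count_dates = n()) %>%
         # keep one row per group
         slice(1L) %>%
         ungroup() %>%
         select(-gr, -date_stamp)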
    
    # A tibble: 4 x 5
    #  emp_name task      date_stamp_from date_stamp_to count_dates
    #  <chr>    <chr>     <date>          <date>              <int>
    #1 john     carpenter 2019-01-01      2019-01-03              3
    #2 john     painter   2019-01-07      2019-01-08              2
    #3 john     carpenter 2019-01-30      2019-01-30              1
    #4 john     carpenter 2019-02-02      2019-02-02              1
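
To see how the grouping key in the answer comes about, the diff and cumsum steps can be run on their own. This is only an illustrative sketch using the dates from the question; the variable names dates, gaps and new_run are introduced here and are not part of the original answer.

```r
# Dates from the question's sample data
dates <- as.Date(c("2019-01-01", "2019-01-02", "2019-01-03",
                   "2019-01-07", "2019-01-08", "2019-01-30", "2019-02-02"))

# Gap in days between each date and the one before it
gaps <- as.numeric(diff(dates))
# 1 1 4 1 22 3

# TRUE marks the start of a new run: the first row, plus every row
# that lies more than one day after the previous row
new_run <- c(TRUE, gaps > 1)

# The running count of TRUE values gives one id per run of consecutive dates
gr <- cumsum(new_run)
# 1 1 1 2 2 3 4
```

The four distinct values of gr correspond to the four rows of the result above.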
    

    【Discussion】:

    • Thanks for this very elegant solution! It matches desired_result. My actual data set has additional emp_name and task values, so I edited the group_by to include them: "group_by(emp_name, task, gr = cumsum(c(TRUE, diff(date_stamp) > 1))) %>%"
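
One thing to watch with the group_by in the comment above: the expression for gr is evaluated over the whole date_stamp column before the data is split by emp_name and task, so the result depends on row order, and two runs of the same task may be merged when another task fills the days in between. A minimal sketch of an alternative computes the run id inside each emp_name/task pair instead. It assumes dplyr 1.0.0 or later (for the .groups argument) and that a gap in one employee's dates for a task should always start a new interval; multi_data is hypothetical data made up for illustration.

```r
library(dplyr)

# Hypothetical data: john paints on 2019-01-03, between two carpenter spells
multi_data <- tibble(
  emp_name   = c("john", "john", "john", "john", "mary", "mary"),
  task       = c("carpenter", "carpenter", "painter", "carpenter",
                 "painter", "painter"),
  date_stamp = c("2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04",
                 "2019-01-01", "2019-01-05")
)

result <- multi_data %>%
  mutate(date_stamp = as.Date(date_stamp)) %>%
  # sort so that diff() compares neighbouring dates within each pair
  arrange(emp_name, task, date_stamp) %>%
  group_by(emp_name, task) %>%
  # diff() now only sees the dates of one employee/task pair
  mutate(gr = cumsum(c(TRUE, diff(date_stamp) > 1))) %>%
  group_by(emp_name, task, gr) %>%
  summarise(date_stamp_from = min(date_stamp),
            date_stamp_to   = max(date_stamp),
            count_dates     = n(),
            .groups = "drop") %>%
  select(-gr)
```

On this data john's carpenter rows come out as two intervals (2019-01-01 to 2019-01-02, and 2019-01-04 on its own), whereas grouping by a gr computed over the whole column would put all three carpenter rows into one group. Whether that split is wanted depends on how the rule in the question is read.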