R: data.table: aggregation using referencing over time

[Question title]: R: data.table: aggregation using referencing over time
[Posted]: 2019-04-29 07:01:53
[Question]:

I have a dataset of periods

library(data.table)

active <- data.table(
  id  = c(1, 1, 2, 3),
  beg = as.POSIXct(c("2018-01-01 01:10:00", "2018-01-01 01:50:00", "2018-01-01 01:50:00", "2018-01-01 01:50:00")),
  end = as.POSIXct(c("2018-01-01 01:20:00", "2018-01-01 02:00:00", "2018-01-01 02:00:00", "2018-01-01 02:00:00"))
)
> active
   id                 beg                 end 
1:  1 2018-01-01 01:10:00 2018-01-01 01:20:00 
2:  1 2018-01-01 01:50:00 2018-01-01 02:00:00    
3:  2 2018-01-01 01:50:00 2018-01-01 02:00:00    
4:  3 2018-01-01 01:50:00 2018-01-01 02:00:00

during which an id was active. I would like to aggregate across ids and determine, for every point in

time <- data.table(seq(from=min(active$beg),to=max(active$end),by="mins"))

the number of ids that are inactive and the average number of minutes until they become active. That is, ideally, the table would look like

>ans
                   time  inactive av.time
 1: 2018-01-01 01:10:00         2      30
 2: 2018-01-01 01:11:00         2      29
...
50: 2018-01-01 02:00:00         0       0

I believe this can be done with data.table, but I can't figure out the syntax for getting the time differences.

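[Comments]:

Possibly related: stackoverflow.com/q/52614468/2204410
Thanks! It helps with the first part but leaves the second part open. Do you know of anything that could help with that?

Tags: r data.table aggregation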

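[Solution 1]:

Using dplyr, we can create the Cartesian product of time and active by joining on a dummy variable. The definitions of inactive and av.time may not be exactly what you are looking for, but this should get you started. If your data is very large, I agree that data.table would be a better way to handle it.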

library(tidyverse)

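time %>% 
  # data.table() names the unnamed column V1; rename it so `time` below refers to a column
  rename(time = V1) %>% 
  mutate(dummy = TRUE) %>% 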
  inner_join({
    active %>% 
      mutate(dummy = TRUE)
    #join by the dummy variable to get the Cartesian product
  }, by = c("dummy" = "dummy")) %>% 
  select(-dummy) %>% 
  #define what makes an id inactive and the time until it becomes active
  mutate(inactive = time < beg | time > end,
         TimeUntilActive = ifelse(beg > time, difftime(beg, time, units = "mins"), NA)) %>% 
  #group by time and summarise
  group_by(time) %>% 
  summarise(inactive = sum(inactive),
            av.time = mean(TimeUntilActive, na.rm = TRUE))

# A tibble: 51 x 3
   time                inactive av.time
   <dttm>                 <int>   <dbl>
 1 2018-01-01 01:10:00        3      40
 2 2018-01-01 01:11:00        3      39
 3 2018-01-01 01:12:00        3      38
 4 2018-01-01 01:13:00        3      37
 5 2018-01-01 01:14:00        3      36
 6 2018-01-01 01:15:00        3      35
 7 2018-01-01 01:16:00        3      34
 8 2018-01-01 01:17:00        3      33
 9 2018-01-01 01:18:00        3      32
10 2018-01-01 01:19:00        3      31

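[Comments]:

Thanks, the dummy solution is interesting and I will look into it. However, the data is indeed large, spanning several years and a few thousand ids. If anyone has a data.table solution, that would help a lot.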
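
For reference, here is a rough data.table sketch of the same Cartesian-product idea. It is not from the thread. It assumes that "inactive" is meant per id (an id is inactive at a minute if none of its periods covers that minute) and that the waiting time is the gap to that id's next beg. The names minutes, cross, per_id, covers and wait are made up for illustration, and tz = "UTC" is added only to keep the timestamps unambiguous.

```r
library(data.table)

active <- data.table(
  id  = c(1, 1, 2, 3),
  beg = as.POSIXct(c("2018-01-01 01:10:00", "2018-01-01 01:50:00",
                     "2018-01-01 01:50:00", "2018-01-01 01:50:00"), tz = "UTC"),
  end = as.POSIXct(c("2018-01-01 01:20:00", "2018-01-01 02:00:00",
                     "2018-01-01 02:00:00", "2018-01-01 02:00:00"), tz = "UTC")
)
minutes <- seq(min(active$beg), max(active$end), by = "mins")

# Pair every period with every minute (Cartesian product), like the dummy join
cross <- active[rep(seq_len(nrow(active)), times = length(minutes))]
cross[, time := rep(minutes, each = nrow(active))]

# Per period: does it cover `time`, and how many minutes until it starts?
cross[, covers := beg <= time & time <= end]
cross[, wait := as.numeric(difftime(beg, time, units = "mins"))]
cross[wait <= 0, wait := NA_real_]  # only periods starting in the future count

# One row per (time, id): inactive if no period of that id covers `time`
per_id <- cross[, .(inactive = !any(covers),
                    wait = if (all(is.na(wait))) NA_real_ else min(wait, na.rm = TRUE)),
                by = .(time, id)]

# Aggregate over ids; only inactive ids contribute to the average
ans <- per_id[, .(inactive = sum(inactive),
                  av.time = mean(wait[inactive], na.rm = TRUE)),
              by = time]
```

On the sample data this gives inactive = 2 and av.time = 40 at 01:10, and av.time is NaN once no id is waiting (for example at 02:00). Because it materializes every period-minute pair, it will probably be too memory-hungry for several years of data; a rolling or non-equi join would likely be the next thing to try there.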