【Question title】: Find duration and create a threshold if the gap exceeds a certain time
【Posted】: 2020-04-01 00:25:39
【Question】:

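Goal:

I have a dataset, df, and I want to group by ID and find the duration of each run of rows that meets these conditions: Focus == True, Read == True, and ID != "". However, I do not want to aggregate by ID, because each run should stay in its own separate "chunk", as in the desired output below.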

ID            Date                   Focus        Read


A             1/2/2020 5:00:00 AM    TRUE         TRUE
A             1/2/2020 5:00:05 AM    TRUE         TRUE
              1/3/2020 6:00:00 AM    TRUE
              1/3/2020 6:00:05 AM    TRUE         
B             1/4/2020 7:00:00 AM    TRUE         TRUE
B             1/4/2020 7:00:05 AM    TRUE         TRUE
B             1/4/2020 7:20:00 AM    TRUE         TRUE
B             1/4/2020 7:20:10 AM    TRUE         TRUE
A             1/2/2020 7:30:00 AM    TRUE         TRUE
A             1/2/2020 7:30:20 AM    TRUE         TRUE

I want this output:

ID                          Duration              Start                    End

A                           5 sec                 1/2/2020 5:00:00 AM     1/2/2020 5:00:05 AM
B                           5 sec                 1/4/2020 7:00:00 AM     1/4/2020 7:00:05 AM    
B                           10 sec                1/4/2020 7:20:00 AM     1/4/2020 7:20:10 AM
A                           20 sec                1/2/2020 7:30:00 AM     1/2/2020 7:30:20 AM     

dput output:

structure(list(ID = structure(c(2L, 2L, 1L, 1L, 3L, 3L, 3L, 3L, 
2L, 2L), .Label = c("", "A", "B"), class = "factor"), Date = structure(c(1L, 
2L, 5L, 6L, 7L, 8L, 9L, 10L, 3L, 4L), .Label = c("1/2/2020 5:00:00 AM", 
"1/2/2020 5:00:05 AM", "1/2/2020 7:30:00 AM", "1/2/2020 7:30:20 AM", 
"1/3/2020 6:00:00 AM", "1/3/2020 6:00:05 AM", "1/4/2020 7:00:00 AM", 
"1/4/2020 7:00:05 AM", "1/4/2020 7:20:00 AM", "1/4/2020 7:20:10 AM"
), class = "factor"), Focus = structure(c(1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L), .Label = "True ", class = "factor"), Read = structure(c(2L, 
2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("", "True "), class = "factor")), class =    "data.frame", row.names = c(NA, 
-10L))

The code below works, but it aggregates by ID. How can I keep the chunks separate instead?

 library(dplyr)
 library(lubridate)
 df %>% 
 filter(as.logical(trimws(Read)), as.logical(trimws(Focus))) %>%
 mutate(Date = mdy_hms(Date)) %>%
 group_by(ID) %>% 
 summarise(Duration = difftime(last(Date), first(Date), units = "secs"))
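
To make the problem concrete, here is a minimal base-R sketch of what grouping by ID alone computes. The `obs` data frame is a hand-typed copy of the sample rows that pass the Focus/Read filter, and the UTC time zone is an assumption made only so the example is reproducible.

```r
# Rows from the sample data that pass the Focus/Read filter
# (typed by hand; the time zone is an assumption for reproducibility).
obs <- data.frame(
  ID = c("A", "A", "B", "B", "B", "B", "A", "A"),
  Date = as.POSIXct(
    c("2020-01-02 05:00:00", "2020-01-02 05:00:05",
      "2020-01-04 07:00:00", "2020-01-04 07:00:05",
      "2020-01-04 07:20:00", "2020-01-04 07:20:10",
      "2020-01-02 07:30:00", "2020-01-02 07:30:20"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Seconds since the epoch, so differences are plain numbers.
secs <- as.numeric(obs$Date)

# One value per ID: last timestamp minus first, in seconds.
per_id <- tapply(secs, obs$ID, function(s) s[length(s)] - s[1])
per_id
# A: 9020, B: 1210
```

Each ID collapses into a single span (A runs from 5:00:00 to 7:30:20, B from 7:00:00 to 7:20:10), which is why the separate 5-, 10- and 20-second chunks disappear.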

Any suggestions are welcome.

【Comments】:

  • So, ignore the blank IDs? Is there a reason you are using strings such as "True" instead of R's native logical values TRUE/FALSE?
  • Yes, ignore the blank IDs. I can use TRUE/FALSE. I will edit this.
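
As a rough illustration of the point raised in these comments, the padded "True " strings can be converted to native logicals once, up front. This is only a sketch: `flags` is a hypothetical three-row miniature of the question's columns, and `to_flag` is a helper name invented here.

```r
# Hypothetical miniature of the question's columns: padded "True " strings and blanks.
flags <- data.frame(
  ID    = c("A", "", "B"),
  Focus = c("True ", "True ", "True "),
  Read  = c("True ", "", "True "),
  stringsAsFactors = TRUE
)

# Factor -> character -> trimmed -> logical; blank strings become NA.
to_flag <- function(x) as.logical(trimws(as.character(x)))

flags$Focus <- to_flag(flags$Focus)
flags$Read  <- to_flag(flags$Read)

# `%in% TRUE` treats NA as FALSE, so rows with blank flags are dropped.
keep <- flags$Focus %in% TRUE & flags$Read %in% TRUE &
  as.character(flags$ID) != ""
clean <- flags[keep, ]
```

With real logical columns, later filters can test `Focus` and `Read` directly instead of re-parsing strings each time.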

Tags: r dplyr tidyverse


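【Solution 1】:

We can drop the rows with blank values in Read and Focus, convert Date to a date-time, create separate groups using a 4-minute threshold, and take the time difference between the last and first values in each group.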

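library(dplyr)

df %>% 
  # keep rows where both flags parse to TRUE; blank strings become NA and are dropped
  filter(as.logical(trimws(Read)), as.logical(trimws(Focus))) %>%
  mutate(Date = lubridate::mdy_hms(Date)) %>% 
  # start a new group whenever the gap to the previous row exceeds 4 minutes
  group_by(grp = cumsum(abs(difftime(Date, lag(Date, 
                            default = first(Date)), units = "mins")) > 4)) %>%
  # one row per group: its ID, duration, and first and last timestamps
  summarise(ID = first(ID),
            Duration = difftime(last(Date), first(Date), units = "secs"), 
            Start = first(Date), 
            End = last(Date)) %>%
  select(-grp)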


#  ID    Duration Start               End                
#  <fct> <drtn>   <dttm>              <dttm>             
#1 A      5 secs  2020-01-02 05:00:00 2020-01-02 05:00:05
#2 B      5 secs  2020-01-04 07:00:00 2020-01-04 07:00:05
#3 B     10 secs  2020-01-04 07:20:00 2020-01-04 07:20:10
#4 A     20 secs  2020-01-02 07:30:00 2020-01-02 07:30:20
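
The grouping step can also be unpacked in base R to show how the `cumsum` trick assigns chunk numbers. The sketch below is not the answer's code: it uses a hand-typed copy of the filtered rows, assumes UTC, reports the duration as plain seconds in a column named `Duration_secs`, and adds one extra rule that the answer does not have (a new chunk also starts when the ID changes), which is an assumption about what one might want for data where two IDs fall within 4 minutes of each other.

```r
# Rows from the sample data that pass the Focus/Read filter
# (typed by hand; the time zone is an assumption for reproducibility).
obs <- data.frame(
  ID = c("A", "A", "B", "B", "B", "B", "A", "A"),
  Date = as.POSIXct(
    c("2020-01-02 05:00:00", "2020-01-02 05:00:05",
      "2020-01-04 07:00:00", "2020-01-04 07:00:05",
      "2020-01-04 07:20:00", "2020-01-04 07:20:10",
      "2020-01-02 07:30:00", "2020-01-02 07:30:20"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Absolute gap to the previous row in minutes; the first row gets 0.
secs <- as.numeric(obs$Date)
gap_mins <- c(0, abs(diff(secs)) / 60)

# Extra rule (assumption, not in the answer): an ID change also starts a chunk.
id_changed <- c(FALSE, obs$ID[-1] != obs$ID[-nrow(obs)])

# Each TRUE bumps the running total, so rows between breaks share a number.
obs$grp <- cumsum(gap_mins > 4 | id_changed)
obs$grp
# 0 0 1 1 2 2 3 3

# Summarise each chunk: ID, duration in seconds, first and last timestamp.
chunks <- do.call(rbind, lapply(split(obs, obs$grp), function(d) {
  data.frame(
    ID = d$ID[1],
    Duration_secs = as.numeric(difftime(d$Date[nrow(d)], d$Date[1],
                                        units = "secs")),
    Start = d$Date[1],
    End = d$Date[nrow(d)],
    stringsAsFactors = FALSE
  )
}))
chunks$Duration_secs
# 5 5 10 20
```

On this sample the ID rule changes nothing, because every ID change already coincides with a gap longer than 4 minutes. It only matters when different IDs appear back to back within the threshold.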
