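【Question title】: R find if value in column exceeds a threshold between two time periods from second df
【Posted】: 2019-02-11 14:04:17
【Question】:

Hopefully I can explain adequately what I am trying to do. I have df1 with the start and end times of activities. I want to use these times to see whether the vessel's speed (df2) exceeds a certain threshold between two fishing activities, in order to decide whether they should be separate activities (i.e. the vessel has steamed to a new location) or the same activity.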

df1 <- data.frame(
vessel_pln=c(rep("AU89",5)),
start_time=c("2018-11-02 05:14:26 GMT","2018-11-02 07:48:16 GMT","2018-11-02 09:03:28 GMT","2018-11-02 10:17:25 GMT","2018-11-05 06:39:12 GMT"),
start_lat=c(55.69713617,55.69693433,55.69539050,55.69043650,55.69103567), 
start_lon=c(-5.65051533,-5.65031783,-5.65317850,-5.65859250,-5.65830600),
end_time=c("2018-11-02 06:54:37 GMT","2018-11-02 08:55:24 GMT","2018-11-02 10:00:14 GMT","2018-11-02 11:55:47 GMT","2018-11-05 08:33:35 GMT"),
end_lat=c(55.69462700,55.69539367,55.69454683,55.69370050,55.69302200),
end_lon=c(-5.65454983,-5.65317550,-5.65567667,-5.65628133,-5.65317550),
activity=c(1,2,3,4,5),
new_activity=c(rep("NO",5)))

library(chron)
tt <- times(1:200/288)

df2 <- data.frame(
vessel_pln=c(rep("AU89",200)),
GPSTime=c(chron(rep("2/11/18", length = length(tt)), tt)),
Speed=c(runif(200,0,3)))
df2$GPSTime <- as.POSIXct(df2$GPSTime,format="(%d/%m/%y %H%M%S)",tz="GMT")
df2[108, "Speed"] <- 3.2 

I want to know whether 'Speed' (df2) > 3 at any point between the 'end_time' (df1) of row [i] and the 'start_time' (df1) of row [i+1]. If so, change the 'new_activity' (df1) column to "YES".
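
To make the rule concrete, here is a minimal base R sketch of it on toy data. The helper name `flag_new_activity`, the `threshold` argument and the toy values are assumptions for illustration only; it also assumes a single vessel and time columns that are already POSIXct.

```r
# Hypothetical helper (not from the original post): for each activity i, look at
# the GPS fixes between its end_time and the start_time of activity i + 1.
flag_new_activity <- function(activities, track, threshold = 3) {
  n <- nrow(activities)
  flag <- rep("NO", n)                      # the last row has no following activity
  for (i in seq_len(max(n - 1, 0))) {
    in_gap <- track$GPSTime >= activities$end_time[i] &
              track$GPSTime <= activities$start_time[i + 1]
    if (any(track$Speed[in_gap] > threshold, na.rm = TRUE)) flag[i] <- "YES"
  }
  flag
}

# Toy data (made-up values)
activities <- data.frame(
  start_time = as.POSIXct(c("2018-11-02 05:00:00", "2018-11-02 07:00:00",
                            "2018-11-02 09:00:00"), tz = "GMT"),
  end_time   = as.POSIXct(c("2018-11-02 06:00:00", "2018-11-02 08:00:00",
                            "2018-11-02 10:00:00"), tz = "GMT")
)
track <- data.frame(
  GPSTime = as.POSIXct(c("2018-11-02 06:30:00", "2018-11-02 08:30:00"), tz = "GMT"),
  Speed   = c(1.5, 3.2)
)

flag_new_activity(activities, track)
# "NO" "YES" "NO"
```

Only the second gap (08:00 to 09:00) contains a fix above the threshold, so only the second activity is flagged.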

With the data above, I should get the following result:

df3 <- data.frame(
vessel_pln=c(rep("AU89",5)),
start_time=c("2018-11-02 05:14:26 GMT","2018-11-02 07:48:16 GMT","2018-11-02 09:03:28 GMT","2018-11-02 10:17:25 GMT","2018-11-02 16:39:12 GMT"),
start_lat=c(55.69713617,55.69693433,55.69539050,55.69043650,55.69103567), 
start_lon=c(-5.65051533,-5.65031783,-5.65317850,-5.65859250,-5.65830600),
end_time=c("2018-11-02 06:54:37 GMT","2018-11-02 08:55:24 GMT","2018-11-02 10:00:14 GMT","2018-11-02 11:55:47 GMT","2018-11-02 18:33:35 GMT"),
end_lat=c(55.69462700,55.69539367,55.69454683,55.69370050,55.69302200),
end_lon=c(-5.65454983,-5.65317550,-5.65567667,-5.65628133,-5.65317550),
activity=c(1,2,3,4,5),
new_activity=c("NO","NO","YES","NO","NO"))

【Comments】:

  • It is unclear what you mean by 'Speed' (df2) > 3, since this value is always < 3
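  • Sorry, the data frames were only meant to give you an idea of the data structure; they did not contain a clear example of what I want to extract. I will edit them to make them more applicable.

Tags: r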


【Solution 1】:

Here is how you could handle this with data.table (plus a bit of magrittr for readability); it should be fast even for larger datasets:

library(data.table)
library(magrittr)

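# Remember the original columns so the result can be cut back to them
col_names <- names(df1)

# For each activity, build the gap to search: from its own end_time to the
# start_time of the next activity of the same vessel. The last activity has
# no successor, so its gap collapses to its own end_time.
df1 <- setDT(df1)[, lapply(.SD, as.character)] %>%
  .[, `:=` (end_join = as.POSIXct(end_time),
            start_join = shift(as.POSIXct(start_time), type = "lead")), by = vessel_pln] %>%
  .[is.na(start_join), start_join := as.POSIXct(as.character(end_time))]

df2 <- setDT(df2)[, lapply(.SD, as.character)][, `:=` (GPSTime = as.POSIXct(GPSTime))]

# Non-equi join: attach every GPS fix that falls inside an activity's gap.
# Speed was converted to character above, so convert it back before comparing.
final <- df2[df1, on = .(GPSTime <= start_join, GPSTime >= end_join, vessel_pln = vessel_pln)] %>%
  .[, new_activity := as.character(ifelse(any(as.numeric(Speed) > 3), "YES", "NO")), by = activity] %>%
  .[!duplicated(activity), ..col_names] %>%
  .[is.na(new_activity), new_activity := "NO"]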
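
The core of this approach is the non-equi join. As a rough, self-contained sketch of just that step (the tables `gaps` and `track`, their columns `gap_start` and `gap_end`, and all values are made up for illustration and are not the objects used in this answer):

```r
library(data.table)

# One row per gap between consecutive activities (hypothetical toy values)
gaps <- data.table(
  vessel_pln = "AU89",
  activity   = 1:2,
  gap_start  = as.POSIXct(c("2018-11-02 06:00:00", "2018-11-02 08:00:00"), tz = "GMT"),
  gap_end    = as.POSIXct(c("2018-11-02 07:00:00", "2018-11-02 09:00:00"), tz = "GMT")
)

# GPS fixes with speed
track <- data.table(
  vessel_pln = "AU89",
  GPSTime    = as.POSIXct(c("2018-11-02 06:30:00", "2018-11-02 08:30:00"), tz = "GMT"),
  Speed      = c(1.5, 3.2)
)

# For every gap, pull in the fixes of the same vessel whose GPSTime lies inside it.
# Gaps without any fix still appear once, with Speed set to NA.
joined <- track[gaps, on = .(vessel_pln, GPSTime >= gap_start, GPSTime <= gap_end)]

# Collapse to one flag per activity
flags <- joined[, .(new_activity = if (any(Speed > 3, na.rm = TRUE)) "YES" else "NO"),
                by = activity]
flags$new_activity
# "NO" "YES"
```

Keeping Speed numeric here avoids having to convert it back before the comparison.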

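Note that I have modified your data example slightly, because otherwise no matches between the dates can be found (in one df you have 11 February, in the other 2 November):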

library(chron) 

df1 <- data.frame(
  vessel_pln=c(rep("AU89",5)),
  start_time=c("2018-11-02 05:14:26 GMT","2018-11-02 07:48:16 GMT","2018-11-02 09:03:28 GMT","2018-11-02 10:17:25 GMT","2018-11-05 06:39:12 GMT"),
  start_lat=c(55.69713617,55.69693433,55.69539050,55.69043650,55.69103567), 
  start_lon=c(-5.65051533,-5.65031783,-5.65317850,-5.65859250,-5.65830600),
  end_time=c("2018-11-02 06:54:37 GMT","2018-11-02 08:55:24 GMT","2018-11-02 10:00:14 GMT","2018-11-02 11:55:47 GMT","2018-11-05 08:33:35 GMT"),
  end_lat=c(55.69462700,55.69539367,55.69454683,55.69370050,55.69302200),
  end_lon=c(-5.65454983,-5.65317550,-5.65567667,-5.65628133,-5.65317550),
  activity=c(1,2,3,4,5),
  new_activity=c(rep("NO",5)))

tt <- times(1:200/288)

df2 <- data.frame(
  vessel_pln=c(rep("AU89",200)),
  GPSTime=c(chron(rep("11/2/18", length = length(tt)), tt)),
  Speed=c(runif(200,0,3)))

df2$GPSTime <- as.POSIXct(df2$GPSTime,format="(%d/%m/%y %H%M%S)",tz="GMT")
df2[108, "Speed"] <- 3.2 

The output is now in fact all NO, because there is only one case with Speed > 3, and it does not fall between any end_time and the following start_time:

   vessel_pln              start_time   start_lat   start_lon                end_time     end_lat     end_lon activity new_activity
1:       AU89 2018-11-02 05:14:26 GMT 55.69713617 -5.65051533 2018-11-02 06:54:37 GMT   55.694627 -5.65454983        1           NO
2:       AU89 2018-11-02 07:48:16 GMT 55.69693433 -5.65031783 2018-11-02 08:55:24 GMT 55.69539367  -5.6531755        2           NO
3:       AU89 2018-11-02 09:03:28 GMT  55.6953905  -5.6531785 2018-11-02 10:00:14 GMT 55.69454683 -5.65567667        3           NO
4:       AU89 2018-11-02 10:17:25 GMT  55.6904365  -5.6585925 2018-11-02 11:55:47 GMT  55.6937005 -5.65628133        4           NO
5:       AU89 2018-11-05 06:39:12 GMT 55.69103567   -5.658306 2018-11-05 08:33:35 GMT   55.693022  -5.6531755        5           NO

However, if you modify the data slightly and replace 10:00:14 with 09:44:00 in the third row of df1$end_time, you get:

   vessel_pln              start_time   start_lat   start_lon                end_time     end_lat     end_lon activity new_activity
1:       AU89 2018-11-02 05:14:26 GMT 55.69713617 -5.65051533 2018-11-02 06:54:37 GMT   55.694627 -5.65454983        1           NO
2:       AU89 2018-11-02 07:48:16 GMT 55.69693433 -5.65031783 2018-11-02 08:55:24 GMT 55.69539367  -5.6531755        2           NO
3:       AU89 2018-11-02 09:03:28 GMT  55.6953905  -5.6531785 2018-11-02 09:44:00 GMT 55.69454683 -5.65567667        3          YES
4:       AU89 2018-11-02 10:17:25 GMT  55.6904365  -5.6585925 2018-11-02 11:55:47 GMT  55.6937005 -5.65628133        4           NO
5:       AU89 2018-11-05 06:39:12 GMT 55.69103567   -5.658306 2018-11-05 08:33:35 GMT   55.693022  -5.6531755        5           NO

【Comments】:

    【Solution 2】:

    First, to compare df1$start_time with df2$GPSTime, the two need to be of the same type.

    df1$start_time <- as.POSIXct(as.character(df1$start_time),format = "%Y-%m-%d %H:%M:%S", tz="GMT")
    df1$end_time <- as.POSIXct(as.character(df1$end_time),format = "%Y-%m-%d %H:%M:%S", tz="GMT")
    
    df2$GPSTime <- as.POSIXct(as.character(df2$GPSTime), format="(%d/%m/%y %H:%M:%S)", tz= 'GMT')
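
As a small illustration of why the conversion matters, the sketch below parses two of the timestamp strings from the question into POSIXct; the variable names `raw`, `parsed` and `gap` are made up here. Once parsed, the values can be compared and subtracted as times rather than as text.

```r
# Two timestamp strings in the same layout as df1 (trailing " GMT" included)
raw <- c("2018-11-02 05:14:26 GMT", "2018-11-02 06:54:37 GMT")

# The format covers date and time; the time zone is set explicitly through tz
parsed <- as.POSIXct(raw, format = "%Y-%m-%d %H:%M:%S", tz = "GMT")

inherits(parsed, "POSIXct")   # TRUE
parsed[1] < parsed[2]         # TRUE, compared as points in time

# Differences are available directly, here in minutes
gap <- difftime(parsed[2], parsed[1], units = "mins")
```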
    

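    Then you can join df1 and df2 and compare the times, filtering to keep only the rows whose GPSTime falls within an activity.

    library(dplyr)

    temp <- df1 %>% 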
      left_join(df2, by = 'vessel_pln') %>% 
      mutate(BETWEEN = (GPSTime >= start_time & GPSTime < end_time)) %>% 
      filter(BETWEEN == TRUE)
      #filter(Speed > 3)
    

    You can check that it worked, and finally filter to keep only Speed > 3 (I have not done that here because my example dataset contains no Speed > 3).

    temp %>% 
      filter(activity == 1) %>% 
      select(start_time, end_time, GPSTime, Speed) %>% 
      head()
    
    #            start_time            end_time             GPSTime     Speed
    # 1 2018-11-02 05:14:26 2018-11-02 06:54:37 2018-11-02 05:15:00 0.8461418
    # 2 2018-11-02 05:14:26 2018-11-02 06:54:37 2018-11-02 05:20:00 0.8610450
    # 3 2018-11-02 05:14:26 2018-11-02 06:54:37 2018-11-02 05:25:00 2.8171262
    # 4 2018-11-02 05:14:26 2018-11-02 06:54:37 2018-11-02 05:30:00 1.8165029
    # 5 2018-11-02 05:14:26 2018-11-02 06:54:37 2018-11-02 05:35:00 2.0697528
    # 6 2018-11-02 05:14:26 2018-11-02 06:54:37 2018-11-02 05:40:00 0.5855299
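
The same join-then-filter idea can also be pointed at the gaps between activities, which is what the question asks about. Below is a rough base R sketch using `merge()` instead of dplyr; the toy data and the names `acts`, `gaps`, `gap_start`, `gap_end` and `hits` are assumptions for illustration, and it assumes a single vessel with activities already sorted by time.

```r
# Toy activities and GPS track for one vessel (made-up values)
acts <- data.frame(
  vessel_pln = "AU89",
  activity   = 1:3,
  start_time = as.POSIXct(c("2018-11-02 05:00:00", "2018-11-02 07:00:00",
                            "2018-11-02 09:00:00"), tz = "GMT"),
  end_time   = as.POSIXct(c("2018-11-02 06:00:00", "2018-11-02 08:00:00",
                            "2018-11-02 10:00:00"), tz = "GMT")
)
track <- data.frame(
  vessel_pln = "AU89",
  GPSTime    = as.POSIXct(c("2018-11-02 06:30:00", "2018-11-02 08:30:00"), tz = "GMT"),
  Speed      = c(1.5, 3.2)
)

# The gap after activity i runs from its end_time to the start_time of activity i + 1
n <- nrow(acts)
gaps <- data.frame(
  vessel_pln = acts$vessel_pln[-n],
  activity   = acts$activity[-n],
  gap_start  = acts$end_time[-n],
  gap_end    = acts$start_time[-1]
)

# Join every gap with every fix of the same vessel, then keep fast fixes inside a gap
joined <- merge(gaps, track, by = "vessel_pln")
hits <- joined[joined$GPSTime >= joined$gap_start &
               joined$GPSTime <= joined$gap_end &
               joined$Speed > 3, ]

acts$new_activity <- ifelse(acts$activity %in% hits$activity, "YES", "NO")
acts$new_activity
# "NO" "YES" "NO"
```

Because the join pairs every gap with every fix of the vessel, the intermediate table can get large on long tracks.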
    

    【Comments】:
