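【Question title】: Flag rows with interval overlap in R
【Posted】: 2018-09-12 19:36:16
【Question】:

I have a data frame, df, of TV viewing data, and I want to run a QC check for overlapping viewing. The assumption is that, for each individual in a given household on a given day, each minute should be credited to only one station or channel.

For example, I want to flag rows 8 and 9, because an individual in a single household cannot plausibly watch two stations (62 and 67) at the same time (see start_hour_minute). Is there a way to flag these rows, i.e. a minute-by-minute check per individual per day?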

df <- data.frame(stringsAsFactors=FALSE,
         date = c("2018-09-02", "2018-09-02", "2018-09-02", "2018-09-02",
                  "2018-09-02", "2018-09-02", "2018-09-02", "2018-09-02",
                  "2018-09-02"),
         householdID = c(18101276L, 18101276L, 18102843L, 18102843L, 18102843L,
                  18102843L, 18104148L, 18104148L, 18104148L),
   Station_id = c(74L, 74L, 62L, 74L, 74L, 74L, 62L, 62L, 67L),
        IndID = c("aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa"),
        Start = c(111300L, 143400L, 030000L, 034900L, 064400L, 070500L, 060400L,
                  075100L, 075100L),
          End = c(111459L, 143759L, 033059L, 035359L, 064759L, 070559L, 060459L,
                  81559L, 81559L),
   start_hour_minute = c(1113L, 1434L, 0300L, 0349L, 0644L, 0705L, 0604L, 0751L, 0751L),
     end_hour_minute = c(1114L, 1437L, 0330L, 0353L, 0647L, 0705L, 0604L, 0815L, 0815L))

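【Comments】:

    Tags: r dplyr tidyverse lubridate


    【Solution 1】:

    You can group by the variables whose combination should correspond to a single row (for example, household-date-minute), then count the rows in each group (or the unique values of Station_id). Add flag = 1 if the row should be flagged, and flag = 0 otherwise.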

    library(dplyr)

    df %>% 
        group_by(date, householdID, start_hour_minute) %>% 
        mutate(flag = if_else(n() == 1, 0, 1))
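
The parenthetical about counting unique values of Station_id can be sketched as follows. This is a minimal illustration, assuming that IndID should also be part of the grouping and using a small hypothetical data frame, views, built from three rows of the question's data. With n_distinct(), two rows that repeat the same station in the same minute are not flagged.

```r
library(dplyr)

# Hypothetical sample: the last three rows of the question's data.
views <- data.frame(
  date = "2018-09-02",
  householdID = 18104148L,
  IndID = "aa",
  Station_id = c(62L, 62L, 67L),
  start_hour_minute = c(604L, 751L, 751L)
)

flagged <- views %>%
  group_by(date, householdID, IndID, start_hour_minute) %>%
  # Flag only when more than one distinct station shares the same start minute.
  mutate(flag = if_else(n_distinct(Station_id) > 1, 1, 0)) %>%
  ungroup()

print(flagged)
```

Here only the two rows starting at 751 are flagged, since they name different stations (62 and 67).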
    

    Or, if you want rows that match on all variables other than Station_id, you can do this:

    df %>% 
        group_by_at(vars(-Station_id)) %>% 
        mutate(flag = if_else(n() == 1, 0, 1))
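
In dplyr 1.0.0 and later, group_by_at() is superseded by across(). A rough equivalent of the snippet above, shown on a small hypothetical data frame (views is not part of the original post), might look like this:

```r
library(dplyr)

# Hypothetical sample data with the same kind of columns as the question.
views <- data.frame(
  householdID = c(1L, 1L, 2L),
  start_hour_minute = c(751L, 751L, 604L),
  Station_id = c(62L, 67L, 62L)
)

flagged <- views %>%
  # Group by every column except Station_id.
  group_by(across(-Station_id)) %>%
  # Flag groups containing more than one row.
  mutate(flag = if_else(n() == 1, 0, 1)) %>%
  ungroup()

print(flagged)
```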
    

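    【Comments】:

    • Thank you so much! I was struggling with lubridate::int_overlaps(). Thanks a lot.
    【Solution 2】:

    The lubridate package has interval class objects and the %within% function, which checks whether a timestamp falls within an interval. You can use these to get the flag.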
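
As a quick illustration of these building blocks, here is a minimal sketch with made-up timestamps (the variable names are hypothetical). It also shows int_overlaps(), the lubridate function mentioned in the comment above, which tests two intervals against each other.

```r
library(lubridate)

# One viewing session as an interval.
watch <- interval(ymd_hms("2018-09-02 07:51:00"), ymd_hms("2018-09-02 08:15:59"))

# %within% tests whether a timestamp falls inside the interval.
inside  <- ymd_hms("2018-09-02 08:00:00") %within% watch
outside <- ymd_hms("2018-09-02 06:04:00") %within% watch

# int_overlaps() tests whether two intervals share any time.
other   <- interval(ymd_hms("2018-09-02 08:10:00"), ymd_hms("2018-09-02 08:30:00"))
overlap <- int_overlaps(watch, other)

print(c(inside = inside, outside = outside, overlap = overlap))
```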

    Using the dummy data you provided above...

    library(dplyr)
    library(lubridate)

    data_out <- df %>% 
    # Get the hour, minute, and second values as standalone numerics.
    mutate(
        date = ymd(date),
        Start_Hour = floor(Start / 10000),
        Start_Minute = floor((Start - Start_Hour*10000) / 100),
        Start_Second = (Start - Start_Hour*10000) - Start_Minute*100,
        End_Hour = floor(End / 10000),
        End_Minute = floor((End - End_Hour*10000) / 100),
        End_Second = (End - End_Hour*10000) - End_Minute*100,
    # Use the hour, minute, second values to create start and end timestamps.
    # End_TS is built from the End_* columns (the original reused Start_*).
        Start_TS = as_datetime(date) + hours(Start_Hour) + minutes(Start_Minute) + seconds(Start_Second),
        End_TS = as_datetime(date) + hours(End_Hour) + minutes(End_Minute) + seconds(End_Second),
    # Create an interval object.
        Watch_Interval = interval(start = Start_TS, end = End_TS)
    ) %>% 
    # Group by the IDs.
    group_by(householdID, Station_id) %>% 
    # Flag where the household's interval overlaps with another time.
    mutate(
        overlap_flag = case_when(
            sum(Start_TS %within% as.list(Watch_Interval)) == 0 ~ 0,
            sum(Start_TS %within% as.list(Watch_Interval)) > 0 ~ 1,
            TRUE ~ NA_real_
        )
    ) %>% 
    # dplyr doesn't play nice with interval objects, so we should remove Watch_Interval.
    select(-Watch_Interval)
    

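    Use data_out %>% filter(overlap_flag == 1) to view the flagged rows.

    Note: the dplyr and lubridate packages do not always play well together, especially in older versions. You may need to update both packages.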
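
One thing to watch with any check of this kind: a row's own start time always lies within its own interval, so a comparison that includes the row itself can end up flagging every row. The following is a minimal sketch of an alternative, not the answer's method. It assumes that grouping by householdID and IndID is what is wanted, compares plain start and end timestamps instead of interval objects, and skips self-comparisons. The helper overlaps_other() and the data frame views are hypothetical names.

```r
library(dplyr)
library(lubridate)

# Hypothetical sample: the last three rows of the question's data as timestamps.
views <- data.frame(
  householdID = 18104148L,
  IndID = "aa",
  Station_id = c(62L, 62L, 67L),
  Start_TS = ymd_hms(c("2018-09-02 06:04:00", "2018-09-02 07:51:00", "2018-09-02 07:51:00")),
  End_TS   = ymd_hms(c("2018-09-02 06:04:59", "2018-09-02 08:15:59", "2018-09-02 08:15:59"))
)

# Hypothetical helper: for each row, TRUE if its [start, end] range
# overlaps the range of any OTHER row in the same group.
overlaps_other <- function(start, end) {
  vapply(seq_along(start), function(i) {
    any(start[i] <= end[-i] & end[i] >= start[-i])
  }, logical(1))
}

flagged <- views %>%
  group_by(householdID, IndID) %>%
  mutate(overlap_flag = as.numeric(overlaps_other(Start_TS, End_TS))) %>%
  ungroup()

print(flagged)
```

The pairwise comparison is quadratic in the group size, which should be acceptable for one person's viewing sessions but may be slow on very large groups.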

    【Comments】:
