【Question Title】: Flag consecutive dates by group - R
【Posted】: 2021-05-07 18:35:44
【Question Description】:

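Below is a sample of my data (room and date). I would like to generate the variables Goal1, Goal2, and Goal3. Every gap in the Date variable means the room was closed. My goal is to identify runs of consecutive dates within each room: Goal1 numbers the runs within a room, Goal2 is the first date of the run, and Goal3 is the last. (The Room column in the table is named Area in the code.)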

  Room    Date         Goal1     Goal2       Goal3
1 Upper A 2021-01-01   1         2021-01-01  2021-01-02
2 Upper A 2021-01-02   1         2021-01-01  2021-01-02
3 Upper A 2021-01-05   2         2021-01-05  2021-01-05
4 Upper A 2021-01-10   3         2021-01-10  2021-01-10
5 Upper B 2021-01-01   1         2021-01-01  2021-01-01
6 Upper B 2021-02-05   2         2021-02-05  2021-02-07
7 Upper B 2021-02-06   2         2021-02-05  2021-02-07
8 Upper B 2021-02-07   2         2021-02-05  2021-02-07

df <- data.frame("Area" = c("Upper A", "Upper A", "Upper A", "Upper A",
                            "Upper B", "Upper B", "Upper B", "Upper B"),
                "Date" = c("1/1/2021", "1/2/2021", "1/5/2021", "1/10/2021",
                           "1/1/2021", "2/5/2021", "2/6/2021", "2/7/2021"))
df$Date <- as.Date(df$Date, format = "%m/%d/%Y")

Thank you, Marvin

【Question Discussion】:

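  • Could you provide a reproducible example, and clarify whether the data you posted is what you start with or what you want to end up with?
  • I just updated my post. I hope it is clear now. Thanks!
  • @stribstrib See above.
  • Should I repost? I haven't received any replies. Thanks @stribstrib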

Tags: r date dplyr group-by cumsum


【Solution 1】:

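You can also do it like this. Note that Goal1 is computed on the whole data frame before the grouping takes effect, so its numbering continues across areas (4 and 5 for Upper B) instead of restarting at 1 as in the question. Goal2 and Goal3 match the desired output.

library(dplyr)

# Assumes df is already sorted by Area and Date, because the differences
# are taken in row order before arrange() runs.
# A new run starts wherever the gap to the previous row is not exactly 1 day.
df %>% group_by(Area, Goal1 = cumsum(c(0, diff(Date)) != 1)) %>%
  arrange(Area, Date) %>%
  mutate(Goal2 = min(Date),   # first date of the run
         Goal3 = max(Date))   # last date of the run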

# A tibble: 8 x 5
# Groups:   Area, Goal1 [5]
  Area    Date       Goal1 Goal2      Goal3     
  <chr>   <date>     <int> <date>     <date>    
1 Upper A 2021-01-01     1 2021-01-01 2021-01-02
2 Upper A 2021-01-02     1 2021-01-01 2021-01-02
3 Upper A 2021-01-05     2 2021-01-05 2021-01-05
4 Upper A 2021-01-10     3 2021-01-10 2021-01-10
5 Upper B 2021-01-01     4 2021-01-01 2021-01-01
6 Upper B 2021-02-05     5 2021-02-05 2021-02-07
7 Upper B 2021-02-06     5 2021-02-05 2021-02-07
8 Upper B 2021-02-07     5 2021-02-05 2021-02-07
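
If Goal1 should restart at 1 for every area, as in the table from the question, one option is to compute the run counter after grouping by Area rather than before. The following is a minimal sketch of that variant, not the answerer's code; the name `res` is only illustrative, and the data frame is rebuilt here so the snippet runs on its own.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(Area = rep(c("Upper A", "Upper B"), each = 4),
                 Date = c("1/1/2021", "1/2/2021", "1/5/2021", "1/10/2021",
                          "1/1/2021", "2/5/2021", "2/6/2021", "2/7/2021"))
df$Date <- as.Date(df$Date, format = "%m/%d/%Y")

res <- df %>%
  arrange(Area, Date) %>%                      # sort first so diff() sees ordered dates
  group_by(Area) %>%                           # counter is computed separately per Area
  mutate(Goal1 = cumsum(c(TRUE, diff(Date) != 1))) %>%  # TRUE marks the start of a run
  group_by(Area, Goal1) %>%                    # one group per run of consecutive dates
  mutate(Goal2 = min(Date),                    # first date of the run
         Goal3 = max(Date)) %>%                # last date of the run
  ungroup()

res$Goal1
# 1 1 2 3 1 2 2 2
```

Putting `arrange()` before the counter also removes the dependence on the input row order.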

【Discussion】:

    【Solution 2】:
    # Original Data (Note I use a different method to convert the Date to date format below)
    df <- data.frame("Area" = c("Upper A", "Upper A", "Upper A", "Upper A",
                                    "Upper B", "Upper B", "Upper B", "Upper B"),
                        "Date" = c("1/1/2021", "1/2/2021", "1/5/2021", "1/10/2021",
                                   "1/1/2021", "2/5/2021", "2/6/2021", "2/7/2021"))
    

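    Here is one possible solution. I create an extra column with a nested if_else() statement that flags the start date of each "group" of consecutive dates. I left the extra columns in the final dataset to better illustrate what the code is doing.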

    library(dplyr)     # provides %>%, mutate(), group_by(), if_else(), lag()
    library(lubridate) # I suggest lubridate for working with dates;
    # it fits the dplyr/tidyverse syntax

    df.grouped <- df %>% 
      mutate(Date = mdy(Date)) %>% # convert the character strings to dates (month-day-year)
      arrange(Area, Date) %>% # sort by Area, then Date
      group_by(Area) %>% # group by Area
      mutate(group_start = if_else(row_number() == 1, 1, # group_start is 1 on the first day of each consecutive run, 0 otherwise
                                if_else(Date-lag(Date) == 1, 0, 1)),
             group_id = cumsum(group_start)) %>%  # the running sum of group_start gives each run its own id within the Area
      group_by(Area, group_id) %>% # re-group by Area AND group_id
      mutate(start_date = min(Date), # min (start) and max (end) date of each run
             end_date = max(Date))
    

    Final result:

    > df.grouped
    # A tibble: 8 x 6
    # Groups:   Area, group_id [5]
      Area    Date       group_start group_id start_date end_date  
      <chr>   <date>           <dbl>    <dbl> <date>     <date>    
    1 Upper A 2021-01-01           1        1 2021-01-01 2021-01-02
    2 Upper A 2021-01-02           0        1 2021-01-01 2021-01-02
    3 Upper A 2021-01-05           1        2 2021-01-05 2021-01-05
    4 Upper A 2021-01-10           1        3 2021-01-10 2021-01-10
    5 Upper B 2021-01-01           1        1 2021-01-01 2021-01-01
    6 Upper B 2021-02-05           1        2 2021-02-05 2021-02-07
    7 Upper B 2021-02-06           0        2 2021-02-05 2021-02-07
    8 Upper B 2021-02-07           0        2 2021-02-05 2021-02-07
      
    

    【Discussion】:

    • The original "Date" data converted with as.Date() should work fine too; just remove the initial mutate() line from the solution.
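
As a sketch of what that comment describes, the pipeline from Solution 2 can be run on the data frame from the question, where Date has already been converted with as.Date(), with the mdy() step dropped. lubridate is then not needed. The trailing `ungroup()` is an addition made here for convenience, not part of the original answer.

```r
library(dplyr)

# Data as prepared in the question: Date is converted with as.Date()
df <- data.frame(Area = rep(c("Upper A", "Upper B"), each = 4),
                 Date = c("1/1/2021", "1/2/2021", "1/5/2021", "1/10/2021",
                          "1/1/2021", "2/5/2021", "2/6/2021", "2/7/2021"))
df$Date <- as.Date(df$Date, format = "%m/%d/%Y")

df.grouped <- df %>%
  arrange(Area, Date) %>%                       # no mdy() step: Date is already a Date
  group_by(Area) %>%
  mutate(group_start = if_else(row_number() == 1, 1,
                               if_else(Date - lag(Date) == 1, 0, 1)),
         group_id = cumsum(group_start)) %>%    # run id within each Area
  group_by(Area, group_id) %>%
  mutate(start_date = min(Date),
         end_date = max(Date)) %>%
  ungroup()                                     # added here; drops the grouping
```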