Answers: Difference in Dates Based on Value of Separate Variable

【Question Title】: Difference in Dates Based on Value of Separate Variable
【Posted】: 2020-09-16 09:34:21
【Question】:

I have a data frame that looks like this:

        Date Value Value_Increase
1 2020-05-01     5          FALSE
2 2020-05-02     4          FALSE
3 2020-05-03    10           TRUE
4 2020-05-04     9          FALSE
5 2020-05-05     7          FALSE
6 2020-05-06    12           TRUE
7 2020-05-07     8          FALSE

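I want to create a new column that gives the number of days since the Value column last increased.

The result would look like the data frame below.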

        Date Value Value_Increase Days_Since_Value_Increase
1 2020-05-01     5          FALSE                        NA
2 2020-05-02     4          FALSE                        NA
3 2020-05-03    10           TRUE                        NA
4 2020-05-04     9          FALSE                         1
5 2020-05-05     7          FALSE                         2
6 2020-05-06    12           TRUE                         3
7 2020-05-07     8          FALSE                         1

Any help or suggestions are appreciated, especially ones that use a dplyr approach.

Code to create a working example:

Date <- as.Date(c("2020-05-01", "2020-05-02", "2020-05-03", "2020-05-04", "2020-05-05", "2020-05-06", "2020-05-07"))
Value <- c(5, 4, 10, 9, 7, 12, 8)
Value_Increase <- c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
df <- data.frame(Date, Value, Value_Increase)

【Question Comments】:

Tags: r dataframe datetime time dplyr

【Solution 1】:

If you want it to be robust to missing days:

library(dplyr)

df %>%
  group_by(g = cumsum(lag(Value_Increase, default = 0))) %>%
  mutate(Days_Since_Value_Increase = ifelse(g == 0, NA, Date - min(Date) + 1))

# A tibble: 7 x 4
# Groups:   g [3]
  Date       Value     g Days_Since_Value_Increase
  <date>     <dbl> <dbl>                     <dbl>
1 2020-05-01     5     0                        NA
2 2020-05-02     4     0                        NA
3 2020-05-03    10     0                        NA
4 2020-05-04     9     1                         1
5 2020-05-05     7     1                         2
6 2020-05-06    12     1                         3
7 2020-05-07     8     2                         1
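
The snippet above counts from the first row of each group, so it still presumes that row falls one day after the increase. If that day can be missing too, one option is to measure from the date of the increase itself. This is only a sketch: it assumes the rows are sorted by Date, it uses tidyr::fill (not part of the original answer), and `df_gap` and `last_increase` are made-up names for illustration.

```r
library(dplyr)
library(tidyr)

# Same data as the question, but with 2020-05-04 removed (hypothetical gap)
df_gap <- data.frame(
  Date = as.Date(c("2020-05-01", "2020-05-02", "2020-05-03",
                   "2020-05-05", "2020-05-06", "2020-05-07")),
  Value = c(5, 4, 10, 7, 12, 8),
  Value_Increase = c(FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

res <- df_gap %>%
  # date of the previous row when that row was an increase, otherwise NA
  mutate(last_increase = if_else(lag(Value_Increase, default = FALSE),
                                 lag(Date), as.Date(NA))) %>%
  # carry the most recent increase date down to the following rows
  fill(last_increase) %>%
  mutate(Days_Since_Value_Increase = as.numeric(Date - last_increase)) %>%
  select(-last_increase)

res$Days_Since_Value_Increase
# NA NA NA 2 3 1
```

Here 2020-05-05 gets 2 rather than 1, because the difference is taken against 2020-05-03 instead of against the first date in the group.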

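【Comments】:

【Solution 2】:

One tidyverse approach is to group the rows with cumsum, so that the number of days since the last value increase is given by row_number() within each group. This assumes consecutive rows are exactly one day apart.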

library(tidyverse)

df %>%
  group_by(g = cumsum(lag(Value_Increase, default = 0))) %>%
  mutate(Days_Since_Value_Increase = ifelse(g == 0, NA, row_number()))

Output

# A tibble: 7 x 5
# Groups:   g [3]
  Date       Value Value_Increase     g Days_Since_Value_Increase
  <date>     <dbl> <lgl>          <dbl>                     <int>
1 2020-05-01     5 FALSE              0                        NA
2 2020-05-02     4 FALSE              0                        NA
3 2020-05-03    10 TRUE               0                        NA
4 2020-05-04     9 FALSE              1                         1
5 2020-05-05     7 FALSE              1                         2
6 2020-05-06    12 TRUE               1                         3
7 2020-05-07     8 FALSE              2                         1
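
In case the grouping key is not obvious, here is a small base R walk-through of what cumsum(lag(...)) produces for the question's data. It is only meant as an illustration: `shifted`, `pos` and `days` are made-up names, and head(x, -1) with a leading FALSE stands in for dplyr's lag(x, default = FALSE).

```r
Value_Increase <- c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)

# Shift the flag down by one row, so the row AFTER an increase starts a new group
shifted <- c(FALSE, head(Value_Increase, -1))

# Running count of increases seen so far; 0 means "no increase yet"
g <- cumsum(shifted)
g
# 0 0 0 1 1 1 2

# Position of each row within its group (what row_number() gives per group)
pos <- ave(seq_along(g), g, FUN = seq_along)

# Rows before the first increase have no meaningful count
days <- ifelse(g == 0, NA, pos)
days
# NA NA NA 1 2 3 1
```

The shift is what makes the row with the increase itself (row 6) count as day 3 of the previous run instead of day 0 of a new one.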

【Comments】:

【Solution 3】:

We can use case_when from dplyr after creating a grouping variable from 'Value_Increase' with cumsum and lag:

library(dplyr)
df %>%
  group_by(g = cumsum(lag(Value_Increase, default = 0))) %>%
  mutate(Days_Since_Value_Increase = case_when(g != 0 ~  row_number())) %>%
  ungroup %>%
  select(-g)
# A tibble: 7 x 5
#  Date       Value Value_Increase Drop_From_Prev_Value Days_Since_Value_Increase
#  <date>     <dbl> <lgl>                         <dbl>                     <int>
#1 2020-05-01     5 FALSE                            NA                        NA
#2 2020-05-02     4 FALSE                             1                        NA
#3 2020-05-03    10 TRUE                             -6                        NA
#4 2020-05-04     9 FALSE                             1                         1
#5 2020-05-05     7 FALSE                             2                         2
#6 2020-05-06    12 TRUE                             -5                         3
#7 2020-05-07     8 FALSE                             4                         1

Or use rowid from data.table:

library(data.table)
df %>% 
  mutate(Days_Since_Value_Increase = replace(rowid(cumsum(lag(Value_Increase,
            default = 0))), 
             seq_len(which.max(Value_Increase)), NA))
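
The snippet above mixes dplyr's mutate and lag with data.table's rowid, so both packages need to be loaded. For anyone who prefers to stay in data.table only, a rough equivalent might look like the following. It is a sketch rather than part of the original answer; `dt` and the temporary column `g` are names chosen here, and shift() stands in for lag().

```r
library(data.table)

dt <- data.table(
  Date = as.Date("2020-05-01") + 0:6,
  Value = c(5, 4, 10, 9, 7, 12, 8),
  Value_Increase = c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
)

# Grouping key: number of increases seen before the current row
dt[, g := cumsum(shift(Value_Increase, fill = FALSE))]

# Within each group, count rows; group 0 (before any increase) stays NA
dt[, Days_Since_Value_Increase := if (g == 0L) NA_integer_ else seq_len(.N), by = g]

# Drop the helper column
dt[, g := NULL]

dt$Days_Since_Value_Increase
# NA NA NA 1 2 3 1
```

Like the row_number() version, this counts rows, so it assumes one row per day.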

【Comments】: