Filter one row every year: answers

【Question title】: Filter one row every year
【Posted】: 2017-02-08 19:43:14
【Question description】:

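I am trying to filter a list of dates so that it keeps at most one date per year, with the one-year window restarting at every date that is kept.

In the table below, I want to keep only the rows where include=1 (in this example, I created the include column by hand). If you look closely:

id=10 is included because it is more than a year after id=1, whereas id=9 is not yet.
id=22 is included because it is more than a year after id=10, whereas id=21 is not yet.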
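The rule can be checked by hand for the first window. This is a small sketch that assumes "a year" means 365 days, which is the threshold the answers use; the variable names are only for illustration.

```r
# Dates for id = 1, 9 and 10, copied from the table
d1  <- as.Date("2008-02-26")
d9  <- as.Date("2009-01-28")
d10 <- as.Date("2009-05-19")

# Days elapsed since the last included date (id = 1)
gap9  <- as.numeric(d9 - d1)    # 337 days: still inside the window, so excluded
gap10 <- as.numeric(d10 - d1)   # 448 days: past the window, so included

# Assumption: "more than a year" means strictly more than 365 days
c(id9 = gap9 > 365, id10 = gap10 > 365)
```

Once id=10 is kept, the same comparison starts over from 2009-05-19.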

The table, sorted by testdate in ascending order:

| id |  testdate  | include (the column I want) |
|:--:|:----------:|:-------:|
|  1 | 2008-02-26 |    1*   |
|  2 | 2008-03-07 |    0    |
|  3 | 2008-04-03 |    0    |
|  4 | 2008-04-25 |    0    |
|  5 | 2008-07-23 |    0    |
|  6 | 2008-10-09 |    0    |
|  7 | 2008-10-28 |    0    |
|  8 | 2009-01-14 |    0    |
|  9 | 2009-01-28 |    0    |
| 10 | 2009-05-19 |    1*   |
| 11 | 2009-06-05 |    0    |
| 12 | 2009-06-05 |    0    |
| 13 | 2009-06-26 |    0    |
| 14 | 2009-07-15 |    0    |
| 15 | 2009-07-15 |    0    |
| 16 | 2009-08-18 |    0    |
| 17 | 2009-08-18 |    0    |
| 18 | 2009-09-08 |    0    |
| 19 | 2009-09-25 |    0    |
| 20 | 2010-03-19 |    0    |
| 21 | 2010-04-06 |    0    |
| 22 | 2010-06-30 |    1*   |
| 23 | 2010-10-07 |    0    |
| 24 | 2010-10-21 |    0    |
| 25 | 2010-10-30 |    0    |
| 26 | 2010-12-10 |    0    |
| 27 | 2011-03-04 |    0    |
| 28 | 2011-05-11 |    0    |
| 29 | 2012-03-08 |    1*   |
| 30 | 2012-03-23 |    0    |
| 31 | 2012-09-13 |    0    |
| 32 | 2013-03-21 |    1*   |
| 33 | 2014-10-08 |    1*   |

My attempt with the dplyr library:

library(dplyr)

# Assumption: the data frame is named datum, as in the first answer's code
datum %>%
  # calculate interval
  mutate(interval = as.double(difftime(testdate, lag(testdate), units = 'days'))) %>%
  # accumulate interval in days
  mutate(interval_cum = if_else(is.na(interval), -1, interval + lag(interval))) %>%
  mutate(interval_cum2 = if_else(lag(interval) > 365, 0, interval_cum)) %>%
  # filter out first row and all relevant accumulated intervals
  mutate(include = if_else(row_number(testdate) == 1 | interval > 365 | interval_cum == -1 | interval_cum2 > 365, 1, 0, 0))

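But this misses ids 10, 22, and 32, because lag() only looks back one row and I cannot accumulate across several rows. Does anyone know an efficient way to do this in R?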

Raw data input for R:

structure(list(testdate = structure(c(13935, 13945, 13972, 13994, 
14083, 14161, 14180, 14258, 14272, 14383, 14400, 14400, 14421, 
14440, 14440, 14474, 14474, 14495, 14512, 14687, 14705, 14790, 
14889, 14903, 14912, 14953, 15037, 15105, 15407, 15422, 15596, 
15785, 16351), class = "Date"), include = c(1, 0, 0, 0, 0, 0, 
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 
0, 1, 0, 0, 1, 1)), .Names = c("testdate", "include"), row.names = c(NA, 
-33L), class = c("tbl_df", "tbl", "data.frame"))

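【Question discussion】:

Indeed. As I said, I created the column by hand and am looking for a way to compute the include column.
I think these Q&As may be relevant here: Subset time series so that selected rows differs by a certain minimum time and How to filter rows based on difference in dates between rows in R?

Tags: r dplyr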

【Solution 1】:

After the loop, start_date holds the vector of dates to include:

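# Start with the first date; every kept date restarts the 365-day window
start_date <- datum$testdate[1]
# Iterate by index: for (x in datum$testdate) would strip the Date class from x
for (i in seq_along(datum$testdate)) {
  x <- datum$testdate[i]
  check_new <- start_date[length(start_date)] + 365
  if (x > check_new) {
    start_date <- c(start_date, x)
  }
}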
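To get from the kept dates back to an include column, one option is to record row positions instead of dates, so that rows sharing a date are not all flagged. The sketch below is only an illustration of that idea: `flag_yearly` and its `gap` parameter are hypothetical names, and the dates are assumed to be sorted in ascending order.

```r
# Hypothetical helper: returns one 1/0 flag per date (dates sorted ascending)
flag_yearly <- function(dates, gap = 365) {
  include <- integer(length(dates))        # all zeros to start with
  if (length(dates) == 0) return(include)
  include[1] <- 1L                         # the first row is always kept
  last_kept <- dates[1]
  for (i in seq_along(dates)[-1]) {
    if (dates[i] > last_kept + gap) {      # strictly more than `gap` days later
      include[i] <- 1L
      last_kept <- dates[i]                # the window restarts here
    }
  }
  include
}

# The 33 test dates from the question, as days since 1970-01-01
testdate <- as.Date(c(13935, 13945, 13972, 13994, 14083, 14161, 14180, 14258,
                      14272, 14383, 14400, 14400, 14421, 14440, 14440, 14474,
                      14474, 14495, 14512, 14687, 14705, 14790, 14889, 14903,
                      14912, 14953, 15037, 15105, 15407, 15422, 15596, 15785,
                      16351), origin = "1970-01-01")

include <- flag_yearly(testdate)
which(include == 1)
# [1]  1 10 22 29 32 33
```

The flagged positions match the rows marked 1* in the question's table.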

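【Discussion】:

This is neat! But are loops considered acceptable practice in R? I come from MySQL, so I tend to think procedurally, and I did not think R was meant to be used that way. But this works!

【Solution 2】: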

# Calculate the difference in days between each row and the first row
difference = df$testdate - df$testdate[1]

# The first value > 365 marks the start of a new year.
# Subtract that value from every value that is still > 365,
# and repeat until no value exceeds 365.
while (max(difference) > 365) {
    over = which(difference > 365)
    difference[over] = difference[over] - difference[over][1]
}

# Rows where difference is 0 are the ones to extract from df
# (a row that shares its date with a kept row also ends up at 0)
df[difference == 0,]

Or use a custom function like this:

identify_new_year = function(x){
    indices = integer(0)
    start = x[1]
    ind = 1
    indices[ind] = ind
    # seq_along(x)[-1] is empty for a single date, unlike 2:length(x)
    for (i in seq_along(x)[-1]){
        # Note: >= counts a gap of exactly 365 days as a new year
        if (as.numeric(x[i] - start) >= 365){
            ind = ind + 1
            indices[ind] = i
            start = x[i]
        }
    }
    return(indices)
}

identify_new_year(df$testdate)
#[1]  1 10 22 29 32 33
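The running start date can also be carried through the vector with base R's `Reduce(accumulate = TRUE)` instead of an explicit `for` loop. This is a different technique from the answer above, sketched here under the assumption that a new year starts strictly more than 365 days after the last kept date; it is still a sequential scan, so it changes the style and not the amount of work.

```r
# The 33 test dates from the question, as days since 1970-01-01
days <- c(13935, 13945, 13972, 13994, 14083, 14161, 14180, 14258, 14272,
          14383, 14400, 14400, 14421, 14440, 14440, 14474, 14474, 14495,
          14512, 14687, 14705, 14790, 14889, 14903, 14912, 14953, 15037,
          15105, 15407, 15422, 15596, 15785, 16351)

# For every row, compute the most recent kept date ("anchor").
# Plain day counts are used so Reduce() need not preserve the Date class.
anchor <- unlist(Reduce(
  function(last_kept, x) if (x - last_kept > 365) x else last_kept,
  days,
  accumulate = TRUE
))

# A row is kept where the anchor changes, plus the very first row
keep <- c(TRUE, diff(anchor) != 0)
which(keep)
# [1]  1 10 22 29 32 33
```

With a data frame, `df[keep, ]` would then select the same rows, provided `days` was built from `as.numeric(df$testdate)`.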

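【Discussion】:

Now that is great.
So the complexity is better? All of these solutions use loops. I like this solution, but I do not find it more readable.
To avoid comparing every row, another option is findInterval: d = df$testdate; inds = 1L; while((i <- findInterval(d[inds[length(inds)]] + 365, d) + 1L) <= length(d)) inds = c(inds, i); inds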
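The findInterval one-liner is easier to follow when unrolled. The sketch below restates the same idea; `pick_yearly` and `gap` are hypothetical names, the dates are converted to plain day counts first, and the input is assumed to be sorted in ascending order (which `findInterval` requires).

```r
# Hypothetical wrapper around the findInterval idea from the comment above
pick_yearly <- function(d, gap = 365) {
  days <- as.numeric(d)                 # work on day counts, not Date objects
  if (length(days) == 0) return(integer(0))
  inds <- 1L                            # the first date is always kept
  repeat {
    last_kept <- days[inds[length(inds)]]
    # findInterval() returns the position of the last date <= last_kept + gap,
    # so the next position is the first date strictly beyond the window
    nxt <- findInterval(last_kept + gap, days) + 1L
    if (nxt > length(days)) break       # no date lies beyond the window
    inds <- c(inds, nxt)
  }
  inds
}

# The 33 test dates from the question, as days since 1970-01-01
d <- as.Date(c(13935, 13945, 13972, 13994, 14083, 14161, 14180, 14258, 14272,
               14383, 14400, 14400, 14421, 14440, 14440, 14474, 14474, 14495,
               14512, 14687, 14705, 14790, 14889, 14903, 14912, 14953, 15037,
               15105, 15407, 15422, 15596, 15785, 16351), origin = "1970-01-01")

pick_yearly(d)
# [1]  1 10 22 29 32 33
```

Each pass jumps straight to the next kept row, so the loop runs once per kept date instead of once per row.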