【Posted】: 2017-02-08 19:43:14
【Question】:
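I am trying to filter a list of dates so that it contains only one date per year, with the one-year window resetting at each included date.
In the table below, I want to keep only the rows where include=1 (in this example I created the include column by hand).
If you look closely:
- id=10 is included because it is more than a year after id=1, while id=9 is not yet.
- id=22 is included because it is more than a year after id=10, while id=21 is not yet.
The table, obviously sorted by testdate in ascending order: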
| id | testdate | include (I want this column) |
|:--:|:----------:|:-------:|
| 1 | 2008-02-26 | 1* |
| 2 | 2008-03-07 | 0 |
| 3 | 2008-04-03 | 0 |
| 4 | 2008-04-25 | 0 |
| 5 | 2008-07-23 | 0 |
| 6 | 2008-10-09 | 0 |
| 7 | 2008-10-28 | 0 |
| 8 | 2009-01-14 | 0 |
| 9 | 2009-01-28 | 0 |
| 10 | 2009-05-19 | 1* |
| 11 | 2009-06-05 | 0 |
| 12 | 2009-06-05 | 0 |
| 13 | 2009-06-26 | 0 |
| 14 | 2009-07-15 | 0 |
| 15 | 2009-07-15 | 0 |
| 16 | 2009-08-18 | 0 |
| 17 | 2009-08-18 | 0 |
| 18 | 2009-09-08 | 0 |
| 19 | 2009-09-25 | 0 |
| 20 | 2010-03-19 | 0 |
| 21 | 2010-04-06 | 0 |
| 22 | 2010-06-30 | 1* |
| 23 | 2010-10-07 | 0 |
| 24 | 2010-10-21 | 0 |
| 25 | 2010-10-30 | 0 |
| 26 | 2010-12-10 | 0 |
| 27 | 2011-03-04 | 0 |
| 28 | 2011-05-11 | 0 |
| 29 | 2012-03-08 | 1* |
| 30 | 2012-03-23 | 0 |
| 31 | 2012-09-13 | 0 |
| 32 | 2013-03-21 | 1* |
| 33 | 2014-10-08 | 1* |
My attempt with the dplyr library:
library(dplyr)
# df is the data frame given below (the name df is assumed here)
df %>%
  # calculate interval
  mutate(interval = as.double(difftime(testdate, lag(testdate), units = 'days'))) %>%
  # accumulate interval in days
  mutate(interval_cum = if_else(is.na(interval), -1, interval + lag(interval))) %>%
  mutate(interval_cum2 = if_else(lag(interval) > 365, 0, interval_cum)) %>%
  # flag the first row and all relevant accumulated intervals
  mutate(include = if_else(row_number(testdate) == 1 | interval > 365 | interval_cum == -1 | interval_cum2 > 365, 1, 0, 0))
But this misses ids 10, 22 and 32, because lag() only looks one row back and I cannot carry the accumulated interval across multiple rows. Does anyone know an efficient way to do this in R?
Raw data input for R:
structure(list(testdate = structure(c(13935, 13945, 13972, 13994,
14083, 14161, 14180, 14258, 14272, 14383, 14400, 14400, 14421,
14440, 14440, 14474, 14474, 14495, 14512, 14687, 14705, 14790,
14889, 14903, 14912, 14953, 15037, 15105, 15407, 15422, 15596,
15785, 16351), class = "Date"), include = c(1, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 1, 0, 0, 1, 1)), .Names = c("testdate", "include"), row.names = c(NA,
-33L), class = c("tbl_df", "tbl", "data.frame"))
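For reference, the rule described in the question can be sketched as a plain loop that remembers the last included date. This is only a minimal sketch, not a definitive solution: the helper name `flag_yearly` and its `gap_days` parameter are hypothetical, "more than a year" is assumed to mean more than 365 days (the threshold used in the attempt above), and the dates are assumed to be sorted in ascending order.

```r
# Hypothetical helper (not part of dplyr or base R): returns 1 for every date
# that lies more than `gap_days` after the most recently included date,
# and 0 otherwise. Assumes `dates` is a Date vector sorted ascending.
flag_yearly <- function(dates, gap_days = 365) {
  include <- integer(length(dates))
  if (length(dates) == 0) return(include)

  # The first date is always included and starts the first window.
  include[1] <- 1L
  last_kept <- dates[1]

  for (i in seq_along(dates)[-1]) {
    days_since <- as.numeric(difftime(dates[i], last_kept, units = "days"))
    if (days_since > gap_days) {
      include[i] <- 1L
      last_kept <- dates[i]  # reset the window at each included date
    }
  }
  include
}

# Small check using the dates of ids 1, 9, 10, 21 and 22 from the table.
sample_dates <- as.Date(c("2008-02-26", "2009-01-28", "2009-05-19",
                          "2010-04-06", "2010-06-30"))
flag_yearly(sample_dates)
```

If the structure above is stored in a data frame called `df` (an assumed name), the same helper could be used inside `mutate(include = flag_yearly(testdate))`. The loop is sequential on purpose: each decision depends on the previously included row, which is what a fixed-offset `lag()` cannot express.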
【Comments】:
- Indeed. As I said, I created the column manually and am looking for a way to set the include column.