Check if a date is within an interval in R

【Question title】: Check if a date is within an interval in R
【Posted】: 2017-01-06 01:03:38
【Question】:

I have defined these three intervals:

YEAR_1  <- interval(ymd('2002-09-01'), ymd('2003-08-31'))
YEAR_2  <- interval(ymd('2003-09-01'), ymd('2004-08-31')) 
YEAR_3  <- interval(ymd('2004-09-01'), ymd('2005-08-31'))

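(In real life, I have 50 of them.)

I have a data frame (called df) with a column full of dates in lubridate format.

I want to append a new column to df that holds the appropriate value YEAR_n, depending on which interval the date falls in.

Something like:

df$YR <- ifelse(df$DATE %within% YEAR_1, 1, NA)

But I'm not sure how to proceed. I think I need to use apply somehow?

Here is my data frame: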

structure(c(1055289600, 1092182400, 1086220800, 1074556800, 1109289600, 
1041897600, 1069200000, 1047427200, 1072656000, 1048636800, 1092873600, 
1090195200, 1051574400, 1052179200, 1130371200, 1242777600, 1140652800, 
1137974400, 1045526400, 1111104000, 1073952000, 1052870400, 1087948800, 
1053993600, 1039564800, 1141603200, 1074038400, 1105315200, 1060560000, 
1072051200, 1046217600, 1107129600, 1088553600, 1071619200, 1115596800, 
1050364800, 1147046400, 1083628800, 1056412800, 1159747200, 1087257600, 
1201478400, 1120521600, 1066176000, 1034553600, 1057622400, 1078876800, 
1010880000, 1133913600, 1098230400, 1170806400, 1037318400, 1070409600, 
1091577600, 1057708800, 1182556800, 1091059200, 1058227200, 1061337600, 
1034121600, 1067644800, 1039478400, 1022198400, 1063065600, 1096329600, 
1049760000, 1081728000, 1016150400, 1029801600, 1059350400, 1087257600, 
1181692800, 1310947200, 1125446400, 1057104000, NA, 1085529600, 
1037664000, 1091577600, 1080518400, 1110758400, 1092787200, 1094601600, 
1169424000, 1232582400, 1058918400, 1021420800, 1133136000, 1030320000, 
1060732800, 1035244800, 1090800000, 1129161600, 1055808000, 1060646400, 
1028678400, 1075852800, 1144627200, 1111363200, 1070236800), class = c("POSIXct", 
"POSIXt"), tzone = "UTC")

【Comments】:

Related question - stackoverflow.com/questions/41132081/…

Tags: r lubridate

【Solution 1】:

Everyone has a favorite tool for this; mine happens to be data.table, because of what it refers to as its dt[i, j, by] logic.

library(data.table)

dt <- data.table(date = as.IDate(pt))   # pt: the POSIXct vector from the question

dt[, YR := 0.0 ]                        # I am using a numeric for year here...

dt[ date >= as.IDate("2002-09-01") & date <= as.IDate("2003-08-31"), YR := 1 ]
dt[ date >= as.IDate("2003-09-01") & date <= as.IDate("2004-08-31"), YR := 2 ]
dt[ date >= as.IDate("2004-09-01") & date <= as.IDate("2005-08-31"), YR := 3 ]

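I create a data.table object, converting your times to dates for later comparison. I then set up a new column, defaulting to zero.

We then execute three conditional statements: for each of the three intervals (which I just create by hand using the endpoints), we set the YR value to 1, 2 or 3.

This does have the desired effect: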

R> print(dt, topn=5, nrows=10)
           date YR
  1: 2003-06-11  1
  2: 2004-08-11  2
  3: 2004-06-03  2
  4: 2004-01-20  2
  5: 2005-02-25  3
 ---              
 96: 2002-08-07  0
 97: 2004-02-04  2
 98: 2006-04-10  0
 99: 2005-03-21  3
100: 2003-12-01  2
R> table(dt[, YR])

 0  1  2  3 
26 31 31 12 
R>

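One could also do this simply by computing date differences and truncating, but sometimes it is nice to be a little explicit.

Edit: A more generic form just uses arithmetic on the dates: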

R> dt[, YR2 := trunc(as.numeric(difftime(as.Date(date), 
+                                        as.Date("2001-09-01"),
+                                        unit="days"))/365.25)]
R> table(dt[, YR2])

 0  1  2  3  4  5  6  7  9 
 7 31 31 12  9  5  1  2  1 
R>

This does the job in one line.
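The 365.25-day division is an approximation: 2003-09-01 lies 730 days after 2001-09-01, and trunc(730 / 365.25) is 1 rather than 2, so a date sitting on a boundary can land in the neighbouring year. Below is a minimal base R sketch of an alternative that works from calendar fields instead. It assumes every year starts on 1 September, as in the question; the variable names and sample dates are illustrative only.

```r
# Illustrative dates around the interval boundaries, plus a missing value
dates <- as.Date(c("2002-08-31", "2002-09-01", "2003-08-31",
                   "2003-09-01", "2005-08-31", NA))

lt <- as.POSIXlt(dates)            # $year counts from 1900, $mon is 0-based
cal_year <- lt$year + 1900
after_sept_1 <- lt$mon >= 8        # TRUE from September (mon == 8) onwards

# Year 1 starts on 2002-09-01, year 2 on 2003-09-01, and so on;
# values below 1 or above 3 fall outside the three intervals in the question
YR <- cal_year - 2002 + after_sept_1
print(data.frame(dates, YR))
```

As with the arithmetic form above, dates outside the defined intervals still get a number (0, 4, 5, ...), so they would need to be set to NA or 0 separately if only years 1 to 3 are wanted.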

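【Comments】:

David gives a nice version using non-equi joins in data.table here - stackoverflow.com/a/41132376/496803 , as long as the intervals are specified in a data.table with start/stop/year columns.
What is pt? Thanks
If you're going to use data.table, you might as well use %between% ;-)
@MichaelChirico Quite right. I knew there was another operator (which I use too rarely), but I looked in the wrong place.
@MonicaHeddneck In my answer, pt is the vector from the structure you saved; despite what you say, it is actually not a data.frame (just a single POSIXct vector). I assigned it to pt and then formed a data.table from it.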
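The two suggestions in the comments above can be sketched roughly as follows. This is not the code from the linked answer; the `intervals` table, its column names (`start`, `end`, `YR`) and the sample dates are assumptions made for illustration.

```r
library(data.table)

# Illustrative dates; the last one falls outside every interval
dt <- data.table(date = as.IDate(c("2003-06-11", "2004-08-11",
                                   "2005-02-25", "2006-04-10")))

# Lookup table with one row per interval (hypothetical column names)
intervals <- data.table(
  start = as.IDate(c("2002-09-01", "2003-09-01", "2004-09-01")),
  end   = as.IDate(c("2003-08-31", "2004-08-31", "2005-08-31")),
  YR    = 1:3
)

# Non-equi update join: rows of dt with start <= date <= end receive that
# interval's YR; rows that match no interval keep NA
dt[intervals, on = .(date >= start, date <= end), YR := i.YR]

# %between% is inclusive on both ends; this flags dates in the first interval
dt[, in_year_1 := date %between% list(as.IDate("2002-09-01"),
                                      as.IDate("2003-08-31"))]
print(dt)
```

With 50 intervals, the join form only needs 50 rows in the lookup table rather than 50 separate conditional statements.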

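【Solution 2】:

You can use walk from the purrr package to achieve this:

df$YR <- NA   # create the column first so the indexed assignment has a target
purrr::walk(1:3, ~(df$YR[as.POSIXlt(df$DATE) %within% get(paste0("YEAR_", .))] <<- .))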

Or you could write a loop for readability (unless loops are taboo for you):

df$YR <- NA
for(i in 1:3){
  interval <- get(paste0("YEAR_", i))
  index <-which(as.POSIXlt(df$DATE) %within% interval)
  df$YR[index] <- i
}
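With 50 intervals, creating and looking up 50 separately named objects through get() becomes awkward. A possible variation, sketched under the assumption that every interval runs from 1 September to 31 August of the following year, keeps all intervals in a single lubridate Interval vector. The names `starts`, `ends` and `intervals` and the sample dates are illustrative, not from the answer.

```r
library(lubridate)

n <- 3                                   # would be 50 in the real data
starts <- ymd("2002-09-01") + years(0:(n - 1))
ends   <- starts + years(1) - days(1)    # 31 August of the following year
intervals <- interval(starts, ends)      # one Interval vector of length n

# Small stand-in for the data frame in the question
df <- data.frame(DATE = as.POSIXct(c("2003-06-11", "2004-08-11",
                                     "2006-04-10", NA), tz = "UTC"))

df$YR <- NA_integer_
for (i in seq_along(intervals)) {
  index <- which(df$DATE %within% intervals[i])   # which() drops NA dates
  df$YR[index] <- i
}
print(df)
```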

【Comments】:

This ran very fast on a vector with 2.2 million elements.

【Solution 3】:

Using lubridate and mapply:

library(lubridate)

dates <- # your data here

# no idea how you generated these, so let's just copy them
YEAR_1 <- interval(ymd('2002-09-01'), ymd('2003-08-31'))
YEAR_2 <- interval(ymd('2003-09-01'), ymd('2004-08-31')) 
YEAR_3 <- interval(ymd('2004-09-01'), ymd('2005-08-31'))

# this should scale nicely
sapply(c(YEAR_1, YEAR_2, YEAR_3), function(x) { mapply(`%within%`, dates, x) })

The result is a matrix with one column per interval:

        [,1]  [,2]  [,3]
  [1,]  TRUE FALSE FALSE
  [2,] FALSE  TRUE FALSE
  [3,] FALSE  TRUE FALSE
  [4,] FALSE  TRUE FALSE
  ... etc. (100 rows in your example data)

There may be a nicer way to code this using purrr, but I am too new to purrr to see it.
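The question asks for a single column rather than a matrix, so one more step is needed. A minimal base R sketch of collapsing such a logical matrix into an interval index follows; the small matrix `m` is a hand-written stand-in for the sapply() result, and `first_true` is a hypothetical helper name.

```r
# Stand-in for the logical matrix: one row per date, one column per interval
m <- matrix(c(TRUE,  FALSE, FALSE,
              FALSE, TRUE,  FALSE,
              FALSE, FALSE, FALSE,   # date outside every interval
              NA,    NA,    NA),     # missing date
            ncol = 3, byrow = TRUE)

# Column index of the first TRUE in a row; NA when no interval matched
first_true <- function(row) {
  hit <- which(row)                  # which() ignores NA entries
  if (length(hit) == 0L) NA_integer_ else hit[1L]
}

YR <- apply(m, 1, first_true)
print(YR)
```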

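【Comments】:

I like it, but it seems to run slowly and creates a rather large object.
Right, it works fine on 10,000 rows, but beyond that it is too slow. There must be a bottleneck somewhere.

【Solution 4】:

You could try something like this: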

df = as.data.frame(structure(c(1055289600, 1092182400, 1086220800, 1074556800, 1109289600, 
            1041897600, 1069200000, 1047427200, 1072656000, 1048636800, 1092873600, 
            1090195200, 1051574400, 1052179200, 1130371200, 1242777600, 1140652800, 
            1137974400, 1045526400, 1111104000, 1073952000, 1052870400, 1087948800, 
            1053993600, 1039564800, 1141603200, 1074038400, 1105315200, 1060560000, 
            1072051200, 1046217600, 1107129600, 1088553600, 1071619200, 1115596800, 
            1050364800, 1147046400, 1083628800, 1056412800, 1159747200, 1087257600, 
            1201478400, 1120521600, 1066176000, 1034553600, 1057622400, 1078876800, 
            1010880000, 1133913600, 1098230400, 1170806400, 1037318400, 1070409600, 
            1091577600, 1057708800, 1182556800, 1091059200, 1058227200, 1061337600, 
            1034121600, 1067644800, 1039478400, 1022198400, 1063065600, 1096329600, 
            1049760000, 1081728000, 1016150400, 1029801600, 1059350400, 1087257600, 
            1181692800, 1310947200, 1125446400, 1057104000, NA, 1085529600, 
            1037664000, 1091577600, 1080518400, 1110758400, 1092787200, 1094601600, 
            1169424000, 1232582400, 1058918400, 1021420800, 1133136000, 1030320000, 
            1060732800, 1035244800, 1090800000, 1129161600, 1055808000, 1060646400, 
            1028678400, 1075852800, 1144627200, 1111363200, 1070236800), class = c("POSIXct", 
                                                                                   "POSIXt"), tzone = "UTC"))

colnames(df)[1] = "dates"

YEAR_1_Start = as.Date('2002-09-01')
YEAR_1_End = as.Date('2003-08-31')

YEAR_2_Start = as.Date('2003-09-01')
YEAR_2_End = as.Date('2004-08-31')

YEAR_3_Start = as.Date('2004-09-01')
YEAR_3_End = as.Date('2005-08-31')


df$year = lapply(df$dates,FUN = function(x){
          x = as.Date(x)
          if(is.na(x)){
            return(NA)
          }else if(YEAR_1_Start <= x & x <= YEAR_1_End){
            return("YEAR_1")
          }else if(YEAR_2_Start <= x & x <= YEAR_2_End){
            return("YEAR_2")
          }else if(YEAR_3_Start <= x & x <= YEAR_3_End){
            return("YEAR_3")
          }else{
            return("Other")
          }
})

df
         dates   year
1   2003-06-11 YEAR_1
2   2004-08-11 YEAR_2
3   2004-06-03 YEAR_2
4   2004-01-20 YEAR_2
5   2005-02-25 YEAR_3
6   2003-01-07 YEAR_1
7   2003-11-19 YEAR_2
8   2003-03-12 YEAR_1
9   2003-12-29 YEAR_2
10  2003-03-26 YEAR_1
11  2004-08-19 YEAR_2
12  2004-07-19 YEAR_2
13  2003-04-29 YEAR_1
14  2003-05-06 YEAR_1
15  2005-10-27  Other
16  2009-05-20  Other
17  2006-02-23  Other
18  2006-01-23  Other
19  2003-02-18 YEAR_1
20  2005-03-18 YEAR_3
21  2004-01-13 YEAR_2
22  2003-05-14 YEAR_1
23  2004-06-23 YEAR_2
24  2003-05-27 YEAR_1
25  2002-12-11 YEAR_1
26  2006-03-06  Other
27  2004-01-14 YEAR_2
28  2005-01-10 YEAR_3
29  2003-08-11 YEAR_1
30  2003-12-22 YEAR_2
31  2003-02-26 YEAR_1
32  2005-01-31 YEAR_3
33  2004-06-30 YEAR_2
34  2003-12-17 YEAR_2
35  2005-05-09 YEAR_3
36  2003-04-15 YEAR_1
37  2006-05-08  Other
38  2004-05-04 YEAR_2
39  2003-06-24 YEAR_1
40  2006-10-02  Other
41  2004-06-15 YEAR_2
42  2008-01-28  Other
43  2005-07-05 YEAR_3
44  2003-10-15 YEAR_2
45  2002-10-14 YEAR_1
46  2003-07-08 YEAR_1
47  2004-03-10 YEAR_2
48  2002-01-13  Other
49  2005-12-07  Other
50  2004-10-20 YEAR_3
51  2007-02-07  Other
52  2002-11-15 YEAR_1
53  2003-12-03 YEAR_2
54  2004-08-04 YEAR_2
55  2003-07-09 YEAR_1
56  2007-06-23  Other
57  2004-07-29 YEAR_2
58  2003-07-15 YEAR_1
59  2003-08-20 YEAR_1
60  2002-10-09 YEAR_1
61  2003-11-01 YEAR_2
62  2002-12-10 YEAR_1
63  2002-05-24  Other
64  2003-09-09 YEAR_2
65  2004-09-28 YEAR_3
66  2003-04-08 YEAR_1
67  2004-04-12 YEAR_2
68  2002-03-15  Other
69  2002-08-20  Other
70  2003-07-28 YEAR_1
71  2004-06-15 YEAR_2
72  2007-06-13  Other
73  2011-07-18  Other
74  2005-08-31 YEAR_3
75  2003-07-02 YEAR_1
76        <NA>     NA
77  2004-05-26 YEAR_2
78  2002-11-19 YEAR_1
79  2004-08-04 YEAR_2
80  2004-03-29 YEAR_2
81  2005-03-14 YEAR_3
82  2004-08-18 YEAR_2
83  2004-09-08 YEAR_3
84  2007-01-22  Other
85  2009-01-22  Other
86  2003-07-23 YEAR_1
87  2002-05-15  Other
88  2005-11-28  Other
89  2002-08-26  Other
90  2003-08-13 YEAR_1
91  2002-10-22 YEAR_1
92  2004-07-26 YEAR_2
93  2005-10-13  Other
94  2003-06-17 YEAR_1
95  2003-08-12 YEAR_1
96  2002-08-07  Other
97  2004-02-04 YEAR_2
98  2006-04-10  Other
99  2005-03-21 YEAR_3
100 2003-12-01 YEAR_2

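Edit:

If you can put the intervals into a data.frame or data.table, we can easily change the lapply to handle this:

df$year = lapply(df$dates, FUN = function(x){
  x = as.Date(x)
  if(is.na(x)){
    return(NA)
  }
  # df.intervals: one row per interval, with Date columns "Start" and "End"
  for(i in seq_len(nrow(df.intervals))){
    if(df.intervals[i,"Start"] <= x & x <= df.intervals[i,"End"]){
      return(paste0("YEAR_", i))
    }
  }
  return("Other")   # no interval matched, as in the first version
})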
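Calling a function once per date gets slow on large inputs. If the intervals are sorted by start date and do not overlap (an assumption, though it holds for the three in the question), a vectorised lookup with base R's findInterval() is one possible alternative. The sketch below uses a small illustrative `dates` vector and builds `df.intervals` with the `Start` and `End` columns named above.

```r
df.intervals <- data.frame(
  Start = as.Date(c("2002-09-01", "2003-09-01", "2004-09-01")),
  End   = as.Date(c("2003-08-31", "2004-08-31", "2005-08-31"))
)
dates <- as.Date(c("2003-06-11", "2004-08-11", "2005-02-25",
                   "2006-04-10", "2002-08-07", NA))

# For each date: index of the last Start that is <= date (0 if none, NA if missing)
idx <- findInterval(as.numeric(dates), as.numeric(df.intervals$Start))

year <- rep(NA_character_, length(dates))
year[!is.na(dates)] <- "Other"                    # default for non-missing dates

ok <- which(!is.na(idx) & idx > 0)                # candidates with a Start before them
ok <- ok[dates[ok] <= df.intervals$End[idx[ok]]]  # keep those not past the matching End
year[ok] <- paste0("YEAR_", idx[ok])
print(data.frame(dates, year))
```

This also yields a plain character vector, whereas assigning the result of lapply() to a column creates a list column.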

【Comments】:

【Solution 5】:

Here is my take on all this. I like to keep it tidy ;)

> ## load libraries
> library(tidyverse)
> library(lubridate)
> 
> ## define times
> times <- c(1055289600, 1092182400, 1086220800, 1074556800, 1109289600, 
+            1041897600, 1069200000, 1047427200, 1072656000, 1048636800, 1092873600, 
+            1090195200, 1051574400, 1052179200, 1130371200, 1242777600, 1140652800, 
+            1137974400, 1045526400, 1111104000, 1073952000, 1052870400, 1087948800, 
+            1053993600, 1039564800, 1141603200, 1074038400, 1105315200, 1060560000, 
+            1072051200, 1046217600, 1107129600, 1088553600, 1071619200, 1115596800, 
+            1050364800, 1147046400, 1083628800, 1056412800, 1159747200, 1087257600, 
+            1201478400, 1120521600, 1066176000, 1034553600, 1057622400, 1078876800, 
+            1010880000, 1133913600, 1098230400, 1170806400, 1037318400, 1070409600, 
+            1091577600, 1057708800, 1182556800, 1091059200, 1058227200, 1061337600, 
+            1034121600, 1067644800, 1039478400, 1022198400, 1063065600, 1096329600, 
+            1049760000, 1081728000, 1016150400, 1029801600, 1059350400, 1087257600, 
+            1181692800, 1310947200, 1125446400, 1057104000, NA, 1085529600, 
+            1037664000, 1091577600, 1080518400, 1110758400, 1092787200, 1094601600, 
+            1169424000, 1232582400, 1058918400, 1021420800, 1133136000, 1030320000, 
+            1060732800, 1035244800, 1090800000, 1129161600, 1055808000, 1060646400, 
+            1028678400, 1075852800, 1144627200, 1111363200, 1070236800)
> times <- tibble(time = as.POSIXct(times, origin = "1970-01-01", tz = "UTC")) %>% 
+   mutate(time = as_date(time),
+          duplicated = duplicated(time)) ## there are duplicated times!
> 
> 
> ## define years
> year <- c("YEAR_1", "YEAR_2", "YEAR_3")
> interval <- c(interval(ymd("2002-09-01", tz = "UTC"), ymd("2003-08-31", tz = "UTC")),
+               interval(ymd("2003-09-01", tz = "UTC"), ymd("2004-08-31", tz = "UTC")),
+               interval(ymd("2004-09-01", tz = "UTC"), ymd("2005-08-31", tz = "UTC")))
> years <- tibble(year, interval)
> 
> ## check data
> times
# A tibble: 100 x 2
   time       duplicated
   <date>     <lgl>     
 1 2003-06-11 FALSE     
 2 2004-08-11 FALSE     
 3 2004-06-03 FALSE     
 4 2004-01-20 FALSE     
 5 2005-02-25 FALSE     
 6 2003-01-07 FALSE     
 7 2003-11-19 FALSE     
 8 2003-03-12 FALSE     
 9 2003-12-29 FALSE     
10 2003-03-26 FALSE     
# ... with 90 more rows
> years
# A tibble: 3 x 2
  year   interval                      
  <chr>  <S4: Interval>                
1 YEAR_1 2002-09-01 UTC--2003-08-31 UTC
2 YEAR_2 2003-09-01 UTC--2004-08-31 UTC
3 YEAR_3 2004-09-01 UTC--2005-08-31 UTC
> 
> ## create new indicator variable
> ##
> ## join datasets (length = 3 x 100)
> ## indicator for year
> ## drop NAs
> ## keep "time" and "active"
> ## join with times to get back at full dataset
> ## as duplications, keep only one of them
> crossing(times, years) %>% 
+   mutate(active = if_else(time %within% interval, year, NA_character_)) %>% 
+   drop_na(active) %>% 
+   select(time, active) %>% 
+   right_join(times, by = "time") %>% 
+   distinct() %>% 
+   select(-duplicated)
# A tibble: 100 x 2
   time       active
   <date>     <chr> 
 1 2003-06-11 YEAR_1
 2 2004-08-11 YEAR_2
 3 2004-06-03 YEAR_2
 4 2004-01-20 YEAR_2
 5 2005-02-25 YEAR_3
 6 2003-01-07 YEAR_1
 7 2003-11-19 YEAR_2
 8 2003-03-12 YEAR_1
 9 2003-12-29 YEAR_2
10 2003-03-26 YEAR_1
# ... with 90 more rows

【Comments】:

【Solution 6】:

We can:

1st: create a data.table containing all the YEAR_N intervals

> interval.dt <- data.table(Interval = c(YEAR_1, YEAR_2, YEAR_3))
> interval.dt
#                         Interval
#1: 2002-09-01 UTC--2003-08-31 UTC
#2: 2003-09-01 UTC--2004-08-31 UTC
#3: 2004-09-01 UTC--2005-08-31 UTC

2nd: define a function that uses int_start(interval.dt$Interval) < year < int_end(interval.dt$Interval) to return the row index of interval.dt whose interval contains a given date

>  findYearIndex <- function(year) {
      interval.dt[,which(int_start(interval.dt$Interval) < year & year < int_end(interval.dt$Interval))]
      }

3rd: sapply findYearIndex over each element of the date column in the data.table

> dt <- data.table(year = df)
> dt$YearIndex <- paste("YEAR", sapply(dt$year, findYearIndex), sep = "_")

> dt
  #         year       YearIndex
  #1: 2003-06-11          YEAR_1
  #2: 2004-08-11          YEAR_2
  #3: 2004-06-03          YEAR_2
  #4: 2004-01-20          YEAR_2
  #5: 2005-02-25          YEAR_3
  #6: 2003-01-07          YEAR_1
  #7: 2003-11-19          YEAR_2
  #8: 2003-03-12          YEAR_1
  #9: 2003-12-29          YEAR_2
 #10: 2003-03-26          YEAR_1
 #11: 2004-08-19          YEAR_2
 #12: 2004-07-19          YEAR_2
 #13: 2003-04-29          YEAR_1
 #14: 2003-05-06          YEAR_1
 #15: 2005-10-27 YEAR_integer(0)
 #ignore the rest of dt
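Row 15 above shows what happens when no interval matches: which() returns integer(0), and paste() turns that into the label "YEAR_integer(0)". The strict < comparisons also exclude a date that sits exactly on an interval's start or end. A minimal base R sketch of a variant that returns NA for unmatched dates and uses inclusive bounds is given below; `starts`, `ends` and `findYearLabel` are hypothetical names, not part of the answer.

```r
starts <- as.Date(c("2002-09-01", "2003-09-01", "2004-09-01"))
ends   <- as.Date(c("2003-08-31", "2004-08-31", "2005-08-31"))

findYearLabel <- function(d) {
  # Inclusive bounds; which() gives integer(0) if nothing matches or d is NA
  i <- which(starts <= d & d <= ends)
  if (length(i) == 0L) NA_character_ else paste("YEAR", i[1L], sep = "_")
}

# Illustrative dates: inside, outside, exactly on an end point, and missing
dates <- as.Date(c("2003-06-11", "2005-10-27", "2003-08-31", NA))
labels <- vapply(dates, findYearLabel, character(1))
print(labels)
```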

【Comments】: