R xts - Resampling unequal time step xts to equidistant time series

【Question title】: R xts - Resampling unequal time step xts to equidistant time series
【Posted】: 2016-08-05 08:13:16
【Question】:

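I am working with xts time series in R.

What I have: a time series dataset with unequally spaced time steps.

What I want: a time series with equally spaced time steps, whose values are the portions of the original values that overlap each new time step (see the example below).

Example: given an original series like this: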

library(xts)  # also attaches zoo, which provides read.zoo()
sample_xts <- as.xts(read.zoo(text='
2016-07-01 00:00:20,   0.0
2016-07-01 00:01:20,  60.0
2016-07-01 00:01:50,  30.0
2016-07-01 00:02:30,  40.0
2016-07-01 00:04:20, 110.0
2016-07-01 00:05:30, 140.0
2016-07-01 00:06:00,  97.0
2016-07-01 00:07:12, 144.0
2016-07-01 00:08:09,   0.0
', sep=',', index=1, tz='', format="%Y-%m-%d %H:%M:%S"))
names(sample_xts) <- c('x')

I would like to get an equidistant time series like this:

                         x
2016-07-01 00:00:00,   0.0
2016-07-01 00:01:00,  40.0
2016-07-01 00:02:00,  60.0
2016-07-01 00:03:00,  60.0
2016-07-01 00:04:00,  60.0
2016-07-01 00:05:00, 100.0
2016-07-01 00:06:00, 157.0
2016-07-01 00:07:00, 120.0
2016-07-01 00:08:00,  24.0
2016-07-01 00:09:00,   0.0

Notes:

Some of the original time steps are shorter than the new time step, while others are longer.
The colSums of x stays unchanged (i.e. 621).
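
To make the proportional allocation concrete, here is a small sketch of the arithmetic behind the 00:02:00 row of the desired output. It assumes, as the example values imply, that each original value covers the span between the previous timestamp and its own timestamp, and that each output row is labelled by the end of its minute. The variable names are illustrative only.

```r
# Contributions to the output minute (00:01:00, 00:02:00].
# Each source value is spread evenly over (previous timestamp, own timestamp].
part_a <- 60 * 20 / 60   # value 60 covers 00:00:20-00:01:20; 20 s of its 60 s fall in this minute
part_b <- 30 * 30 / 30   # value 30 covers 00:01:20-00:01:50; entirely inside this minute
part_c <- 40 * 10 / 40   # value 40 covers 00:01:50-00:02:30; 10 s of its 40 s fall in this minute
minute_total <- part_a + part_b + part_c
print(minute_total)      # 60

# The totals of the original and the resampled series agree.
original  <- c(0, 60, 30, 40, 110, 140, 97, 144, 0)
resampled <- c(0, 40, 60, 60, 60, 100, 157, 120, 24, 0)
print(c(sum(original), sum(resampled)))   # 621 621
```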

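Here is the sketch I used to create the example above (it may help illustrate what I am trying to do):

[Image: sketch illustrating the example]

I would like an approach that is not limited to creating a series with 1-minute time steps, but works generally for any fixed time step.

I have looked at many existing Q&As and tried many different approaches, but without success.

Any help would be greatly appreciated! Thanks.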

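【Comments】:

Tags: r xts

【Solution 1】:

Here is some code I wrote using zoo. I have not used xts much, so I do not know whether the same functions can be applied. Hope it helps!

Functions

The following function calculates the fraction of each interval of the original data that overlaps with a given interval. (Note: in all of the code below, the variable names ta1 and ta2 refer to the start and end of a given interval, i.e. each of the equal intervals you want as output, while tb1 and tb2 refer to the start and end of the (unequal) intervals of the original data.)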

frac.overlap <- function(ta1, ta2, tb1, tb2) {
  # Interval 1 is [ta1, ta2] (output interval); interval 2 is [tb1, tb2] (input interval)
  if (tb1 <= ta1 && tb2 >= ta2) {          # Interval 2 starts earlier and ends later than interval 1
    frac <- as.numeric(difftime(ta2, ta1, units = "secs")) / as.numeric(difftime(tb2, tb1, units = "secs"))
  } else if (tb1 >= ta1 && tb2 <= ta2) {   # Interval 2 is fully contained within interval 1
    frac <- 1
  } else if (tb1 <= ta1 && tb2 >= ta1) {   # Interval 2 partly overlaps with interval 1 (starts earlier, ends earlier)
    frac <- as.numeric(difftime(tb2, ta1, units = "secs")) / as.numeric(difftime(tb2, tb1, units = "secs"))
  } else if (tb1 <= ta2 && tb2 >= ta2) {   # Interval 2 partly overlaps with interval 1 (starts later, ends later)
    frac <- as.numeric(difftime(ta2, tb1, units = "secs")) / as.numeric(difftime(tb2, tb1, units = "secs"))
  } else {                                 # No overlap
    frac <- 0
  }

  return(frac)
}
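
For comparison, the same overlap fraction can be written without branching, as the length of the intersection of the two intervals divided by the length of the original interval. This is only a sketch under the same assumptions as frac.overlap (ta1 <= ta2 and tb1 < tb2); the name overlap_fraction is made up for this illustration and is not part of the answer's code.

```r
# Hypothetical branch-free variant of frac.overlap (not from the original answer).
# Works on POSIXct or numeric inputs and is vectorised over tb1/tb2.
overlap_fraction <- function(ta1, ta2, tb1, tb2) {
  a1 <- as.numeric(ta1); a2 <- as.numeric(ta2)
  b1 <- as.numeric(tb1); b2 <- as.numeric(tb2)
  # Length of the intersection of [a1, a2] and [b1, b2], never negative
  shared <- pmax(0, pmin(a2, b2) - pmax(a1, b1))
  shared / (b2 - b1)
}

t0 <- as.POSIXct("2016-07-01 00:00:00", tz = "UTC")
# Output minute 00:01:00-00:02:00 against the input interval 00:00:20-00:01:20
print(overlap_fraction(t0 + 60, t0 + 120, t0 + 20, t0 + 80))   # 20 s of 60 s, i.e. 1/3
```

Because pmin() and pmax() are vectorised, tb1 and tb2 can be whole vectors of interval starts and ends, which avoids calling the function once per record.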

The next function determines which records of the original dataset overlap with the interval ta1-ta2 currently under consideration:

check.overlap <- function(ta1, ta2, tb1, tb2) {
  ov <- vector("logical", 4)
  ov[1] <- (tb1 <= ta1 & tb2 >= ta2)  # Interval 2 starts earlier and ends later than interval 1
  ov[2] <- (tb1 >= ta1 & tb2 <= ta2)  # Interval 2 is fully contained within interval 1
  ov[3] <- (tb1 <= ta1 & tb2 >= ta1)  # Interval 2 partly overlaps with interval 1 (starts earlier, ends earlier)
  ov[4] <- (tb1 <= ta2 & tb2 >= ta2)  # Interval 2 partly overlaps with interval 1 (starts later, ends later)
  return(as.logical(sum(ov)))         # Gives TRUE if at least one element of ov is TRUE, otherwise FALSE
}

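(Note: this works for the sample data you provided, but on larger datasets I found it to be very slow. Since I wrote this code to resample time series with regular time steps, I usually do this step with a fixed time window, which is noticeably faster. It should be easy to modify the code (see the code of the next function) to speed this step up based on the time intervals of the original data.)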
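
One way to narrow down the candidate records, in the spirit of the note above, is to exploit the fact that the timestamps are sorted. The sketch below uses base R's findInterval() to locate the first and last record whose interval can touch the window. The helper name candidate_records and the use of plain numeric seconds are assumptions for illustration, not part of the answer's code; POSIXct values would first need as.numeric().

```r
# Hypothetical helper (not from the original answer): indices of the records
# whose interval [starts[k], starts[k + 1]] can overlap the window [ta1, ta2].
# `starts` must be sorted in increasing order; the last record has no end
# and is therefore never returned.
candidate_records <- function(starts, ta1, ta2) {
  lo <- max(1, findInterval(ta1, starts))                    # last record starting at or before ta1
  hi <- min(length(starts) - 1, findInterval(ta2, starts))   # last record starting at or before ta2
  if (lo > hi) integer(0) else lo:hi
}

# Record start times of the sample data, in seconds after 00:00:00
starts <- c(20, 80, 110, 150, 260, 330, 360, 432, 489)
print(candidate_records(starts, 60, 120))    # 1 2 3
print(candidate_records(starts, 540, 600))   # integer(0)
```

A record that starts exactly at ta2 may be included even though it contributes nothing; that is harmless because its overlap fraction is 0.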

The next function uses the previous two to calculate the resampled value for the interval ta1-ta2:

library(zoo)  # provides index(), coredata() and lag() for zoo objects

fracres <- function(tstart, interval, input) {
  # tstart: POSIX object
  # interval: length of interval in seconds
  # input: zoo object

  ta1 <- tstart
  ta2 <- tstart + interval

  # First, determine which records of the original data (input) overlap with the current
  # interval, to avoid going through the whole object at every iteration
  ind <- index(input)            # start of each input interval
  ind1 <- index(lag(input, -1))  # end of each input interval (one element shorter than ind)
  recs <- which(sapply(seq_along(ind), function(x) check.overlap(ta1, ta2, ind[x], ind1[x])))
  #recs <- which(abs(as.numeric(difftime(ind,ta1,units="secs"))) < 601)

  # For each record overlapping with the current interval, return the fraction of the
  # input data interval contained in the current interval
  if (length(recs) > 0) {
    fracs <- sapply(seq_along(recs), function(x) frac.overlap(ta1, ta2, ind[recs[x]], ind1[recs[x]]))
    return(sum(coredata(input)[recs] * fracs))
  } else {
    return(0)
  }
}

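(The commented-out line shows how to get the relevant records if the maximum time difference between the original and the new time steps is known.)

Application

First, let's read in your sample data as a zoo object: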

sample_zoo <- read.zoo(text='
2016-07-01 00:00:20,   0.0
2016-07-01 00:01:20,  60.0
2016-07-01 00:01:50,  30.0
2016-07-01 00:02:30,  40.0
2016-07-01 00:04:20, 110.0
2016-07-01 00:05:30, 140.0
2016-07-01 00:06:00,  97.0
2016-07-01 00:07:12, 144.0
2016-07-01 00:08:09,   0.0
', sep=',', index=1, tz='', format="%Y-%m-%d %H:%M:%S")

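Your dataset appears to contain instantaneous values ("at 01:20, the value of x is 60"). Since I wrote this code for summed values, the timestamps have a different meaning ("the record starting at 01:20 has the value 60"). To correct for this, the records need to be shifted: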

sample_zoo <- lag(sample_zoo,1)
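
To see what this shift does, here is a small sketch on the first three records. It relies on zoo's convention that a positive k in lag() moves values to earlier timestamps and, with the default na.pad = FALSE, drops the timestamp that is left without a value. The object names are illustrative.

```r
library(zoo)

t0 <- as.POSIXct("2016-07-01 00:00:00", tz = "UTC")
# Values stamped at 00:00:20, 00:01:20 and 00:01:50
z <- zoo(c(0, 60, 30), order.by = t0 + c(20, 80, 110))

shifted <- lag(z, 1)
# The value 60 now sits at 00:00:20, i.e. at the start of the interval it covers,
# and the last timestamp (00:01:50) has been dropped.
print(shifted)
```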

Then we define a sequence of POSIXct objects corresponding to the desired resolution:

time.out <- seq.POSIXt(from=as.POSIXct("2016-07-01"),to=(as.POSIXct("2016-07-01")+(60*9)),by="1 min")
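
Since the question asks for an approach that works for any fixed time step, the same kind of sequence can be built for other resolutions. A minimal sketch using base R's seq() on POSIXct values, where a numeric by is interpreted as seconds; step_secs, time.out.30s and the UTC time zone are arbitrary choices for this illustration:

```r
step_secs <- 30                                      # desired fixed time step, in seconds
from <- as.POSIXct("2016-07-01 00:00:00", tz = "UTC")
to   <- from + 60 * 9                                # same 9-minute span as above

time.out.30s <- seq(from, to, by = step_secs)
print(length(time.out.30s))                          # 19 breakpoints, 00:00:00 to 00:09:00
```

The interval argument passed to fracres would then have to be the same number of seconds.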

Then we can apply the function fracres described above:

data.out <- sapply(1:length(time.out), function(x) fracres(tstart=time.out[x],interval=60,input=sample_zoo))

The index and the data are combined into a zoo object:

zoo.out <- read.zoo(data.frame(time.out,data.out))

Finally, the time series is shifted by one step again, in the opposite direction to before:

zoo.out <- lag(zoo.out,-1)

2016-07-01 00:01:00 2016-07-01 00:02:00 2016-07-01 00:03:00 2016-07-01 00:04:00 2016-07-01 00:05:00 2016-07-01 00:06:00 2016-07-01 00:07:00 2016-07-01 00:08:00 2016-07-01 00:09:00 
             40                  60                  60                  60                 100                 157                 120                  24                   0

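【Comments】:

Thanks @m.chips! I finally got around to trying this on my real series. It works well, but yes, as you pointed out, it is "exceptionally slow" even for fairly short series. Execution time seems to grow disproportionately with the length of the series, exponentially or like 2^N. My series have between 300 000 and 1 million points. Based on observations inspired by your algorithm, I decided to try a different approach. I am posting it below as an answer to the question.

【Solution 2】:

I finally decided to go the "while-loop way" and created the solution below. It works well. It is not super fast, but execution time appears to be proportional to the length of the time series. I tested it with the small example I posted in the question, as well as with a source time series of 330 000 observations and a destination series of about 110 000 time steps.

Both the source and the destination series can have irregular time steps. The sum of the resulting series is the same as the sum of the source series.

Performance: the speed is OK, but I am sure it could be faster. I guess it is an obvious candidate for an Rcpp version, which should be noticeably faster for long series. For now this works for me; if/when I get around to creating an Rcpp version, I will post it here.

If you have any suggestions for performance improvements, please post them!

Thanks!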

# In all helpers, source step i spans src_index[i-1]..src_index[i] and
# destination step j spans dest_index[j-1]..dest_index[j].

sameEndTime <- function(i, j, src_index, dest_index) {
  src_index[i] == dest_index[j]
}

wholeSourceStepIsWithinDestStep <- function(i, j, src_index, dest_index) {
  dest_index[j-1] <= src_index[i-1] && src_index[i] <= dest_index[j]
}

wholeDestStepIsWithinSourceStep <- function(i, j, src_index, dest_index) {
  src_index[i-1] <= dest_index[j-1] && dest_index[j] <= src_index[i]
}

onlyEndOfSourceStepIsWithinDestStep <- function(i, j, src_index, dest_index) {
  src_index[i-1] < dest_index[j-1] && src_index[i] < dest_index[j] && src_index[i] > dest_index[j-1]
}

onlyStartOfSourceStepIsWithinDestStep <- function(i, j, src_index, dest_index) {
  src_index[i-1] < dest_index[j] && src_index[i-1] > dest_index[j-1] && src_index[i] > dest_index[j]
}

resampleToDestTimeSteps <- function(src, dest){
  # src and dest are both xts with only one time series each
  # src is the original series and 
  # dest holds the time steps of the final series
  #
  # NB: there is an issue with the very first time step 
  # (gets ignored in this version)
  #
  original_names <- names(src)
  names(src) <- c("value")
  names(dest) <- c("value")
  dest$value <- dest$value*0.0
  dest$value[is.na(dest$value)] <- 0.0

  dest[1]$value = 0.0

  for(k in 2:length(src)){
    src[k]$value <- src[k]$value/as.numeric(difftime(index(src[k]),index(src[k-1]),units="secs"))
  }
  # First value is NA due to lag at this point (we don't want that)
  src$value[1] = 0.0

  i = 2 # source timestep counter
  j = 2 # destination timestep counter

  src_index = index(src)
  dest_index = index(dest)

  src_length = length(src)
  dest_length = length(dest)

  # Make sure we start with an overlap
  if(src_index[2] < dest_index[1]){
    while(src_index[i] < dest_index[1]){
      i = i + 1
    }
  } else if(dest_index[2] < src_index[1]){
    while(dest_index[j] < src_index[1]){
      j = j + 1
    }
  }

  while(i <= src_length & j <= dest_length){
    if( wholeSourceStepIsWithinDestStep(i,j,src_index,dest_index) ){
      dest[j]$value = dest[j]$value + as.numeric(src[i]$value)*as.numeric(difftime(src_index[i],src_index[i-1],units="secs"))
      if(sameEndTime(i,j,src_index,dest_index)){
        j = j+1
      }
      i = i+1
    } else if( wholeDestStepIsWithinSourceStep(i,j,src_index,dest_index) ){
      dest[j]$value = dest[j]$value + as.numeric(src[i]$value)*as.numeric(difftime(dest_index[j],dest_index[j-1],units="secs"))
      if(sameEndTime(i,j,src_index,dest_index)){
        i = i+1
      }
      j = j+1
    } else if( onlyEndOfSourceStepIsWithinDestStep(i,j,src_index,dest_index) ){
      dest[j]$value = dest[j]$value + as.numeric(src[i]$value)*as.numeric(difftime(src_index[i],dest_index[j-1],units="secs"))
      i = i+1
    } else if( onlyStartOfSourceStepIsWithinDestStep(i,j,src_index,dest_index) ){
      diff_time = difftime(dest_index[j],src_index[i-1],units="secs")
      dest[j]$value = dest[j]$value + as.numeric(src[i]$value)*as.numeric(diff_time)
      j = j+1
    } else {
      print("======================================================")
      print(paste0("i=",i,", j=",j))
      print(paste0("src_index[i]   =",src_index[i]))
      print(paste0("dest_index[j]  =",dest_index[j]))
      print(" ")
      print(paste0("src_index[i-1] =",src_index[i-1]))
      print(paste0("dest_index[j-1]=",dest_index[j-1]))
      print("======================================================")
      stop("This should never happen.")
    }
  }
  names(dest) <- original_names
  return(dest)
}
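
In response to the request for performance suggestions, one loop-free alternative is sketched below. It is not the method used in this answer: instead of walking through both series, it builds the running total of the source series, linearly interpolates that total at the destination timestamps with base R's approx(), and takes differences. Like the function above, it assumes each value is spread evenly over the span since the previous timestamp, and it ignores the first source value because its span is unknown. The function name resample_by_cumsum and its plain-vector interface are assumptions made for this illustration.

```r
# Hypothetical alternative (not the answer's method): resample via the running total.
# src_time, dest_time: increasing POSIXct (or numeric) vectors; src_value: numeric vector.
resample_by_cumsum <- function(src_time, src_value, dest_time) {
  t_src  <- as.numeric(src_time)
  t_dest <- as.numeric(dest_time)
  # Running total at each source timestamp; the first value is skipped (unknown span)
  cum_src <- cumsum(c(0, src_value[-1]))
  # Running total at each destination timestamp; rule = 2 holds it constant outside the source range
  cum_dest <- approx(x = t_src, y = cum_src, xout = t_dest, rule = 2)$y
  # Amount accumulated within each destination step; the first step has no predecessor
  c(0, diff(cum_dest))
}

t0 <- as.POSIXct("2016-07-01 00:00:00", tz = "UTC")
src_time  <- t0 + c(20, 80, 110, 150, 260, 330, 360, 432, 489)
src_value <- c(0, 60, 30, 40, 110, 140, 97, 144, 0)
dest_time <- t0 + seq(0, 540, by = 60)

print(resample_by_cumsum(src_time, src_value, dest_time))
```

On the sample data this should reproduce the desired series from the question (0, 40, 60, 60, 60, 100, 157, 120, 24, 0). Because cumsum() and approx() are vectorised, the cost should grow roughly linearly with the lengths of the two series, although that has not been benchmarked against the loop version here. The total is only preserved when the destination timestamps span the whole source series.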

【Comments】: