R: Decrease frequency of time series data by aggregating values in OHLC series

【Question title】: R: Decrease frequency of time series data by aggregating values in OHLC series
【Posted】: 2017-12-18 11:13:52
【Question】:

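I have a high-frequency dataset of FX rates, with timestamps down to the millisecond, and I want to convert it in R to a lower-frequency, regular time series, e.g. a 1-minute or 5-minute OHLC series (open, high, low, close). The original dataset has four columns: one for the currency pair, one for the timestamp (date and time), and one each for the bid and ask prices. The data was imported from a .csv file.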

{head(GBPUSD)} and {tail(GBPUSD)} return the following:

# A tibble: 6 x 4
       X1                  X2      X3      X4
    <chr>              <dttm>   <dbl>   <dbl>  
1 GBP/USD 2017-06-01 00:00:00 1.28756 1.28763  
2 GBP/USD 2017-06-01 00:00:00 1.28754 1.28760  
3 GBP/USD 2017-06-01 00:00:00 1.28754 1.28759  
4 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759  
5 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759  
6 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759


# A tibble: 6 x 4
       X1                  X2      X3      X4
    <chr>              <dttm>   <dbl>   <dbl>
1 GBP/USD 2017-06-30 20:59:56 1.30093 1.30300  
2 GBP/USD 2017-06-30 20:59:56 1.30121 1.30300  
3 GBP/USD 2017-06-30 20:59:56 1.30100 1.30390  
4 GBP/USD 2017-06-30 20:59:56 1.30146 1.30452  
5 GBP/USD 2017-06-30 20:59:56 1.30145 1.30447  
6 GBP/USD 2017-06-30 20:59:56 1.30145 1.30447

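【Comments】:

It would be useful if you included head(yourdata) and tail(yourdata). Also, imgur.com isn't working. You can use any other storage.
Thanks, please find the head below (there is not enough room for the tail). The data was imported directly from the .csv file: # A tibble: 6 x 4 X1 X2 X3 X4 1 GBP/USD 2017-06-01 00:00:00 1.28756 1.28763 2 GBP/USD 2017-06-01 00:00:00 1.28754 1.28760 3 GBP/USD 2017-06-01 00:00:00 1.28754 1.28759 4 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759 5 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759 6 GBP/USD 2017-06-01 00:00:00 1.28753 1.28759
Please edit your question accordingly; not in the comments section but in the original question. Also, please use the "{}" code notation to present your data clearly.
Possible duplicate of Rounding time to nearest quarter hour
@LenGreski: I think he is more concerned with the aggregation part (high, low, open, close).

Tags: r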

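【Solution 1】:

It looks like you want to turn each column (bid, ask) into 4 columns (open, high, low, close), grouped by a time interval such as 5 minutes. I appreciate @dmi3kno showing off some tibbletime functionality, but I think this may do more of what you want.

~~Note that this will change somewhat in the next version of tibbletime, but for now it works under 0.0.2.~~

For each 5-minute period, the open/high/low/close values of the bid and ask columns are taken.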

library(tibbletime)
library(dplyr)

df <- create_series("2017-12-20 00:00:00" ~ "2017-12-20 01:00:00", "sec") %>% 
  mutate(bid = runif(nrow(.)),
         ask = bid + .0001)
df
#> # A time tibble: 3,601 x 3
#> # Index: date
#>    date                   bid    ask
#>  * <dttm>               <dbl>  <dbl>
#>  1 2017-12-20 00:00:00 0.208  0.208 
#>  2 2017-12-20 00:00:01 0.0629 0.0630
#>  3 2017-12-20 00:00:02 0.505  0.505 
#>  4 2017-12-20 00:00:03 0.0841 0.0842
#>  5 2017-12-20 00:00:04 0.986  0.987 
#>  6 2017-12-20 00:00:05 0.225  0.225 
#>  7 2017-12-20 00:00:06 0.536  0.536 
#>  8 2017-12-20 00:00:07 0.767  0.767 
#>  9 2017-12-20 00:00:08 0.994  0.994 
#> 10 2017-12-20 00:00:09 0.807  0.808 
#> # ... with 3,591 more rows

df %>%
  mutate(date = collapse_index(date, "5 min")) %>%
  group_by(date) %>%
  summarise_all(
    .funs = funs(
      open  = dplyr::first(.),
      high  = max(.),
      low   = min(.),
      close = dplyr::last(.)
    )
  )
#> # A time tibble: 13 x 9
#> # Index: date
#>    date                bid_o… ask_o… bid_h… ask_h…  bid_low ask_low bid_c…
#>  * <dttm>               <dbl>  <dbl>  <dbl>  <dbl>    <dbl>   <dbl>  <dbl>
#>  1 2017-12-20 00:04:59  0.208  0.208  1.000  1.000 0.00293  3.03e⁻³ 0.389 
#>  2 2017-12-20 00:09:59  0.772  0.772  0.997  0.997 0.000115 2.15e⁻⁴ 0.676 
#>  3 2017-12-20 00:14:59  0.457  0.457  0.995  0.996 0.00522  5.32e⁻³ 0.363 
#>  4 2017-12-20 00:19:59  0.586  0.586  0.997  0.997 0.00912  9.22e⁻³ 0.0339
#>  5 2017-12-20 00:24:59  0.385  0.385  0.998  0.998 0.0131   1.32e⁻² 0.0907
#>  6 2017-12-20 00:29:59  0.548  0.548  0.996  0.996 0.00126  1.36e⁻³ 0.320 
#>  7 2017-12-20 00:34:59  0.240  0.240  0.995  0.995 0.00466  4.76e⁻³ 0.153 
#>  8 2017-12-20 00:39:59  0.404  0.405  0.999  0.999 0.000481 5.81e⁻⁴ 0.709 
#>  9 2017-12-20 00:44:59  0.468  0.468  0.999  0.999 0.00101  1.11e⁻³ 0.0716
#> 10 2017-12-20 00:49:59  0.580  0.580  0.996  0.996 0.000336 4.36e⁻⁴ 0.395 
#> 11 2017-12-20 00:54:59  0.242  0.242  0.999  0.999 0.00111  1.21e⁻³ 0.762 
#> 12 2017-12-20 00:59:59  0.474  0.474  0.987  0.987 0.000858 9.58e⁻⁴ 0.335 
#> 13 2017-12-20 01:00:00  0.974  0.974  0.974  0.974 0.974    9.74e⁻¹ 0.974 
#> # ... with 1 more variable: ask_close <dbl>

Update: the post has been updated to reflect the changes in tibbletime 0.1.0.
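In the output above, the date column holds the last timestamp seen in each 5-minute group (00:04:59, 00:09:59, ..., and 01:00:00 for the final one-row group). The following is a minimal base-R sketch of that labelling step. It only reproduces the labels shown above, is not tibbletime's implementation, and assumes a UTC time zone so that 5-minute boundaries fall on multiples of 300 seconds.

```r
# Illustrative base-R sketch; not tibbletime's implementation.
# One timestamp per second for one hour (UTC is an assumption).
ts <- seq(as.POSIXct("2017-12-20 00:00:00", tz = "UTC"),
          as.POSIXct("2017-12-20 01:00:00", tz = "UTC"),
          by = "1 sec")

# Integer bin id: timestamps in the same 300-second window share a bin.
bin <- as.numeric(ts) %/% 300

# Label every row with the last timestamp observed in its bin.
label_num <- ave(as.numeric(ts), bin, FUN = max)
label <- as.POSIXct(label_num, origin = "1970-01-01", tz = "UTC")

# One label per 5-minute group.
period_ends <- unique(label)
format(period_ends, "%H:%M:%S")
```

Grouping by `label` and summarising then gives one row per period, which is what the `group_by(date)` step above relies on.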

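【Discussion】:

Thanks, Davis. I hadn't seen predicate functions in tibbletime, so I assumed it would drop the class. Agreed that this goes a step further toward the desired result.
@dmi3kno, predicate functions like summarise_all() are built on top of summarise() under the hood, so no class is dropped!

【Solution 2】:

I think it would be easier to use the aggregate function. However, depending on the data, you may need to convert the datetime column to character (in case the raw data contains millisecond values). If needed, I suggest using lubridate to convert the strings back to datetimes.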

GBPUSD$X2 <- format(GBPUSD$X2, "%Y-%m-%d %H:%M:%S") # optional; if the code below yields bad results. An explicit format always keeps the time part
GBPUSD$X2 <- substr(GBPUSD$X2, 1, 19) # optional; keeps "YYYY-MM-DD HH:MM:SS", i.e. truncates to whole seconds
# get High values for both bid and ask prices:
GBPUSD_H <- aggregate(cbind(X3, X4)~X1+X2, data=GBPUSD, FUN=max)
# get Low values for both bid and ask prices:
GBPUSD_L <- aggregate(cbind(X3, X4)~X1+X2, data=GBPUSD, FUN=min)
# merge the High and Low values together (base merge(), since aggregate() returns data frames)
GBPUSD_NEW <- merge(GBPUSD_H, GBPUSD_L, by=c("X1", "X2"), suffixes=c(".HIGH", ".LOW"))

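To get all the high, low, open and close values in one go:

library(data.table)
GBPUSD <- data.table(GBPUSD, key=c("X1", "X2")) # setting the key also sorts by X1, X2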
GBPUSD_NEW <- GBPUSD[, list(X3.HIGH=max(X3), X3.LOW=min(X3), X3.OPEN=X3[1],
                            X3.CLOSE=X3[length(X3)], X4.HIGH=max(X4), X4.LOW=min(X4),
                            X4.OPEN=X4[1], X4.CLOSE=X4[length(X4)]), by=c("X1", "X2")]

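However, for this to work you first need to sort the data, so that within each second the first value is the opening value and the last value is the closing value.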
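The sorting requirement can be illustrated with a small base-R sketch. The four ticks below are made-up values in the question's column layout, deliberately stored out of order. The sketch relies on order() leaving ties in their original order, so ticks sharing a timestamp keep their arrival order.

```r
# Made-up ticks, deliberately out of chronological order.
ticks <- data.frame(
  X1 = "GBP/USD",
  X2 = c("2017-06-01 00:00:01", "2017-06-01 00:00:00",
         "2017-06-01 00:00:01", "2017-06-01 00:00:00"),
  X3 = c(1.28754, 1.28756, 1.28753, 1.28754),
  stringsAsFactors = FALSE
)

# Sort by pair and timestamp; ties keep their original relative order.
ticks <- ticks[order(ticks$X1, ticks$X2), ]

# Open = first value per timestamp, close = last value per timestamp.
bid_open  <- tapply(ticks$X3, ticks$X2, function(v) v[1])
bid_close <- tapply(ticks$X3, ticks$X2, function(v) v[length(v)])

bid_open
bid_close
```

Without the sort, "first" and "last" would simply reflect the file order of the rows in each group, which need not be chronological.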

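Now, if you need minutes (or hours) instead of seconds, just adjust the substr call accordingly. If you want something more customised, such as 15-minute intervals, I suggest adding a helper column. Sample code: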

# floor the minute to a multiple of 15 and zero-pad it: "00" for minutes 00-14, "15" for 15-29, ...
GBPUSD$MIN <- sprintf("%02d", 15L * (as.integer(substr(GBPUSD$X2, 15, 16)) %/% 15L))
GBPUSD$X2 <- paste0(substr(GBPUSD$X2, 1, 14), GBPUSD$MIN, ":00")
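If the timestamps are kept as POSIXct rather than strings, the same 15-minute bucketing can be sketched numerically. This is an alternative to the substr approach above, not part of the original answer. It assumes a time zone whose UTC offset is a whole multiple of 15 minutes (UTC is used here), and the example timestamps are made up.

```r
# Made-up timestamps; UTC is an assumption.
x <- as.POSIXct(c("2017-06-01 00:07:30", "2017-06-01 00:15:00",
                  "2017-06-01 00:44:59", "2017-06-01 00:59:59"),
                tz = "UTC")

# 15 minutes = 900 seconds: floor the epoch seconds to a multiple of 900.
bucket_num <- (as.numeric(x) %/% 900) * 900
bucket <- as.POSIXct(bucket_num, origin = "1970-01-01", tz = "UTC")

# Explicit format so the time part is always printed.
format(bucket, "%Y-%m-%d %H:%M:%S")
```

Changing 900 to another number of seconds gives other interval widths, for example 300 for 5-minute bars.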

If this doesn't meet your requirements, don't hesitate to ask.

P.S.: NAs in the key columns will cause problems in aggregate. Deal with them first.

GBPUSD$X2[is.na(GBPUSD$X2)] <- "2017-05-05 00:00:00" #example; you need to be careful to use same class and format for the replacement

【Discussion】:

【Solution 3】:

This is a perfect example for trying out the great tibbletime package. I will generate my own data to make the point.

library(tibbletime)
library(dplyr) # for mutate() and %>%
df <- tibbletime::create_series(2017-12-20 + 01:06:00 ~ 2017-12-20 + 01:20:00, "sec") %>% 
         mutate(open=runif(nrow(.)),
                close=runif(nrow(.)))
df

This is now 14 minutes of data at one-second resolution (841 rows):

# A time tibble: 841 x 3
# Index: date
                  date       open       close
 *              <dttm>      <dbl>       <dbl>
 1 2017-12-20 01:06:00 0.63328803 0.357378011
 2 2017-12-20 01:06:01 0.09597444 0.150583962
 3 2017-12-20 01:06:02 0.23601820 0.974341599
 4 2017-12-20 01:06:03 0.71832656 0.092265867
 5 2017-12-20 01:06:04 0.32471587 0.391190310
 6 2017-12-20 01:06:05 0.76378711 0.534765217
 7 2017-12-20 01:06:06 0.92463265 0.694693458
 8 2017-12-20 01:06:07 0.74026638 0.006054806
 9 2017-12-20 01:06:08 0.77064030 0.911641146
10 2017-12-20 01:06:09 0.87130949 0.740816479
# ... with 831 more rows

Changing the periodicity of the data is as simple as a single command:

as_period(df, 5~M)

This reduces the data to 5-minute intervals (by default tibbletime picks the first observation of each period, rather than the mean or sum):

# A time tibble: 3 x 3
# Index: date
                 date      open     close
*              <dttm>     <dbl>     <dbl>
1 2017-12-20 01:06:00 0.6332880 0.3573780
2 2017-12-20 01:11:00 0.9235639 0.7043025
3 2017-12-20 01:16:00 0.6955685 0.1641798
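To make the difference between "first observation per period" and a real aggregate concrete, here is a hedged base-R sketch. It is not tibbletime code. It assumes periods are counted from the first timestamp, which matches the 01:06, 01:11, 01:16 rows shown above, and it uses its own random values, so the numbers will not match the output above.

```r
# Same time span as above, one row per second (UTC is an assumption).
ts <- seq(as.POSIXct("2017-12-20 01:06:00", tz = "UTC"),
          as.POSIXct("2017-12-20 01:20:00", tz = "UTC"),
          by = "1 sec")
set.seed(42)
df <- data.frame(date = ts, open = runif(length(ts)))

# Period id, counted in 300-second steps from the first timestamp.
df$period <- (as.numeric(df$date) - as.numeric(df$date[1])) %/% 300

# Option 1: keep only the first observation of each period.
first_obs <- df[!duplicated(df$period), ]

# Option 2: use every observation, here via the per-period mean.
period_mean <- aggregate(open ~ period, data = df, FUN = mean)

format(first_obs$date, "%H:%M:%S")
period_mean
```

Option 1 discards all rows but one per period, while Option 2 uses them all. Which one is appropriate depends on whether the goal is to sample the series less often or to summarise it.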

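Check out this great vignette for more details.

【Discussion】:

If "tibbletime by default picks the first observation of each period, rather than the mean or sum", does that mean tibbletime loses the information in every observation other than the first? To me, all the information in the dataset should be used.
Check the package documentation. There are time_summarize and time_collapse. In time series, aggregation does not always make sense. Imagine you had simply taken measurements less often. A mean would never match a real-life observation and may be affected by outliers.

【Solution 4】:

For teaching purposes, I have slightly modified the OP's original dataset: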

df <- data.frame(
X1=c("GBP/USD"), 
X2=c("2017-06-01 00:00:00", "2017-06-01 00:00:00", "2017-06-01 00:00:01", "2017-06-01 00:00:01", "2017-06-01 00:00:01", "2017-06-01 00:00:02", "2017-06-30 20:59:52", "2017-06-30 20:59:54", "2017-06-30 20:59:54", "2017-06-30 20:59:56", "2017-06-30 20:59:56", "2017-06-30 20:59:56"), 
X3=c(1.28756, 1.28754, 1.28754, 1.28753, 1.28752, 1.28757, 1.30093, 1.30121, 1.30100, 1.30146, 1.30145,1.30145), 
X4=c(1.28763, 1.28760, 1.28759, 1.28758, 1.28755, 1.28760,1.30300, 1.30300, 1.30390, 1.30452, 1.30447, 1.30447), 
stringsAsFactors=FALSE)

df

        X1                  X2      X3      X4
1  GBP/USD 2017-06-01 00:00:00 1.28756 1.28763
2  GBP/USD 2017-06-01 00:00:00 1.28754 1.28760
3  GBP/USD 2017-06-01 00:00:01 1.28754 1.28759
4  GBP/USD 2017-06-01 00:00:01 1.28753 1.28758
5  GBP/USD 2017-06-01 00:00:01 1.28752 1.28755
6  GBP/USD 2017-06-01 00:00:02 1.28757 1.28760
7  GBP/USD 2017-06-30 20:59:52 1.30093 1.30300
8  GBP/USD 2017-06-30 20:59:54 1.30121 1.30300
9  GBP/USD 2017-06-30 20:59:54 1.30100 1.30390
10 GBP/USD 2017-06-30 20:59:56 1.30146 1.30452
11 GBP/USD 2017-06-30 20:59:56 1.30145 1.30447
12 GBP/USD 2017-06-30 20:59:56 1.30145 1.30447

Now, in the low-frequency data, rows with the same timestamp form one group. So we have to find the indices at which each group begins and ends:

indices <- seq_along(df[,2])[!(duplicated(df[,2]))] # 1  3  6  7  8 10; the beginnings of groups (observations)
indices - 1   # 0  2  5  6  7   9; for finding the endings of groups
numberoflowfreq <- length(indices) # 6: number of groupings (obs.) for Low Freq data

Writing it out explicitly to understand the pattern:

mean(df[1:((indices -1)[2]),3]) # from 1 to 2
mean(df[indices[2]:((indices -1)[3]),3]) # from 3 to 5
mean(df[indices[3]:((indices -1)[4]),3]) # from 6 to 6
mean(df[indices[4]:((indices -1)[5]),3]) # from 7 to 7
mean(df[indices[5]:((indices -1)[6]),3]) # from 8 to 9
mean(df[indices[6]:nrow(df),3]) # from 10 to 12

Simplifying the pattern:

mean3rdColumn_1st <- mean(df[1:((indices -1)[2]),3]) # from 1 to 2
mean3rdColumn_Between <- sapply(2:(numberoflowfreq-1), function(i)  mean(df[indices[i]:((indices -1)[i+1]),3]) )
mean3rdColumn_Last <- mean(df[indices[6]:nrow(df),3]) # from 10 to 12
# 3rd column in low frequency data:    
c(mean3rdColumn_1st, mean3rdColumn_Between, mean3rdColumn_Last)

The same goes for the 4th column:

mean4thColumn_1st <- mean(df[1:((indices -1)[2]),4]) # from 1 to 2
mean4thColumn_Between <- sapply(2:(numberoflowfreq-1), function(i)  mean(df[indices[i]:((indices -1)[i+1]),4]) )
mean4thColumn_Last <- mean(df[indices[6]:nrow(df),4]) # from 10 to 12
# 4th column in low frequency data: 
c(mean4thColumn_1st, mean4thColumn_Between, mean4thColumn_Last)

Putting it all together:

LowFrqData <- data.frame(X1=c("GBP/USD"), X2=df[indices,2], X3=c(mean3rdColumn_1st, mean3rdColumn_Between, mean3rdColumn_Last),   x4=c(mean4thColumn_1st, mean4thColumn_Between, mean4thColumn_Last), stringsAsFactors=FALSE)
LowFrqData 

       X1                  X2       X3       x4
1 GBP/USD 2017-06-01 00:00:00 1.287550 1.287615
2 GBP/USD 2017-06-01 00:00:01 1.287530 1.287573
3 GBP/USD 2017-06-01 00:00:02 1.287570 1.287600
4 GBP/USD 2017-06-30 20:59:52 1.300930 1.303000
5 GBP/USD 2017-06-30 20:59:54 1.301105 1.303450
6 GBP/USD 2017-06-30 20:59:56 1.301453 1.304487

Now column X2 holds unique timestamps (one row per second), and X3 and X4 are the means of the corresponding cells.
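As a cross-check, the same per-timestamp means can be obtained with a single aggregate() call. This is a shortcut sketch rather than the step-by-step method above; the data frame repeats the 12 rows defined earlier in this answer.

```r
# The 12-row example dataset from this answer.
df <- data.frame(
  X1 = "GBP/USD",
  X2 = c("2017-06-01 00:00:00", "2017-06-01 00:00:00", "2017-06-01 00:00:01",
         "2017-06-01 00:00:01", "2017-06-01 00:00:01", "2017-06-01 00:00:02",
         "2017-06-30 20:59:52", "2017-06-30 20:59:54", "2017-06-30 20:59:54",
         "2017-06-30 20:59:56", "2017-06-30 20:59:56", "2017-06-30 20:59:56"),
  X3 = c(1.28756, 1.28754, 1.28754, 1.28753, 1.28752, 1.28757,
         1.30093, 1.30121, 1.30100, 1.30146, 1.30145, 1.30145),
  X4 = c(1.28763, 1.28760, 1.28759, 1.28758, 1.28755, 1.28760,
         1.30300, 1.30300, 1.30390, 1.30452, 1.30447, 1.30447),
  stringsAsFactors = FALSE
)

# Mean of bid (X3) and ask (X4) for every distinct pair/timestamp.
means <- aggregate(cbind(X3, X4) ~ X1 + X2, data = df, FUN = mean)
means
```

The result has one row per distinct timestamp, with the same X3 and X4 values as LowFrqData.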

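Also note: a range may not contain a value for every time step. In that case, NAs can be inserted for the missing steps. On the other hand, one might choose to ignore the effect of the irregularity, since for many observations the spacing will (or may) be the same, so the series is not highly irregular. Also bear in mind that converting the data to equally spaced observations by linear interpolation can introduce a number of significant and hard-to-quantify biases (see Scholes and Williams).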

M. Scholes and J. Williams, “Estimating betas from nonsynchronous data”, Journal of Financial Economics 5: 309–327, 1977.
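Inserting NAs for missing time steps, as mentioned above, can be sketched by merging the observations onto a complete time grid. The three observations below are made-up values, and UTC is an assumption.

```r
# Made-up irregular observations: seconds 2 and 3 are missing.
obs <- data.frame(
  X2 = as.POSIXct(c("2017-06-01 00:00:00", "2017-06-01 00:00:01",
                    "2017-06-01 00:00:04"), tz = "UTC"),
  X3 = c(1.28755, 1.28753, 1.28757)
)

# Complete one-second grid spanning the observed range.
grid <- data.frame(X2 = seq(min(obs$X2), max(obs$X2), by = "1 sec"))

# Left join: every grid point is kept, unobserved ones get NA.
regular <- merge(grid, obs, by = "X2", all.x = TRUE)
regular
```

The NAs are left in place here rather than interpolated, in line with the caution about interpolation bias above.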

Now, the regular 5-minute series part:

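# POSIXct stores seconds since 1970-01-01 00:00:00 UTC; the two values below are what a UTC+3 local time zone gives
as.numeric(as.POSIXct("1970-01-01 03:00:00"))  # 0 in UTC+3; starting point for ZERO seconds. "1970-01-01 03:01:00" equals 60.
as.numeric(as.POSIXct("2017-06-01 00:00:00")) # 1496264400 in UTC+3
# Seconds elapsed since the first observation in the dataset (computed from the data, so it does not depend on the local time zone)
PassedSecs <- as.numeric(as.POSIXct(LowFrqData$X2)) - as.numeric(as.POSIXct(LowFrqData$X2[1]))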

LowFrq5minuteRaw <- cbind(LowFrqData, PassedSecs, stringsAsFactors=FALSE)
LowFrq5minuteRaw

       X1                  X2       X3       x4 PassedSecs
1 GBP/USD 2017-06-01 00:00:00 1.287550 1.287615          0
2 GBP/USD 2017-06-01 00:00:01 1.287530 1.287573          1
3 GBP/USD 2017-06-01 00:00:02 1.287570 1.287600          2
4 GBP/USD 2017-06-30 20:59:52 1.300930 1.303000    2581192
5 GBP/USD 2017-06-30 20:59:54 1.301105 1.303450    2581194
6 GBP/USD 2017-06-30 20:59:56 1.301453 1.304487    2581196

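5 minutes means 5*60 = 300 seconds. So observations whose elapsed seconds give the same integer quotient when divided by 300 fall into the same 5-minute interval.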

LowFrq5minuteRaw2 <- cbind(LowFrqData, PassedSecs, QbyDto300 = PassedSecs%/%300, stringsAsFactors=FALSE)
LowFrq5minuteRaw2

       X1                  X2       X3       x4 PassedSecs QbyDto300
1 GBP/USD 2017-06-01 00:00:00 1.287550 1.287615          0         0
2 GBP/USD 2017-06-01 00:00:01 1.287530 1.287573          1         0
3 GBP/USD 2017-06-01 00:00:02 1.287570 1.287600          2         0
4 GBP/USD 2017-06-30 20:59:52 1.300930 1.303000    2581192      8603
5 GBP/USD 2017-06-30 20:59:54 1.301105 1.303450    2581194      8603
6 GBP/USD 2017-06-30 20:59:56 1.301453 1.304487    2581196      8603

indices2 <- seq_along(LowFrq5minuteRaw2[,6])[!(duplicated(LowFrq5minuteRaw2[,6]))] # 1  4; the beginnings of groups

LowFrq5minute <- data.frame(X1=c("GBP/USD"), X2=LowFrq5minuteRaw2[indices2,2], X3=aggregate(LowFrqData[,3] ~ QbyDto300, LowFrq5minuteRaw2, mean)[,2], X4=aggregate(LowFrqData[,4] ~ QbyDto300, LowFrq5minuteRaw2, mean)[,2])
LowFrq5minute

       X1                  X2       X3       X4
1 GBP/USD 2017-06-01 00:00:00 1.287550 1.287596
2 GBP/USD 2017-06-30 20:59:52 1.301163 1.303646

X2 holds the timestamp of the first observation in each 5-minute interval.
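The result above contains means. Since the question asks for OHLC values, the same quotient-by-300 grouping can be applied with open/high/low/close in place of mean. The following is a sketch under two assumptions: the rows are already in chronological order, and the timestamps are parsed as UTC. The helper name ohlc is introduced here for illustration only.

```r
# Bid column (X3) of the 12-row example dataset from this answer.
X2 <- c("2017-06-01 00:00:00", "2017-06-01 00:00:00", "2017-06-01 00:00:01",
        "2017-06-01 00:00:01", "2017-06-01 00:00:01", "2017-06-01 00:00:02",
        "2017-06-30 20:59:52", "2017-06-30 20:59:54", "2017-06-30 20:59:54",
        "2017-06-30 20:59:56", "2017-06-30 20:59:56", "2017-06-30 20:59:56")
X3 <- c(1.28756, 1.28754, 1.28754, 1.28753, 1.28752, 1.28757,
        1.30093, 1.30121, 1.30100, 1.30146, 1.30145, 1.30145)

# Seconds since the first observation, then the 5-minute bin id.
secs <- as.numeric(as.POSIXct(X2, tz = "UTC"))
bin <- (secs - secs[1]) %/% 300

# Hypothetical helper: OHLC of one group, assuming chronological order.
ohlc <- function(v) c(open = v[1], high = max(v), low = min(v), close = v[length(v)])

# One row per 5-minute bin, columns open/high/low/close.
bid_ohlc <- t(sapply(split(X3, bin), ohlc))
bid_ohlc
```

The row names are the bin ids (0 and 8603 here). Applying the same steps to X4 gives the ask-side bars.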

【Discussion】: