Mean of last year (half-year, month) of observations using data.table: answers

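【Question title】: Mean of last year (halfyear, month) of observations using data.table
【Posted】: 2021-01-07 03:12:41
【Question description】:

I want to calculate the average trading volume over the past 12, 6, and 3 months for every ticker in my dataset, based on daily data. Here is sample data: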

library(BatchGetSymbols)
sample <- BatchGetSymbols(tickers = c('AAPL', 'AMZN'), first.date = Sys.Date() - 500)
sample <- sample$df.tickers
sample <- sample[, c('ticker', 'ref.date', 'volume')]

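For example, if the date is 2020-05-15, I want to calculate the 3-month average volume (from 2020-02-15 to 2020-05-15). Importantly, the daily data is irregular, so the first date in the window might not be 2020-02-15 but, for example, 2020-02-14.

【Question comments】:

Tags: r data.table

【Solution 1】:

Here is an approach that uses the concept of a rolling time window.

I don't happen to have that package installed, so I'll use the ExampleData.rds that ships with the package (which I downloaded from https://github.com/cran/BatchGetSymbols/tree/master/inst/extdata). I subset it to the first two tickers. (I did this for simplicity/demonstration, not because it is required. I also ran this code on the full dataset with all 15 tickers, and it took under 0.03 seconds for the two past periods.)

The data only covers 2014, so I've also shortened your periods to 1 and 3 months; you can add as many periods as you like.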

SAMP <- readRDS("ExampleData.rds")
library(data.table)
setDT(SAMP)
unique(SAMP$ticker)[1:2]
SAMP2 <- SAMP[ticker %in% unique(ticker)[1:2], .(ticker, ref.date, volume)]
SAMP2
#        ticker   ref.date   volume
#   1: ABEV3.SA 2014-01-02  8036139
#   2: ABEV3.SA 2014-01-03 24922793
#   3: ABEV3.SA 2014-01-06  9355961
#   4: ABEV3.SA 2014-01-07 18755025
#   5: ABEV3.SA 2014-01-08 11446953
#  ---                             
# 492: BBAS3.SA 2014-12-22  3222300
# 493: BBAS3.SA 2014-12-23  3234100
# 494: BBAS3.SA 2014-12-26  1553400
# 495: BBAS3.SA 2014-12-29  1984000
# 496: BBAS3.SA 2014-12-30  2800100

I use magrittr only to break things up into a pipe; it is not required. I also add lubridate in order to use add_with_rollback (aka %m-%) to neatly look back n months without odd side effects.

library(magrittr)                      # %>%
library(lubridate)                     # %m-% or add_with_rollback, months
past <- c(1, 3)                        # change this to your c(3, 6, 12)
names(past) <- paste0("months", past)
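
As a small illustration of why the rollback helper matters, the sketch below compares plain period subtraction with %m-% and add_with_rollback on a month-end date. The date and the variable names (d, naive, rolled, rolled2) are made up for this example and are not part of the original answer.

```r
library(lubridate)

d <- as.Date("2020-03-31")   # hypothetical month-end date

# Plain subtraction lands on the non-existent 2020-02-31 and yields NA
naive <- d - months(1)

# %m-% rolls back to the last valid day of the target month instead
rolled <- d %m-% months(1)

# add_with_rollback with a negative period does the same thing
rolled2 <- add_with_rollback(d, months(-1))

print(naive)
print(rolled)
print(rolled2)
```

For ordinary mid-month dates all three forms agree; the difference only shows up when the target month is shorter than the starting day of month.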

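This operates as a self-join using a non-equi (or range) join. Because of some nuances in how that is done (and the state of the column names afterwards), my technique is to duplicate one of the reference columns (ref.date): after a range join, the LHS column of the range keeps its name but takes its values from the RHS. Regardless of the coding intent behind the scenes, I find this can be confusing, so I accept the short-term performance hit of copying a column and removing it later. Since it does not affect the other columns, it does not significantly reduce the overall efficiency of using data.table (and its reference semantics).

I also create a calculated column (one for each period) called past.date. I then join using the logic past.date <= ref.date <= present (if only data.table could infer that correctly from the more concise description :-).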
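
To make the column-naming quirk concrete, here is a minimal sketch of a non-equi join on toy data. The tables x and i and their columns (id, lo, hi, t, xval, ival) are invented for illustration and have nothing to do with the ticker data.

```r
library(data.table)

# Toy tables, invented for illustration only
x <- data.table(id = "a", lo = c(1, 5), hi = c(4, 9), xval = c(10, 20))
i <- data.table(id = "a", t = c(2, 6, 7), ival = c(100, 200, 300))

# For each row of i, find the rows of x whose [lo, hi] range contains t
res <- x[i, on = .(id, lo <= t, hi >= t)]
print(res)

# The join columns keep x's names (lo, hi) but now hold i's t values,
# and t itself no longer appears in the result
print(names(res))
```

Here lo and hi in the result both hold 2, 6, 7 (the values of t) rather than the original range bounds, which is why copying ref.date into a separate present column before the join keeps the later steps readable.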

SAMP2[, present := ref.date]
newdats <- Map(function(nm, P) {
  SAMP2[, past.date := add_with_rollback(ref.date, months(-P)) ] %>%
    SAMP2[., on = .(ticker == ticker, past.date <= ref.date, present >= ref.date) ] %>%
    .[, setNames(.(mean(i.volume)), nm), by = .(ticker, ref.date) ]
}, names(past), past)
SAMP2[, c("present", "past.date") := NULL ] # clean up extra columns
out <- Reduce(function(a,b) merge(a, b, by = c("ticker", "ref.date"), all.x = TRUE), newdats, init = SAMP2)
out
#        ticker   ref.date   volume  months1  months3
#   1: ABEV3.SA 2014-01-02  8036139  8036139  8036139
#   2: ABEV3.SA 2014-01-03 24922793 16479466 16479466
#   3: ABEV3.SA 2014-01-06  9355961 14104964 14104964
#   4: ABEV3.SA 2014-01-07 18755025 15267480 15267480
#   5: ABEV3.SA 2014-01-08 11446953 14503374 14503374
#  ---                                               
# 492: BBAS3.SA 2014-12-22  3222300 25233937 16780737
# 493: BBAS3.SA 2014-12-23  3234100 24233945 16626387
# 494: BBAS3.SA 2014-12-26  1553400 24653742 16681359
# 495: BBAS3.SA 2014-12-29  1984000 26420778 16543630
# 496: BBAS3.SA 2014-12-30  2800100 25239744 16368576

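To demonstrate that this does what we expect, I'll show two things:

First, I'll break into the Map on its first iteration and look at the data immediately after the self-join (SAMP2[., on=.(...)]) but before summarizing. (I'll also order it to highlight what is there. This step is not needed for the calculation or the join to be correct.)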

SAMP2[, past.date := add_with_rollback(ref.date, months(-P)) ]
#        ticker   ref.date   volume    present  past.date
#   1: ABEV3.SA 2014-01-02  8036139 2014-01-02 2013-12-02
#   2: ABEV3.SA 2014-01-03 24922793 2014-01-03 2013-12-03
#   3: ABEV3.SA 2014-01-06  9355961 2014-01-06 2013-12-06
#   4: ABEV3.SA 2014-01-07 18755025 2014-01-07 2013-12-07
#   5: ABEV3.SA 2014-01-08 11446953 2014-01-08 2013-12-08
#  ---                                                   
# 492: BBAS3.SA 2014-12-22  3222300 2014-12-22 2014-11-22
# 493: BBAS3.SA 2014-12-23  3234100 2014-12-23 2014-11-23
# 494: BBAS3.SA 2014-12-26  1553400 2014-12-26 2014-11-26
# 495: BBAS3.SA 2014-12-29  1984000 2014-12-29 2014-11-29
# 496: BBAS3.SA 2014-12-30  2800100 2014-12-30 2014-11-30

SAMP2[, past.date := add_with_rollback(ref.date, months(-P)) ] %>%
    SAMP2[., on = .(ticker == ticker, past.date <= ref.date, present >= ref.date) ] %>%
    .[ order(ticker, ref.date), ]
#          ticker   ref.date   volume    present  past.date i.volume  i.present i.past.date
#     1: ABEV3.SA 2014-01-02  8036139 2014-01-02 2014-01-02  8036139 2014-01-02  2013-12-02
#     2: ABEV3.SA 2014-01-03 24922793 2014-01-02 2014-01-02  8036139 2014-01-02  2013-12-02
#     3: ABEV3.SA 2014-01-03 24922793 2014-01-03 2014-01-03 24922793 2014-01-03  2013-12-03
#     4: ABEV3.SA 2014-01-06  9355961 2014-01-02 2014-01-02  8036139 2014-01-02  2013-12-02
#     5: ABEV3.SA 2014-01-06  9355961 2014-01-03 2014-01-03 24922793 2014-01-03  2013-12-03
#    ---                                                                                   
# 10326: BBAS3.SA 2014-12-30  2800100 2014-12-22 2014-12-22  3222300 2014-12-22  2014-11-22
# 10327: BBAS3.SA 2014-12-30  2800100 2014-12-23 2014-12-23  3234100 2014-12-23  2014-11-23
# 10328: BBAS3.SA 2014-12-30  2800100 2014-12-26 2014-12-26  1553400 2014-12-26  2014-11-26
# 10329: BBAS3.SA 2014-12-30  2800100 2014-12-29 2014-12-29  1984000 2014-12-29  2014-11-29
# 10330: BBAS3.SA 2014-12-30  2800100 2014-12-30 2014-12-30  2800100 2014-12-30  2014-11-30

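Note that the first ref.date (2014-01-02) appears once (no surprise, since this set has no data before 2014), 2014-01-03 has two rows (the 02 and the 03), and so on.

Second, I'll change the process to add the number of observations used for each aggregate.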

newdats <- Map(function(nm, P) {
  SAMP2[, past.date := add_with_rollback(ref.date, months(-P)) ] %>%
    SAMP2[., on = .(ticker == ticker, past.date <= ref.date, present >= ref.date) ] %>%
    .[, setNames(.(mean(i.volume), .N), c(nm, paste0(nm, "_n"))), by = .(ticker, ref.date) ]
}, names(past), past)
SAMP2[, c("present", "past.date") := NULL ]
out <- Reduce(function(a,b) merge(a, b, by = c("ticker", "ref.date"), all.x = TRUE), newdats, init = SAMP2)
out
#        ticker   ref.date   volume  months1 months1_n  months3 months3_n
#   1: ABEV3.SA 2014-01-02  8036139  8036139         1  8036139         1
#   2: ABEV3.SA 2014-01-03 24922793 16479466         2 16479466         2
#   3: ABEV3.SA 2014-01-06  9355961 14104964         3 14104964         3
#   4: ABEV3.SA 2014-01-07 18755025 15267480         4 15267480         4
#   5: ABEV3.SA 2014-01-08 11446953 14503374         5 14503374         5
#  ---                                                                   
# 492: BBAS3.SA 2014-12-22  3222300 25233937        21 16780737        65
# 493: BBAS3.SA 2014-12-23  3234100 24233945        22 16626387        65
# 494: BBAS3.SA 2014-12-26  1553400 24653742        21 16681359        63
# 495: BBAS3.SA 2014-12-29  1984000 26420778        19 16543630        63
# 496: BBAS3.SA 2014-12-30  2800100 25239744        20 16368576        63

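We can see that the first observation has 1 row feeding it, and the last row (of the second ticker) has 20 rows feeding it. This highlights that the sample data provided does not have entire months.

(I think a mean without the context of n can be misleading, so this might be worthwhile.)

Note: in general, range joins in data.table can be done efficiently, but if done carelessly they can cause memory usage to balloon. For example, if you choose years instead of months, the self-join gets closer to a cartesian join on ticker. With smaller datasets, even that may not be a problem. With more tickers and larger data, and given unlimited memory and time, this problem can be reduced to a simple (non-range) cartesian join on ticker, filtering to the appropriate dates (based on past.date, ref.date, and present), and then summarizing.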
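
The cartesian-join variant described in the note can be sketched as follows. This is only an illustration on invented toy data, not code from the answer: the table dt, the helper columns obs.date and obs.volume, and the single 1-month window are all assumptions made for the example.

```r
library(data.table)
library(lubridate)

# Toy data, invented for illustration (irregular dates, two tickers)
dt <- data.table(
  ticker   = rep(c("A", "B"), each = 4),
  ref.date = rep(as.Date(c("2020-01-15", "2020-02-14",
                           "2020-02-20", "2020-03-16")), 2),
  volume   = c(10, 20, 30, 40, 1, 2, 3, 4)
)
dt[, past.date := ref.date %m-% months(1)]   # start of each 1-month window

# Plain equi-join on ticker: every row is paired with every row
# of the same ticker (this is the memory-hungry step)
pairs <- merge(
  dt,
  dt[, .(ticker, obs.date = ref.date, obs.volume = volume)],
  by = "ticker", allow.cartesian = TRUE
)

# Keep only observations inside each row's window, then summarize
res <- pairs[obs.date >= past.date & obs.date <= ref.date,
             .(months1 = mean(obs.volume), months1_n = .N),
             by = .(ticker, ref.date)]
print(res)
```

The intermediate pairs table grows with the square of the number of rows per ticker, so this form is mainly useful for checking the range-join results on small data.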

【Comments】:

【Solution 2】:

Here is an approach with dplyr and purrr:

library(dplyr)
library(purrr)
sample %>%
  group_by(ticker) %>%
  summarise(map_dfr(setNames(c(3,6,12),c("3month","6month","12month")),
                    ~ mean(volume[ref.date > (Sys.Date() - 30 * .x)])))
#   ticker   `3month`   `6month`  `12month`
#   <chr>       <dbl>      <dbl>      <dbl>
# 1 AAPL   114310505  144409320. 152001362.
# 2 AMZN     4421073.   4711179.   4966227.

Or with data.table:

library(data.table)
setDT(sample)
sample[,lapply(setNames(c(3,6,12),c("3month","6month","12month")),
               function(x)mean(volume[ref.date > (Sys.Date() - 30 * x)])),by = ticker]
#    ticker    3month    6month   12month
# 1:   AAPL 114310505 144409320 152001362
# 2:   AMZN   4421073   4711179   4966227

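【Comments】:

Cambell, my question was not clear enough. I want to calculate the 3-, 6-, and 12-month averages for every observation (of course, the value will be NA for the first n observations). So I have to construct 3 new variables. It is similar to a rolling operation, but over a span of time rather than an explicit n argument.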