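Rolling by group in data.table R: answers

【Question title】: Rolling by group in data.table R
【Posted】: 2023-03-11 00:15:01
【Question description】:

I am trying to roll my function by group with data.table and am running into problems. I don't know whether I should change my function or whether my call is wrong. Here is a simple example: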

Data

library(data.table)
library(zoo)

test <- data.table(return=c(0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2, 0.2),
                   sec=c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"))

My function

zoo_fun <- function(dt, N) {
  (rollapply(dt$return + 1, N, FUN=prod, fill=NA, align='right') - 1)
}

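Running it (I want to create a new column, momentum3, which is the product of the last 3 observations, each plus one, per security, hence grouping with by = sec):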

test[, momentum3 := zoo_fun(test, 3), by=sec]

    Warning messages:
    1: In `[.data.table`(test, , `:=`(momentum3, zoo_fun(test, 3)), by = sec) :
      RHS 1 is length 10 (greater than the size (5) of group 1). The last 5 element(s) will be discarded.
    2: In `[.data.table`(test, , `:=`(momentum3, zoo_fun(test, 3)), by = sec) :
      RHS 1 is length 10 (greater than the size (5) of group 2). The last 5 element(s) will be discarded.

I get warnings, and the result is not what I expected:

> test
    return sec momentum3
 1:    0.1   A        NA
 2:    0.1   A        NA
 3:    0.1   A     0.331
 4:    0.1   A     0.331
 5:    0.1   A     0.331
 6:    0.2   B        NA
 7:    0.2   B        NA
 8:    0.2   B     0.331
 9:    0.2   B     0.331
10:    0.2   B     0.331

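I expected sec B to be filled with 0.728 ((1.2*1.2*1.2) - 1), with two NAs at the start. What am I doing wrong? Is it that rolling functions do not work with grouping?

【Question discussion】:

Tags: r data.table grouping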

【Solution 1】:

A related answer suggests using Reduce() and shift() for rolling-window problems with data.table. A related benchmark indicates that this can be considerably faster than zoo::rollapply().

test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][]
#    return sec momentum
# 1:    0.1   A       NA
# 2:    0.1   A       NA
# 3:    0.1   A    0.331
# 4:    0.1   A    0.331
# 5:    0.1   A    0.331
# 6:    0.2   B       NA
# 7:    0.2   B       NA
# 8:    0.2   B    0.728
# 9:    0.2   B    0.728
#10:    0.2   B    0.728
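
To make the one-liner above easier to follow, here is a minimal sketch of what shift() and Reduce() each contribute. The toy vector x and the names lags and rolling_prod are illustrative assumptions, not part of the original answer.

```r
library(data.table)

# Hypothetical toy vector of growth factors (1 + return)
x <- c(1.1, 1.2, 1.3, 1.4)

# With a vector of lags, shift() returns a list holding one lagged copy per lag:
# lags[[1]] is x itself, lags[[2]] is x lagged by one, lags[[3]] by two
lags <- shift(x, 0:2, type = "lag")

# Reduce() multiplies the list elements position by position, so each entry
# becomes the product of the current value and the two values before it
rolling_prod <- Reduce(`*`, lags)
rolling_prod
```

The first two entries are NA because the lagged copies have no value there, which matches the fill = NA behaviour of the rollapply() version.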

Benchmark (10 rows, OP data set)

microbenchmark::microbenchmark(
  zoo = test[, momentum := zoo_fun(return, 3), by = sec][],
  red  = test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][],
  times = 100L
)
#Unit: microseconds
# expr      min       lq      mean   median        uq      max neval cld
#  zoo 2318.209 2389.131 2445.1707 2421.541 2466.1930 3108.382   100   b
#  red  562.465  625.413  663.4893  646.880  673.4715 1094.771   100  a

Benchmark (100k rows)

To verify the benchmark results obtained with the small data set, a larger data set is constructed:

n_rows <- 1e4
test0 <- data.table(return = rep(as.vector(outer(1:5/100, 1:2/10, "+")), n_rows),
                   sec = rep(rep(c("A", "B"), each = 5L), n_rows))

test0
#        return sec
#     1:   0.11   A
#     2:   0.12   A
#     3:   0.13   A
#     4:   0.14   A
#     5:   0.15   A
#    ---           
# 99996:   0.21   B
# 99997:   0.22   B
# 99998:   0.23   B
# 99999:   0.24   B
#100000:   0.25   B

Because test is modified in place, each benchmark run starts with a fresh copy of test0.

microbenchmark::microbenchmark(
  copy = test <- copy(test0),
  zoo  = {
    test <- copy(test0)
    test[, momentum := zoo_fun(return, 3), by = sec][]
  },
  red  = {
    test <- copy(test0)
    test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][]
  },
  times = 10L
)

#Unit: microseconds
# expr         min          lq         mean      median          uq         max neval cld
# copy     282.619     294.512     325.3261     298.424     350.272     414.983    10  a 
#  zoo 1129601.974 1144346.463 1188484.0653 1162598.499 1194430.395 1337727.279    10   b
#  red    3354.554    3439.095    6135.8794    5002.008    7695.948   11443.595    10  a

For 100k rows, the Reduce() / shift() approach is more than 200 times faster than zoo::rollapply(), going by the median timings.
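
The benchmark above copies test0 before every run because := changes a data.table by reference. As a minimal sketch of that behaviour (the tables base, alias and fresh are hypothetical and exist only for this illustration):

```r
library(data.table)

base <- data.table(x = 1:3)

# Plain assignment does not copy: both names refer to the same table
alias <- base
alias[, y := x * 2L]          # adds column y by reference
y_leaked <- "y" %in% names(base)

# copy() creates an independent table, so later changes stay local to it
fresh <- copy(base)
fresh[, z := x + 1L]
z_leaked <- "z" %in% names(base)

c(y_leaked = y_leaked, z_leaked = z_leaked)
```

Here y_leaked should be TRUE and z_leaked FALSE, which is why the timed expressions work on copy(test0) rather than on test0 itself.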

Apparently, there are different interpretations of the expected result.

To investigate this, a modified data set is used:

test <- data.table(return=c(0.1, 0.11, 0.12, 0.13, 0.14, 0.21, 0.22, 0.23, 0.24, 0.25),
                   sec=c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"))
test
#    return sec
# 1:   0.10   A
# 2:   0.11   A
# 3:   0.12   A
# 4:   0.13   A
# 5:   0.14   A
# 6:   0.21   B
# 7:   0.22   B
# 8:   0.23   B
# 9:   0.24   B
#10:   0.25   B

Note that the return values vary within each group, unlike the OP's data set, where the return values are constant within each sec group.

With this data, the accepted answer (rollapply()) returns:

test[, momentum := zoo_fun(return, 3), by = sec][]
#    return sec momentum
# 1:   0.10   A       NA
# 2:   0.11   A       NA
# 3:   0.12   A 0.367520
# 4:   0.13   A 0.404816
# 5:   0.14   A 0.442784
# 6:   0.21   B       NA
# 7:   0.22   B       NA
# 8:   0.23   B 0.815726
# 9:   0.24   B 0.860744
#10:   0.25   B 0.906500

Henrik's answer returns:

test[test[ , tail(.I, 3), by = sec]$V1, res := prod(return + 1) - 1, by = sec][]
#    return sec      res
# 1:   0.10   A       NA
# 2:   0.11   A       NA
# 3:   0.12   A 0.442784
# 4:   0.13   A 0.442784
# 5:   0.14   A 0.442784
# 6:   0.21   B       NA
# 7:   0.22   B       NA
# 8:   0.23   B 0.906500
# 9:   0.24   B 0.906500
#10:   0.25   B 0.906500

The Reduce()/shift() solution returns:

test[, momentum := Reduce(`*`, shift(return + 1.0, 0:2, type="lag")) - 1, by = sec][]
#    return sec momentum
# 1:   0.10   A       NA
# 2:   0.11   A       NA
# 3:   0.12   A 0.367520
# 4:   0.13   A 0.404816
# 5:   0.14   A 0.442784
# 6:   0.21   B       NA
# 7:   0.22   B       NA
# 8:   0.23   B 0.815726
# 9:   0.24   B 0.860744
#10:   0.25   B 0.906500
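
The printed tables for rollapply() and Reduce()/shift() look identical. A minimal sketch of how to check that programmatically, assuming the column-based zoo_fun() from the other solution; the table dt and the column names ret, m_zoo and m_red are illustrative choices:

```r
library(data.table)
library(zoo)

# Column-based helper, as defined in the other solution
zoo_fun <- function(column, N) {
  rollapply(column + 1, N, FUN = prod, fill = NA, align = "right") - 1
}

# "ret" is used instead of "return" to avoid the clash with base R's return()
dt <- data.table(ret = c(0.10, 0.11, 0.12, 0.13, 0.14, 0.21, 0.22, 0.23, 0.24, 0.25),
                 sec = rep(c("A", "B"), each = 5L))

dt[, m_zoo := zoo_fun(ret, 3), by = sec]
dt[, m_red := Reduce(`*`, shift(ret + 1, 0:2, type = "lag")) - 1, by = sec]

# TRUE if both columns agree up to floating-point tolerance, NAs included
same_result <- isTRUE(all.equal(dt$m_zoo, dt$m_red))
same_result
```

all.equal() is used rather than identical() because the two approaches may multiply the factors in a different order, which can produce tiny floating-point differences.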

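【Discussion】:

How large is your test data? Benchmarks on very small data can be misleading when applied to (large) real-world data.
@docendodiscimus I have added a 100k-row benchmark, which confirms what the benchmark on the small OP data set indicated.
Thanks. In my situation, run time is the smallest part of the time I spend (coding takes far longer). What if I want to use a different function in that Reduce? sd, mean, ... would they work too? I like rollapply because it is easy to change the function.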
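
On the question of swapping in other functions, here is a minimal sketch under stated assumptions. The helper roll_by(), the table dt and its column names are hypothetical. Reduce() only combines the lagged vectors pairwise, so a rolling mean fits that pattern (sum the lags, divide by the window width), whereas something like sd is more straightforward with a rolling helper that takes the function as an argument.

```r
library(data.table)
library(zoo)

# Hypothetical helper: right-aligned rolling window with an arbitrary summary function
roll_by <- function(x, N, FUN) rollapplyr(x, N, FUN, fill = NA)

dt <- data.table(ret = c(0.10, 0.11, 0.12, 0.13, 0.14, 0.21, 0.22, 0.23, 0.24, 0.25),
                 sec = rep(c("A", "B"), each = 5L))

# Changing the summary is just a matter of passing a different function
dt[, roll_mean := roll_by(ret, 3, mean), by = sec]
dt[, roll_sd   := roll_by(ret, 3, sd),   by = sec]

# A rolling mean expressed in the Reduce()/shift() style
dt[, roll_mean2 := Reduce(`+`, shift(ret, 0:2, type = "lag")) / 3, by = sec]
dt[]
```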

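【Solution 2】:

When you pass test and use dt$return inside the function, the whole data.table is picked up within each group instead of just that group's rows. Every call therefore returns 10 values, and data.table keeps only the first 5 for each group, which explains both the warnings and the 0.331 in group B. Just use the column you need in the function definition and it works fine: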

#use the column instead of the data.table
zoo_fun <- function(column, N) {
  (rollapply(column + 1, N, FUN=prod, fill=NA, align='right') - 1)
}

#now it works fine
test[, momentum := zoo_fun(return, 3), by = sec]

As a separate note, you should probably not use return as a column or variable name, since return() is a base R function.

Output:

> test
    return sec momentum
 1:    0.1   A       NA
 2:    0.1   A       NA
 3:    0.1   A    0.331
 4:    0.1   A    0.331
 5:    0.1   A    0.331
 6:    0.2   B       NA
 7:    0.2   B       NA
 8:    0.2   B    0.728
 9:    0.2   B    0.728
10:    0.2   B    0.728
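
To see the length mismatch behind the original warnings directly, here is a minimal sketch that compares the group size with the length returned by the original, table-based helper. The names zoo_fun_dt and lens are illustrative; passing .SD (the current group's subset) is shown only for contrast and is an assumption about one way the original function could have been kept.

```r
library(data.table)
library(zoo)

test <- data.table(return = rep(c(0.1, 0.2), each = 5L),
                   sec    = rep(c("A", "B"), each = 5L))

# The original helper: takes a whole table and reads its return column
zoo_fun_dt <- function(dt, N) {
  rollapply(dt$return + 1, N, FUN = prod, fill = NA, align = "right") - 1
}

# For each group, compare the number of rows with the length of the result
lens <- test[, .(group_rows = .N,
                 from_table = length(zoo_fun_dt(test, 3)),  # full table every time
                 from_sd    = length(zoo_fun_dt(.SD, 3))),  # current group only
             by = sec]
lens
```

With the full table the helper returns 10 values per group of 5 rows, which is exactly the mismatch the warnings report; with the group's own rows the lengths agree.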

【Discussion】:

Glad I could help :)