如何对依赖于 R 中先前计算的函数进行矢量化？答案

【问题标题】：How to vectorize a function that depends on a previous calculation in R?如何对依赖于 R 中先前计算的函数进行矢量化？
【发布时间】：2021-11-26 07:23:48
【问题描述】：

我有一些漂移的随机游走。我的目标是创建一个函数，在此data.table 中添加一列，根据其累积百分比增益和百分比回撤标记其所在的“区域”。

library(data.table)

set.seed(1)
# generate random returns with drift
df <- data.table(
    "date" = 1:50,
    "ret" = rnorm(50, mean = .002, sd = .01)
)
# calculate the value of the random-walk over-time
df[, val := cumprod(1 + ret)]
df[, draw_down := val / cummax(val) - 1]

第一个区域出现在第一行并上升，直到出现5% cumulative gain 或2% drawdown。

第二个区域在第一个区域结束后开始一行，一直持续到同样的情况再次发生，5% cumulative gain 或 2% drawdown

这会重复，直到这些事情都没有发生，在这种情况下，区域会继续到最后一行。

这是一个可重现的例子：

# start with the first row and zone of 1
idx <- 1
count <- 1
res <- data.table()
while (idx <= nrow(df)) {

    # grab the start of the zone and all future rows
    tmp <- df[idx:.N]
    # calculate the necessary things
    tmp[, val := cumprod(1 + ret)]
    tmp[, draw_down := val / cummax(val) - 1]

    # find out if we crossed our drawdown threshold
    loss_idx <- which(
        tmp$draw_down == min(tmp$draw_down[tmp$draw_down <= -.02])
    )
    # find out if we crossed gain threshold
    gain_idx <- which(tmp$val == min(tmp$val[tmp$val >= 1.05]))
    # if we have no thresholds, label the rest of the zones
    # and exit
    if (length(loss_idx) == 0 & length(gain_idx) == 0) {
        tmp[, zone := count]
        res <- rbind(res, tmp)
        break
    }
    # mark the zone
    tmp[1:min(gain_idx, loss_idx), zone := count]
    # increment our index
    idx <- tmp[min(gain_idx, loss_idx)]$date + 1
    print(idx)
    # increment our zone
    count <- count + 1
    res <- rbind(res, tmp[!is.na(zone)])
}

我已尝试获取这些区域点出现位置的索引。但是后来我遇到了需要根据最后一个区域的索引重新计算val 和drawdown 的问题。我想不出一种方法来矢量化它。也许在这里使用roll 函数会有效？

问题归结为知道按区域的回撤，但需要前一个区域才能计算回撤。与累积回报类似。如果这个函数依赖于之前的值，是否可以对这个函数进行向量化？

在尝试实现以下所需输出时，我们将不胜感激任何方向的帮助。

想要的输出：

> res
date    ret val draw_down   zone
<int>   <dbl>   <dbl>   <dbl>   <dbl>
1   -0.0042645381   0.9957355   0.0000000000    1
2   0.0038364332    0.9995555   0.0000000000    1
3   -0.0063562861   0.9932021   -0.0063562861   1
4   0.0179528080    1.0110328   0.0000000000    1
5   0.0052950777    1.0163863   0.0000000000    1
6   -0.0062046838   1.0100800   -0.0062046838   1
7   0.0068742905    1.0170236   0.0000000000    1
8   0.0093832471    1.0265665   0.0000000000    1
9   0.0077578135    1.0345305   0.0000000000    1
10  -0.0010538839   1.0334402   -0.0010538839   1
11  0.0171178117    1.0511304   0.0000000000    1
12  0.0058984324    1.0058984   0.0000000000    2
13  -0.0042124058   1.0016612   -0.0042124058   2
14  -0.0201469989   0.9814807   -0.0242745373   2
15  0.0132493092    1.0132493   0.0000000000    3
16  0.0015506639    1.0148205   0.0000000000    3
17  0.0018380974    1.0166859   0.0000000000    3
18  0.0114383621    1.0283151   0.0000000000    3
19  0.0102122120    1.0388164   0.0000000000    3
20  0.0079390132    1.0470636   0.0000000000    3
21  0.0111897737    1.0587800   0.0000000000    3
22  0.0098213630    1.0691787   0.0000000000    3
23  0.0027456498    1.0721143   0.0000000000    3
24  -0.0178935170   1.0529304   -0.0178935170   3
25  0.0081982575    1.0615626   -0.0098419551   3
26  0.0014387126    1.0630899   -0.0084174023   3
27  0.0004420449    1.0635598   -0.0079790782   3
28  -0.0127075238   1.0500446   -0.0205852077   3
29  -0.0027815006   0.9972185   0.0000000000    4
30  0.0061794156    1.0033807   0.0000000000    4
31  0.0155867955    1.0190202   0.0000000000    4
32  0.0009721227    1.0200108   0.0000000000    4
33  0.0058767161    1.0260051   0.0000000000    4
34  0.0014619496    1.0275051   0.0000000000    4
35  -0.0117705956   1.0154108   -0.0117705956   4
36  -0.0021499456   1.0132277   -0.0138952351   4
37  -0.0019428995   1.0112591   -0.0158111376   4
38  0.0014068660    1.0126818   -0.0144265157   4
39  0.0130002537    1.0258469   -0.0016138103   4
40  0.0096317575    1.0357276   0.0000000000    4
41  0.0003547640    1.0360951   0.0000000000    4
42  -0.0005336168   1.0355422   -0.0005336168   4
43  0.0089696338    1.0448306   0.0000000000    4
44  0.0075666320    1.0527365   0.0000000000    4
45  -0.0048875569   0.9951124   0.0000000000    5
46  -0.0050749516   0.9900623   -0.0050749516   5
47  0.0056458196    0.9956520   0.0000000000    5
48  0.0096853292    1.0052952   0.0000000000    5
49  0.0008765379    1.0061764   0.0000000000    5
50  0.0108110773    1.0170543   0.0000000000    5

【问题讨论】：

一个疑问：当draw_down=0 时，你打算如何计算2% drawdown？因为没有阈值，因为 0*0.98 为 0。因此，第 4 行将具有区域 2。
我不确定我是否理解。 2% 的回撤实际上是一个 -2% 的阈值。所以 0 的回撤不会触发任何事情

标签： r data.table vectorization

【解决方案1】：

假设您正在探索矢量化以加快计算速度，这里是使用Rccp 加快计算速度的另一个选项：

library(Rcpp)
cppFunction("IntegerVector zoning(NumericVector idx) {
    int zone = 1, n = idx.size();
    IntegerVector res = IntegerVector(n);
    double x0 = idx[0];


    for (int i = 1; i < n; i++) {
        res[i] = zone;
        if (idx[i]/x0 < 0.98 || idx[i]/x0 > 1.05) {
            if (i+1 < n) {
                x0 = idx[i+1];
            }
            zone++;
        }
    }

    return res;
}")

df[, zone := zoning(c(1, val))[-1L]]

输出：

    date           ret       val zone
 1:    1 -0.0042645381 0.9957355    1
 2:    2  0.0038364332 0.9995555    1
 3:    3 -0.0063562861 0.9932021    1
 4:    4  0.0179528080 1.0110328    1
 5:    5  0.0052950777 1.0163863    1
 6:    6 -0.0062046838 1.0100800    1
 7:    7  0.0068742905 1.0170236    1
 8:    8  0.0093832471 1.0265665    1
 9:    9  0.0077578135 1.0345305    1
10:   10 -0.0010538839 1.0334402    1
11:   11  0.0171178117 1.0511304    1
12:   12  0.0058984324 1.0573304    2
13:   13 -0.0042124058 1.0528765    2
14:   14 -0.0201469989 1.0316642    2
15:   15  0.0132493092 1.0453331    3
16:   16  0.0015506639 1.0469540    3
17:   17  0.0018380974 1.0488784    3
18:   18  0.0114383621 1.0608759    3
19:   19  0.0102122120 1.0717098    3
20:   20  0.0079390132 1.0802181    3
21:   21  0.0111897737 1.0923055    3
22:   22  0.0098213630 1.1030334    3
23:   23  0.0027456498 1.1060620    4
24:   24 -0.0178935170 1.0862706    4
25:   25  0.0081982575 1.0951762    4
26:   26  0.0014387126 1.0967518    4
27:   27  0.0004420449 1.0972366    4
28:   28 -0.0127075238 1.0832934    4
29:   29 -0.0027815006 1.0802803    5
30:   30  0.0061794156 1.0869558    5
31:   31  0.0155867955 1.1038979    5
32:   32  0.0009721227 1.1049710    5
33:   33  0.0058767161 1.1114646    5
34:   34  0.0014619496 1.1130896    5
35:   35 -0.0117705956 1.0999878    5
36:   36 -0.0021499456 1.0976229    5
37:   37 -0.0019428995 1.0954903    5
38:   38  0.0014068660 1.0970316    5
39:   39  0.0130002537 1.1112932    5
40:   40  0.0096317575 1.1219969    5
41:   41  0.0003547640 1.1223950    5
42:   42 -0.0005336168 1.1217961    5
43:   43  0.0089696338 1.1318582    5
44:   44  0.0075666320 1.1404225    5
45:   45 -0.0048875569 1.1348486    6
46:   46 -0.0050749516 1.1290893    6
47:   47  0.0056458196 1.1354640    6
48:   48  0.0096853292 1.1464613    6
49:   49  0.0008765379 1.1474662    6
50:   50  0.0108110773 1.1598716    6
    date           ret       val zone

感谢https://rdrr.io/snippets/

【讨论】：

是的，这非常快！与@r2evans 相比，刚刚运行了基准测试。约 5000 倍。 100,000 行比 5000 毫秒快 1 毫秒
矢量化的原因是速度，但是我的C++很弱。最终这个函数会稍微复杂一些，所以我希望能够在 R 中完成大部分工作。但这可能只是意味着我必须加强我的 C++ 技能。在接受这个作为答案之前，我会尝试更多的矢量化方法！

【解决方案2】：

我不认为滚动计算是正确的方法：通常他们有固定的窗口，而这更动态一些。同样，由于类似的原因，累积操作（例如，cumsum）也不起作用。（这并不是说我不能使用 zoo::rollapply 方法来执行此操作，但我认为它的效率会比推荐的方法低得多。）

这是一个简单的 while 循环，似乎提供了您要求的 zone：

breaks <- integer(0)
rn <- 1L
while (rn <= nrow(df)) {
  theserows <- seq(rn, nrow(df))
  ratios <- df$val[theserows] / df$val[theserows][1]
  upordown <- which(ratios >= 1.05 | ratios <= 0.98)
  if (!length(upordown)) break
  breaks <- c(breaks, upordown[1] + rn)
  rn <- rn + upordown[1]
}
df[, zone := cumsum(seq_len(.N) %in% breaks)]
#      date           ret       val     draw_down  zone
#     <int>         <num>     <num>         <num> <int>
#  1:     1 -0.0042645381 0.9957355  0.0000000000     0
#  2:     2  0.0038364332 0.9995555  0.0000000000     0
#  3:     3 -0.0063562861 0.9932021 -0.0063562861     0
#  4:     4  0.0179528080 1.0110328  0.0000000000     0
#  5:     5  0.0052950777 1.0163863  0.0000000000     0
#  6:     6 -0.0062046838 1.0100800 -0.0062046838     0
#  7:     7  0.0068742905 1.0170236  0.0000000000     0
#  8:     8  0.0093832471 1.0265665  0.0000000000     0
#  9:     9  0.0077578135 1.0345305  0.0000000000     0
# 10:    10 -0.0010538839 1.0334402 -0.0010538839     0
# 11:    11  0.0171178117 1.0511304  0.0000000000     0
# 12:    12  0.0058984324 1.0573304  0.0000000000     1
# 13:    13 -0.0042124058 1.0528765 -0.0042124058     1
# 14:    14 -0.0201469989 1.0316642 -0.0242745373     1
# 15:    15  0.0132493092 1.0453331 -0.0113468490     2
# 16:    16  0.0015506639 1.0469540 -0.0098137803     2
# 17:    17  0.0018380974 1.0488784 -0.0079937216     2
# 18:    18  0.0114383621 1.0608759  0.0000000000     2
# 19:    19  0.0102122120 1.0717098  0.0000000000     2
# 20:    20  0.0079390132 1.0802181  0.0000000000     2
# 21:    21  0.0111897737 1.0923055  0.0000000000     2
# 22:    22  0.0098213630 1.1030334  0.0000000000     2
# 23:    23  0.0027456498 1.1060620  0.0000000000     3
# 24:    24 -0.0178935170 1.0862706 -0.0178935170     3
# 25:    25  0.0081982575 1.0951762 -0.0098419551     3
# 26:    26  0.0014387126 1.0967518 -0.0084174023     3
# 27:    27  0.0004420449 1.0972366 -0.0079790782     3
# 28:    28 -0.0127075238 1.0832934 -0.0205852077     3
# 29:    29 -0.0027815006 1.0802803 -0.0233094505     4
# 30:    30  0.0061794156 1.0869558 -0.0172740737     4
# 31:    31  0.0155867955 1.1038979 -0.0019565256     4
# 32:    32  0.0009721227 1.1049710 -0.0009863049     4
# 33:    33  0.0058767161 1.1114646  0.0000000000     4
# 34:    34  0.0014619496 1.1130896  0.0000000000     4
# 35:    35 -0.0117705956 1.0999878 -0.0117705956     4
# 36:    36 -0.0021499456 1.0976229 -0.0138952351     4
# 37:    37 -0.0019428995 1.0954903 -0.0158111376     4
# 38:    38  0.0014068660 1.0970316 -0.0144265157     4
# 39:    39  0.0130002537 1.1112932 -0.0016138103     4
# 40:    40  0.0096317575 1.1219969  0.0000000000     4
# 41:    41  0.0003547640 1.1223950  0.0000000000     4
# 42:    42 -0.0005336168 1.1217961 -0.0005336168     4
# 43:    43  0.0089696338 1.1318582  0.0000000000     4
# 44:    44  0.0075666320 1.1404225  0.0000000000     4
# 45:    45 -0.0048875569 1.1348486 -0.0048875569     5
# 46:    46 -0.0050749516 1.1290893 -0.0099377044     5
# 47:    47  0.0056458196 1.1354640 -0.0043479913     5
# 48:    48  0.0096853292 1.1464613  0.0000000000     5
# 49:    49  0.0008765379 1.1474662  0.0000000000     5
# 50:    50  0.0108110773 1.1598716  0.0000000000     5
#      date           ret       val     draw_down  zone

还有一个简单的函数来做同样的事情：

func <- function(x, up = 1.05, down = 0.98) {
  breaks <- integer(0)
  if (!length(x)) return(breaks)
  ind <- 1L
  while (ind <= length(x)) {
    theseind <- seq(ind, length(x))
    ratios <- x[theseind] / x[theseind][1]
    upordown <- which(ratios >= up | ratios <= down)
    if (!length(upordown)) break
    breaks <- c(breaks, upordown[1] + ind)
    ind <- ind + upordown[1]
  }
  return(cumsum(seq_along(x) %in% breaks))
}
df[, zone := func(val, 1.05, 0.98) ]

【讨论】：

一旦我回到我的机器上，我将运行一个基准测试。我已经查看了frollapply，但正如您所说，该窗口是预先知道的，或者可以预先计算以使用这些类型的函数。递归地做某事会更快吗？也许是findInterval？我要尝试一些事情
我想不出一个看起来如此高效的矢量化或可滚动解决方案。循环重复（在这种情况下）5 次，而 frollapply 和其他滚动函数促进者倾向于运行 nrow(df) 次，给予或接受，这并不是更好。 R 没有很好地进行尾递归（如果我可以为此设计一个尾递归方法......还没有尝试过），所以它的递归效率在这里有点低。我很想知道您是否找到其他更快的方法。
是的，这非常有效，而且比我目前拥有的要好得多。甚至不值得运行基准测试。努力寻找矢量化解决方案。现在将继续尝试Rcpp 也可能是我可能不得不使用的另一个选项。这最终将在数百万行上运行
data.table 滚动函数如果使用adaptive=TRUE，则没有固定窗口
@jangorecki，自适应窗口的问题（如我所见）是在这个问题中，我们不知道先验的宽度是多少；事实上，我的while 循环的基础是能够确定转换指数，此时我们不再需要frollapply。我真的很感激看看这是否可以轻松完成。感谢您的管道！