Rolling stepwise regression with dplyr: answers

[Question title]: Rolling stepwise regression with dplyr
[Posted]: 2021-04-01 20:13:45
[Question]:

I want to run a rolling stepwise regression using dplyr, do() and rollapply(). My data looks like this:

FUND_DATA <- tibble(
  DATE = 1:10,
  FUND1 = rnorm(10),
  FUND2 = rnorm(10),
  FUND3 = rnorm(10),
  FUND4 = rnorm(10))

These are just sample prices for the funds over periods 1-10. The independent variables look the same:

FACTORS <- tibble(
  DATE = 1:10,
  x1 = rnorm(10),
  x2 = rnorm(10),
  x3 = rnorm(10),
  x4 = rnorm(10))

Now I combine the two tibbles above as follows:

REG_DATA <- FUND_DATA %>%
  pivot_longer(contains("FUND"), names_to = "FUND", values_to = "PRICE") %>%
  arrange(FUND, DATE) %>%
  left_join(., FACTORS, by = "DATE") %>%
  group_by(FUND) %>%
  mutate(RET = PRICE / lag(PRICE) - 1) %>%
  drop_na()

So I end up with one long tibble, grouped by fund:

# A tibble: 36 x 8
# Groups:   FUND [4]
    DATE FUND    PRICE       x1     x2      x3      x4     RET
   <int> <chr>   <dbl>    <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
 1     2 FUND1 -1.19   -0.422   -0.872 -0.292  -0.176  -2.04  
 2     3 FUND1 -0.869   1.60     0.247 -0.610   0.170  -0.272 
 3     4 FUND1 -1.60    0.159   -0.757  0.730  -0.154   0.839 
 4     5 FUND1 -1.58   -0.688   -0.718  0.778   0.879  -0.0103
 5     6 FUND1  1.14   -0.00190 -0.956  1.14   -0.953  -1.72  
 6     7 FUND1 -0.452   0.730   -0.344  0.925  -0.593  -1.40  
 7     8 FUND1 -0.809   0.895   -0.987 -0.0791 -0.0133  0.792 
 8     9 FUND1  1.06   -0.503    1.06   1.96    0.362  -2.31  
 9    10 FUND1  0.0358  0.359   -0.370  1.27    0.129  -0.966 
10     2 FUND2 -0.525  -0.422   -0.872 -0.292  -0.176  -0.229 
# ... with 26 more rows

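On this data I want to run a rolling stepwise regression for each fund and store the R^2 for every rolling window and fund. In other words, a stepwise regression should be run for each window. I came up with the following code: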

ROLLING <- REG_DATA %>% group_by(FUND) %>% do(R2 = rollapply(., width = 2, function(x){
  summary(step(lm(RET ~ x1+x2+x3+x4, 
                  data = .), direction = "both", trace = 0))$r.squared
  },by.column = FALSE,align = "right"))

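The code runs without errors, but the output is the problem. It seems to store only the R^2 of the last rolling window (periods 8-10) and, I think, overwrites the others, so the result looks like this: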

FUND1   c(0.675, 0.675, 0.675,...)
FUND2   c(0.447, 0.447, 0.447,...)
FUND3   .....

Can you help me make the code store the R^2 for each window?

[Comments]:

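I don't think the problem is that the values get overwritten. I think the problem is that you pass the full data set to the model via data = . instead of data = x, so every window fits the same model. I tried to fix it, but simply replacing the former with the latter did not work.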
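
One likely reason the plain replacement fails is that each window reaches the function as rows of a matrix rather than as a data frame, and a matrix that includes the character column FUND cannot stay numeric. The sketch below is a minimal illustration under those assumptions: only numeric columns are passed in, and each window is turned back into a data frame with as.data.frame(x), the same conversion Solution 1 uses. The toy data, the seed, the helper name window_r2 and the width of 6 are illustrative choices, not part of the question.

```r
library(zoo)

set.seed(1)  # illustrative seed, only for reproducibility

# Toy numeric data standing in for one fund (hypothetical values)
toy <- data.frame(
  RET = rnorm(10),
  x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10), x4 = rnorm(10)
)

# Hypothetical helper: R^2 of a stepwise fit on a single window.
# `x` holds the rows of the current window, so the model is fitted
# on the window only, not on the full data set.
window_r2 <- function(x) {
  d <- as.data.frame(x)
  fit <- step(lm(RET ~ x1 + x2 + x3 + x4, data = d),
              direction = "both", trace = 0)
  summary(fit)$r.squared
}

# One R^2 per complete window of 6 rows
r2 <- as.numeric(rollapply(as.matrix(toy), width = 6, FUN = window_r2,
                           by.column = FALSE, align = "right"))
r2
```

Because the function body refers to x instead of ., the values differ from window to window, which is the behaviour the question asks for.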

Tags: r dplyr regression rollapply

[Solution 1]:

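Here is one possible solution for your task, although it uses neither do() nor step(). The approach is to split the funds into separate list items, convert each one to a daily time series, and work from there: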

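library(dplyr)
library(tidyr)
library(zoo)
library(purrr)
# xts and plyr are only called via xts:: and plyr:: below (both must be installed);
# attaching plyr after dplyr would mask dplyr::mutate() and break the grouped lag()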

# your dummy data
FUND_DATA <- tibble(
  DATE = 1:10,
  FUND1 = rnorm(10),
  FUND2 = rnorm(10),
  FUND3 = rnorm(10),
  FUND4 = rnorm(10))
# your dummy data
FACTORS <- tibble(
  DATE = 1:10,
  x1 = rnorm(10),
  x2 = rnorm(10),
  x3 = rnorm(10),
  x4 = rnorm(10))

# first part of your code (had to split it to use it later for naming)
REG_DATA <- FUND_DATA %>%
  tidyr::pivot_longer(contains("FUND"),  names_to = "FUND",
                      values_to = "PRICE") %>%
  dplyr::arrange(FUND,DATE) %>% 
  dplyr::left_join(., FACTORS, by = "DATE") 

# turn it into a list of time series
lts <-  REG_DATA %>%  
  # the core data of a time series is a matrix and allows only one data type (we want numeric, so strip "FUND" and keep only the number)
  dplyr::mutate(FUND = as.numeric(substr(FUND, 5, 5))) %>% 
  group_by(FUND) %>% 
  dplyr::mutate(RET = PRICE/lag(PRICE)-1) %>% 
  drop_na() %>%
  # split by groups into list items
  dplyr::group_split() %>% 
  # convert each list item to a daily time series with one (arbitrary) date per row
  purrr::map( ~ xts::xts(.x, order.by = seq(as.Date("2020-01-01"), by = "day", length.out = nrow(.x))))

# map rollapply over the time series and extract R² => !!! width must be larger than 2: 4 explanatory variables plus the intercept give 5 coefficients, so 6 rows is the minimum that leaves a residual degree of freedom
res <- purrr::map(lts, ~ rollapply(.x,width = 6, 
                  FUN = function(x) 
                  summary(lm(RET ~ x1+x2+x3+x4, data = as.data.frame(x)))$r.squared,
                  by.column = FALSE, align = "right"))

# deconstruct the time series to a data.frame (there might be a better way)
res2 <- purrr::map(res,  ~ data.frame(TS = zoo::index(.x),
                                      R2 = zoo::coredata(.x))) 

# get the unique FUND names and assign them as list item names (you could use a vector instead)
names(res2) <- unique(REG_DATA$FUND)

# condense the list items into one data.frame, using the names assigned above as the .id column
plyr::ldply(res2)


     .id         TS        R2
1  FUND1 2020-01-01        NA
2  FUND1 2020-01-02        NA
3  FUND1 2020-01-03        NA
4  FUND1 2020-01-04        NA
5  FUND1 2020-01-05        NA
6  FUND1 2020-01-06 0.3556052
7  FUND1 2020-01-07 0.7670353
8  FUND1 2020-01-08 0.9077215
9  FUND1 2020-01-09 0.9758644
10 FUND2 2020-01-01        NA
11 FUND2 2020-01-02        NA
12 FUND2 2020-01-03        NA
13 FUND2 2020-01-04        NA
14 FUND2 2020-01-05        NA
15 FUND2 2020-01-06 0.8021993
16 FUND2 2020-01-07 0.8755639
17 FUND2 2020-01-08 0.8206098
18 FUND2 2020-01-09 0.8296576
19 FUND3 2020-01-01        NA
20 FUND3 2020-01-02        NA
21 FUND3 2020-01-03        NA
22 FUND3 2020-01-04        NA
23 FUND3 2020-01-05        NA
24 FUND3 2020-01-06 0.4545569
25 FUND3 2020-01-07 0.4172101
26 FUND3 2020-01-08 0.3604151
27 FUND3 2020-01-09 0.9877962
28 FUND4 2020-01-01        NA
29 FUND4 2020-01-02        NA
30 FUND4 2020-01-03        NA
31 FUND4 2020-01-04        NA
32 FUND4 2020-01-05        NA
33 FUND4 2020-01-06 0.9541878
34 FUND4 2020-01-07 0.9973588
35 FUND4 2020-01-08 0.9991080
36 FUND4 2020-01-09 0.9965382
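
Since the solution above drops step(), here is a hedged sketch of how the stepwise selection might be put back while staying in a grouped dplyr pipeline. It swaps the list of xts objects for dplyr::group_modify(), which is a different technique from the one shown above; the helper name step_r2, the seed, the reduction to two funds and the width of 6 are assumptions made for illustration only.

```r
library(dplyr)
library(tidyr)
library(zoo)

set.seed(42)  # illustrative seed, only for reproducibility

# Smaller version of the dummy data (two funds instead of four)
FUND_DATA <- tibble(DATE = 1:10, FUND1 = rnorm(10), FUND2 = rnorm(10))
FACTORS <- tibble(DATE = 1:10, x1 = rnorm(10), x2 = rnorm(10),
                  x3 = rnorm(10), x4 = rnorm(10))

REG_DATA <- FUND_DATA %>%
  pivot_longer(contains("FUND"), names_to = "FUND", values_to = "PRICE") %>%
  arrange(FUND, DATE) %>%
  left_join(FACTORS, by = "DATE") %>%
  group_by(FUND) %>%
  mutate(RET = PRICE / lag(PRICE) - 1) %>%
  drop_na()

# Hypothetical helper: stepwise regression on one window, returning its R^2
step_r2 <- function(x) {
  d <- as.data.frame(x)
  fit <- step(lm(RET ~ x1 + x2 + x3 + x4, data = d),
              direction = "both", trace = 0)
  summary(fit)$r.squared
}

ROLLING <- REG_DATA %>%
  group_modify(~ {
    # keep only the numeric model columns so each window is a numeric matrix
    num <- as.matrix(.x[, c("RET", "x1", "x2", "x3", "x4")])
    tibble(
      DATE = .x$DATE,
      # fill = NA pads the incomplete windows, so R2 lines up with DATE
      R2 = as.numeric(rollapply(num, width = 6, FUN = step_r2,
                                by.column = FALSE, align = "right", fill = NA))
    )
  }) %>%
  ungroup()

ROLLING
```

The result has one row per fund and date, with NA in R2 until a full window of 6 rows is available. With so few rows per window the stepwise fits are very fragile, so in practice a much longer window would be advisable.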

[Comments]: