带置信区间的滚动回归（tidyverse）答案

【问题标题】：rolling regression with confidence interval (tidyverse)带置信区间的滚动回归（tidyverse）
【发布时间】：2018-08-08 18:13:09
【问题描述】：

这与rolling regression by group in the tidyverse?有关

再次考虑这个简单的例子

library(dplyr)
library(purrr)
library(broom)
library(zoo)
library(lubridate)

mydata = data_frame('group' = c('a','a', 'a','a','b', 'b', 'b', 'b'),
                     'y' = c(1,2,3,4,2,3,4,5),
                     'x' = c(2,4,6,8,6,9,12,15),
                     'date' = c(ymd('2016-06-01', '2016-06-02', '2016-06-03', '2016-06-04',
                                    '2016-06-03', '2016-06-04', '2016-06-05','2016-06-06')))

  group     y     x date      
  <chr> <dbl> <dbl> <date>    
1 a      1.00  2.00 2016-06-01
2 a      2.00  4.00 2016-06-02
3 a      3.00  6.00 2016-06-03
4 a      4.00  8.00 2016-06-04
5 b      2.00  6.00 2016-06-03
6 b      3.00  9.00 2016-06-04
7 b      4.00 12.0  2016-06-05
8 b      5.00 15.0  2016-06-06

我在这里要做的很简单。

对于每个组（在本例中为 a 或 b）：

在最后 2 次观察中计算 y 在 x 上的滚动回归。
将该滚动回归的系数及其置信区间存储在数据框的一列中。

我尝试修改上面现有的解决方案，但添加置信区间被证明是困难的，所以这可行（没有置信区间）：

Coef <- . %>% as.data.frame %>% lm %>% coef

mydata %>% 
  group_by(group) %>% 
  do(cbind(reg_col = select(., y, x) %>% rollapplyr(2, Coef, by.column = FALSE, fill = NA),
           date_col = select(., date))) %>%
  ungroup

# A tibble: 8 x 4
  group `reg_col.(Intercept)` reg_col.x date      
  <chr>                 <dbl>     <dbl> <date>    
1 a      NA                      NA     2016-06-01
2 a       0                       0.5   2016-06-02
3 a       0                       0.5   2016-06-03
4 a       0                       0.5   2016-06-04
5 b      NA                      NA     2016-06-03
6 b       0.00000000000000126     0.333 2016-06-04
7 b      -0.00000000000000251     0.333 2016-06-05
8 b       0                       0.333 2016-06-06

但是，THIS 不起作用（有置信区间）:-(

Coef <- . %>% as.data.frame %>% lm  %>% tidy(., conf.int = TRUE) %>% as_tibble()

> mydata %>% 
+   group_by(group) %>% 
+   do(reg_col = select(., y, x) %>% rollapplyr(2, Coef, by.column = FALSE, fill = NA)) %>%
+   ungroup()
# A tibble: 2 x 2
  group reg_col      
* <chr> <list>       
1 a     <dbl [4 x 2]>
2 b     <dbl [4 x 2]>

这个list-column 非常奇怪。任何想法这里缺少什么？

谢谢！！

【问题讨论】：

Coef 函数应该返回一个向量，而不是数据框或 tibble。
嗨，格洛腾迪克，你来了！感谢您的参与。是的，但我也需要使用这些tidy 友好的软件包。我正在尝试类似Coef <- . %>% as.data.frame %>% lm %>% tidy(., conf.int = TRUE) %>% as_tibble() %>% filter(term == 'surp')
@G.Grothendieck 我在下面尝试了一些方法。请你看看，看看你是否可以改进？我当然很乐意接受你的提议！谢谢！
@G.Grothendieck 我不明白为什么我的输出中有这些烦人的引号......

标签： r dplyr zoo broom

【解决方案1】：

试试这个：

library(dplyr)
library(zoo)

# use better example
set.seed(123)
mydata2 <- mydata %>% mutate(y = jitter(y))

stats <- function(x) {
  fm <- lm(as.data.frame(x))
  slope <- coef(fm)[[2]]
  ci <- confint(fm)[2, ]
  c(slope = slope, conf.lower = ci[[1]], conf.upper = ci[[2]])
}

roll <- function(x) rollapplyr(x, 3, stats, by.column = FALSE, fill = NA)
  
mydata2 %>%
  group_by(group) %>%
  do(cbind(., select(., y, x) %>% roll)) %>%
  ungroup

给予：

# A tibble: 8 x 7
  group     y     x date        slope conf.lower conf.upper
  <chr> <dbl> <dbl> <date>      <dbl>      <dbl>      <dbl>
1 a     0.915     2 2016-06-01 NA         NA         NA    
2 a     2.12      4 2016-06-02 NA         NA         NA    
3 a     2.96      6 2016-06-03  0.512     -0.133      1.16 
4 a     4.15      8 2016-06-04  0.509     -0.117      1.14 
5 b     2.18      6 2016-06-03 NA         NA         NA    
6 b     2.82      9 2016-06-04 NA         NA         NA    
7 b     4.01     12 2016-06-05  0.306     -0.368      0.980
8 b     5.16     15 2016-06-06  0.390      0.332      0.448

更新

自从这个问题第一次出现以来，dplyr 得到了group_modify，它可以用来替换do。 ?group_modify 帮助页面说：group_modify() 是 do() 的演变，如果您以前使用过它的话。

mydata2 %>%
  group_by(group) %>%
  group_modify(~ cbind(., select(., y, x) %>% roll)) %>%
  ungroup

【讨论】：

非常酷！！我不太明白为什么用 tidyverse stats <- function(x) { fm <- lm(as.data.frame(x), formula = 'y~ x +I(x^2)') tidy(fm) %>% filter(term == 'x') %>% pluck() } 替换你的 stats 不起作用。我的意思是，我们不只是在这里提供一个向量吗？
哦，我明白了，这是因为混合数据类型。 stats <- function(x) { fm <- lm(as.data.frame(x), formula = 'y~ x +I(x^2)') tidy(fm) %>% filter(term == 'x') %>% select(-term) %>% slice(1) %>% unlist() } 就像一个魅力。谢谢格洛腾迪克！

【解决方案2】：

这是我目前的尝试，还有很大的改进空间......

为 CI 使用更大的数据

mytest = data_frame('group' = c('a','a', 'a','a','a','a', 'a','a','b', 'b', 'b', 'b','b', 'b', 'b', 'b'),
                    'y' = c(1,2,3,4,2,3,4,5,2,4,6,8,6,9,12,15),
                    'x' = c(2,4,6,8,6,9,12,15,4,2,3,4,5,2,4,6),
                    'date' = c(ymd('2016-06-01', '2016-06-02', '2016-06-03', '2016-06-04',
                                   '2016-06-05', '2016-06-06', '2016-06-07', '2016-06-08',
                                   '2016-06-03', '2016-06-04', '2016-06-05','2016-06-06',
                                   '2016-06-05', '2016-06-06', '2016-06-07', '2016-06-08')))

然后

Coef <- . %>% as.data.frame %>% lm  %>% tidy(., conf.int = TRUE) %>% 
  as_tibble() %>% filter(term == 'x')

final_df <- mytest %>% 
  group_by(group) %>% 
  do(bind_cols(select(., y, x) %>% 
       rollapplyr(4, Coef, by.column = FALSE, fill = NA) %>% 
       as_data_frame(), 
     select(., date))) %>%
  ungroup() 


# A tibble: 16 x 9
   group term  estimate     std.error    statistic      p.value      conf.low     conf.high date      
   <chr> <chr> <chr>        <chr>        <chr>          <chr>        <chr>        <chr>     <date>    
 1 a     NA    NA           NA           NA             NA           NA           NA        2016-06-01
 2 a     NA    NA           NA           NA             NA           NA           NA        2016-06-02
 3 a     NA    NA           NA           NA             NA           NA           NA        2016-06-03
 4 a     x     0.5000000    0.000000e+00 "         Inf" 0.000000e+00 " 0.5000000" 0.5000000 2016-06-04
 5 a     x     0.5000000    2.165064e-01 2.309401e+00   1.471971e-01 -0.4315516   1.4315516 2016-06-05
 6 a     x     0.2962963    3.228814e-01 9.176629e-01   4.556689e-01 -1.0929503   1.6855428 2016-06-06
 7 a     x     0.2800000    1.847521e-01 1.515544e+00   2.688738e-01 -0.5149241   1.0749241 2016-06-07
 8 a     x     0.3333333    5.233642e-17 6.369052e+15   2.465190e-32 " 0.3333333" 0.3333333 2016-06-08
 9 b     NA    NA           NA           NA             NA           NA           NA        2016-06-03
10 b     NA    NA           NA           NA             NA           NA           NA        2016-06-04
11 b     NA    NA           NA           NA             NA           NA           NA        2016-06-05
12 b     x     " 0.3636364" 1.8895100    " 0.1924501"   0.8651600    -7.766269    8.493542  2016-06-06
13 b     x     " 0.8000000" 0.6928203    " 1.1547005"   0.3675445    -2.180965    3.780965  2016-06-05
14 b     x     -0.7000000   0.6557439    -1.0674900     0.3975359    -3.521438    2.121438  2016-06-06
15 b     x     -0.6842105   1.3189436    -0.5187565     0.6556216    -6.359167    4.990746  2016-06-07
16 b     x     " 0.8571429" 1.4846150    " 0.5773503"   0.6220355    -5.530640    7.244926  2016-06-08
Warning messages:
1: In summary.lm(x) : essentially perfect fit: summary may be unreliable
2: In summary.lm(object) :
  essentially perfect fit: summary may be unreliable
3: In summary.lm(x) : essentially perfect fit: summary may be unreliable
4: In summary.lm(object) :
  essentially perfect fit: summary may be unreliable

【讨论】：