ggplot2: how to get robust confidence interval for predictions in geom_smooth? (Answers)

【Question title】: ggplot2: how to get robust confidence interval for predictions in geom_smooth?
【Posted】: 2017-12-31 23:32:46
【Question】:

Consider this simple example:

library(tidyverse)

dataframe <- tibble(x = c(1,2,3,4,5,6),
                    y = c(12,24,24,34,12,15))
> dataframe
# A tibble: 6 x 2
      x     y
  <dbl> <dbl>
1     1    12
2     2    24
3     3    24
4     4    34
5     5    12
6     6    15    

dataframe %>% ggplot(., aes(x = x, y = y)) + 
geom_point() + 
geom_smooth(method = 'lm', formula = y~x)

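Here the standard errors are computed using the default option. However, I would like to use the robust variance-covariance matrix available in the sandwich and lmtest packages.

That is, using vcovHC(mymodel, "HC3").

Is there a simple way to get this with the geom_smooth() function?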

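【Comments】:

You can't do this directly in ggplot2. You need to generate the upper and lower confidence bands manually with sandwich and then feed them to geom_ribbon(). When doing this, make sure to set se = FALSE in geom_smooth() so that only the geom_ribbon is shown.
@noah, interesting. Would you mind posting a solution?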
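
A minimal sketch of the approach described in the first comment, assuming the sandwich package is installed. The names fit_lm, V_hc3, se_fit and band are illustrative and do not come from the thread. The 95% band uses a t quantile on the residual degrees of freedom, which is one common choice rather than the only one.

```r
library(ggplot2)
library(sandwich)

d <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                y = c(12, 24, 24, 34, 12, 15))

fit_lm <- lm(y ~ x, data = d)

## HC3 variance-covariance matrix of the coefficients (2 x 2 here)
V_hc3 <- vcovHC(fit_lm, type = "HC3")

## Point-wise SE of the fitted mean: sqrt(diag(X V X')), one value per row
X <- model.matrix(fit_lm)
se_fit <- sqrt(rowSums((X %*% V_hc3) * X))

## 95% confidence band around the fitted line
crit <- qt(0.975, df = fit_lm$df.residual)
fit_vals <- unname(fitted(fit_lm))
band <- data.frame(
  x   = d$x,
  fit = fit_vals,
  lwr = fit_vals - crit * unname(se_fit),
  upr = fit_vals + crit * unname(se_fit)
)

## Ribbon carries the robust band; geom_smooth() only draws the line
p <- ggplot(d, aes(x = x, y = y)) +
  geom_point() +
  geom_ribbon(data = band, aes(x = x, ymin = lwr, ymax = upr),
              inherit.aes = FALSE, alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p
```

Swapping type = "HC3" for another type accepted by vcovHC() changes the band without touching the rest of the code.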

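Tags: r ggplot2 regression

【Solution 1】:

UPDATE: 2021-03-17. It was recently pointed out to me that the ggeffects package handles different VCOVs automatically, including the more complicated HAC case that I originally demonstrated below. A quick example of the latter: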

library(ggeffects)
library(sandwich)  ## For HAC and other robust VCOVs

d <- data.frame(x = c(1,2,3,4,5,6),
                y = c(12,24,24,34,12,15))

reg1 <- lm(y ~ x, data = d)

plot(ggpredict(reg1, "x", vcov.fun = "vcovHAC"))
#> Loading required namespace: ggplot2

## This gives you a regular ggplot2 object. So you can add layers as you
## normally would. E.g. If you'd like to compare with the original data...
library(ggplot2)
last_plot() +
    geom_point(data = d, aes(x, y)) +
    labs(caption = 'Shaded region indicates HAC 95% CI.')

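Created on 2021-03-17 by the reprex package (v1.0.0)

My original answer follows...

HC robust SEs (simple)

This is now easy to do thanks to the estimatr package and its lm_robust family of functions. For example: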

library(tidyverse)
library(estimatr)

d <- data.frame(x = c(1,2,3,4,5,6),
                y = c(12,24,24,34,12,15))

d %>% 
  ggplot(aes(x = x, y = y)) + 
  geom_point() + 
  geom_smooth(method = 'lm_robust', formula = y~x, fill="#E41A1C") + ## Robust (HC) SEs
  geom_smooth(method = 'lm', formula = y~x, col = "grey50") + ## Just for comparison
  labs(
    title = "Plotting HC robust SEs in ggplot2",
    subtitle = "Regular SEs in grey for comparison"
    ) +
  theme_minimal()

Created on 2020-03-08 by the reprex package (v0.3.0)
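
The plot above relies on the lm_robust() default, which is se_type = "HC2". If HC3 is wanted, as in the question, one option is to pass the argument through geom_smooth()'s method.args. The following is a minimal sketch, assuming estimatr and sandwich are installed. The names fit_hc3, se_sandwich and p_hc3 are illustrative, and the comparison against sandwich is only a sanity check that the two packages should agree up to numerical tolerance.

```r
library(ggplot2)
library(estimatr)
library(sandwich)

d <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                y = c(12, 24, 24, 34, 12, 15))

## lm_robust() with HC3 SEs, fitted directly so the numbers can be inspected
fit_hc3 <- lm_robust(y ~ x, data = d, se_type = "HC3")

## Coefficient SEs from sandwich for the same OLS fit, for comparison
se_sandwich <- sqrt(diag(vcovHC(lm(y ~ x, data = d), type = "HC3")))
all.equal(unname(fit_hc3$std.error), unname(se_sandwich))

## Same choice inside geom_smooth(): extra arguments go through method.args
p_hc3 <- ggplot(d, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm_robust", formula = y ~ x,
              method.args = list(se_type = "HC3"))
p_hc3
```

This keeps everything inside geom_smooth(), at the cost of being limited to the SE types that lm_robust() itself supports.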

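HAC robust SEs (a bit more legwork)

One thing to note is that estimatr does not yet offer support for HAC (i.e. heteroskedasticity and autocorrelation consistent) SEs a la Newey-West. However, it is possible to obtain these manually with the sandwich package... which is what the original question was asking about anyway. You can then plot them using geom_ribbon().

I'll say for the record that HAC SEs don't make much sense for this particular dataset. But here is an example of how you could do it, riffing off this excellent SO answer on a related topic.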

library(tidyverse)
library(sandwich)

d <- data.frame(x = c(1,2,3,4,5,6),
                y = c(12,24,24,34,12,15))

reg1 <- lm(y~x, data = d)

## Generate a prediction DF
pred_df <- data.frame(fit = predict(reg1))

## Get the design matrix
X_mat <- model.matrix(reg1)

## Get HAC VCOV matrix and calculate SEs
v_hac <- NeweyWest(reg1, prewhite = FALSE, adjust = TRUE) ## HAC VCOV (adjusted for small data sample)
#> Warning in meatHAC(x, order.by = order.by, prewhite = prewhite, weights =
#> weights, : more weights than observations, only first n used
var_fit_hac <- rowSums((X_mat %*% v_hac) * X_mat)  ## Point-wise variance for predicted mean
se_fit_hac <- sqrt(var_fit_hac) ## SEs

## Add these to pred_df and calculate the 95% CI
pred_df <-
  pred_df %>%
  mutate(se_fit_hac = se_fit_hac) %>%
  mutate(
    lwr_hac = fit - qt(0.975, df=reg1$df.residual)*se_fit_hac,
    upr_hac = fit + qt(0.975, df=reg1$df.residual)*se_fit_hac
    )

pred_df
#>        fit se_fit_hac   lwr_hac  upr_hac
#> 1 20.95238   4.250961  9.149822 32.75494
#> 2 20.63810   2.945392 12.460377 28.81581
#> 3 20.32381   1.986900 14.807291 25.84033
#> 4 20.00952   1.971797 14.534936 25.48411
#> 5 19.69524   2.914785 11.602497 27.78798
#> 6 19.38095   4.215654  7.676421 31.08548

## Plot it
bind_cols(
  d,
  pred_df
  ) %>%
  ggplot(aes(x = x, y = y, ymin=lwr_hac, ymax=upr_hac)) + 
  geom_point() + 
  geom_ribbon(fill="#E41A1C", alpha=0.3, col=NA) + ## Robust (HAC) SEs
  geom_smooth(method = 'lm', formula = y~x, col = "grey50") + ## Just for comparison
  labs(
    title = "Plotting HAC SEs in ggplot2",
    subtitle = "Regular SEs in grey for comparison",
    caption = "Note: Do HAC SEs make sense for this dataset? Definitely not!"
    ) +
  theme_minimal()

Created on 2020-03-08 by the reprex package (v0.3.0)

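Note that you can also use this approach to manually calculate and plot other robust SE predictions (e.g. HC1, HC2, etc.) if you wish. All you need to do is use the relevant sandwich estimator. For instance, using vcovHC(reg1, type = "HC2") instead of NeweyWest(reg1, prewhite = FALSE, adjust = TRUE) will give you the same HC-robust CI as the first example that uses the estimatr package.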
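
The key step in the HAC example is rowSums((X_mat %*% v_hac) * X_mat), which computes the diagonal of X V X' without forming the full n x n matrix. One way to check that step is to plug in the classical vcov() and compare the result with the SEs that predict() reports. A small sketch in base R only; v_ols, var_short, var_full and se_builtin are illustrative names:

```r
d <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                y = c(12, 24, 24, 34, 12, 15))

reg1 <- lm(y ~ x, data = d)
X_mat <- model.matrix(reg1)

## Classical (non-robust) coefficient VCOV
v_ols <- vcov(reg1)

## Row-wise shortcut vs. the explicit diagonal of X V X'
var_short <- rowSums((X_mat %*% v_ols) * X_mat)
var_full  <- diag(X_mat %*% v_ols %*% t(X_mat))
all.equal(unname(var_short), unname(var_full))
#> [1] TRUE

## With the classical VCOV, the shortcut reproduces predict()'s own SEs
se_builtin <- predict(reg1, se.fit = TRUE)$se.fit
all.equal(unname(sqrt(var_short)), unname(se_builtin))
#> [1] TRUE
```

Replacing v_ols with a matrix from NeweyWest() or vcovHC() is then the only change needed to get the robust version of the same quantity.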

【Comments】:

【Solution 2】:

I am very new to this whole robust SE thing, but I was able to generate the following:

zz = '
x y
1     1    12
2     2    24
3     3    24
4     4    34
5     5    12
6     6    15 
'

df <- read.table(text = zz, header = TRUE)
df

library(sandwich)
library(lmtest)

lm.model<-lm(y ~ x, data = df)
coef(lm.model)
se = sqrt(diag(vcovHC(lm.model, type = "HC3")))
fit = predict(lm.model)
predframe <- with(df,data.frame(x,
                                y = fit,
                                lwr = fit - 1.96 * se,
                                upr = fit + 1.96 * se))

library(ggplot2)
ggplot(df, aes(x = x, y = y))+
  geom_point()+
  geom_line(data = predframe)+
  geom_ribbon(data = predframe, aes(ymin = lwr,ymax = upr), alpha = 0.3)

【Comments】:

I don't think your calculation is correct, unfortunately.
Uncertainty goes down when new data is found, not up. Look at kriging.
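
One likely reason for the first comment: sqrt(diag(vcovHC(...))) returns the standard errors of the two coefficients (intercept and slope), not of the six fitted values. R then silently recycles that length-2 vector across the six rows, so the band alternates between two widths. Below is a sketch of the diagnosis and one possible fix. It reuses the answer's df, lm.model and predframe names, while se_coef and se_fit are illustrative.

```r
library(sandwich)

df <- data.frame(x = 1:6, y = c(12, 24, 24, 34, 12, 15))
lm.model <- lm(y ~ x, data = df)

## SEs of the coefficients: one per coefficient
se_coef <- sqrt(diag(vcovHC(lm.model, type = "HC3")))
length(se_coef)
#> [1] 2

## SEs of the fitted values: one per row, via sqrt(diag(X V X'))
X <- model.matrix(lm.model)
se_fit <- sqrt(rowSums((X %*% vcovHC(lm.model, type = "HC3")) * X))
length(se_fit)
#> [1] 6

## Drop-in replacement for the answer's predframe
fit <- predict(lm.model)
predframe <- data.frame(x = df$x,
                        y = fit,
                        lwr = fit - 1.96 * se_fit,
                        upr = fit + 1.96 * se_fit)
```

The answer's ggplot() call can be used unchanged with this predframe. With only six observations, a t quantile such as qt(0.975, df = lm.model$df.residual) may be a more cautious choice than 1.96.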