Plotting 99% confidence interval and prediction interval struggle in R: answers

【Question title】: Plotting 99% confidence interval and prediction interval struggle in R
【Posted】: 2021-12-21 03:44:02
【Question】:

I am working with the Auto data in R and I need to plot confidence intervals, but I am struggling. This is what I have so far:

I have created the linear regression model:

my_acc<-auto_df$acceleration
my_horse<-auto_df$horsepower
mydata <- data.frame(my_acc, my_horse )

car_linear_regression <- lm(my_acc ~ my_horse, mydata )

As the exercise requires, I have created the confidence interval for ONE prediction:

conf_int<-predict(car_linear_regression,newdata = data.frame(my_horse = 93.5),interval = 'confidence' )
# the column name in newdata (my_horse) must match the predictor name used in the model

pred_int<-predict(car_linear_regression,newdata = data.frame(my_horse = 93.5),interval = 'prediction' )

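Then I tried to plot everything together, but I am completely stuck. I can plot the data with the regression line, but adding the interval only gives me this error:

Error in xy.coords(x, y) : 'x' and 'y' lengths differ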

plot(my_acc ~ my_horse   , data = mydata, pch = 20, cex  = 1.5, col="blue", xlab=" car horsepower", ylab = "acceleration secs to 100km/h", main = "Confidence intervals and prediction intervals")
abline(car_linear_regression, lwd = 5,  col="red" )

lines(mydata$my_horse, conf_int[,"lwr"], col="red", type="b", pch="+")

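【Comments】:

What exactly are you trying to plot? A scatter plot with my_acc and my_horse on the axes, plus an annotation showing where the confidence interval for my_acc falls?
I want to plot the data as a scatter plot with the regression line (I have already done both), and I want to add the confidence interval conf_int and the prediction interval pred_int. I get an error saying the lengths differ, which as far as I can tell they do not, at least from my limited understanding.

Tags: r plot confidence-interval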

【Solution 1】:

For the plot you definitely need predictions over the whole range of the predictor, i.e. from the min to the max of horsepower.
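To see why the call in the question fails, here is a minimal sketch. It uses the built-in mtcars data (qsec against hp) as a stand-in for the Auto data, which is an assumption made only so the snippet runs without extra packages.

```r
# Stand-in data (assumption): mtcars instead of the Auto data from the question.
fit <- lm(qsec ~ hp, data = mtcars)

# A single new observation yields a 1-row matrix with columns fit, lwr, upr.
one_point <- predict(fit, newdata = data.frame(hp = 93.5),
                     interval = "confidence", level = 0.99)
dim(one_point)       # one row, three columns
length(mtcars$hp)    # number of x values in the data

# lines(mtcars$hp, one_point[, "lwr"]) pairs many x values with a single y value,
# which is what triggers "'x' and 'y' lengths differ".
mismatch <- length(mtcars$hp) != length(one_point[, "lwr"])
mismatch
```

So the interval has to be computed at as many x values as you want to draw, not at one point.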

data('Auto', package='ISLR')  

fo <- acceleration ~ horsepower  ## formula object for re-use

fit <- lm(fo, Auto)

We need a sequence that spans the range of the predictor horsepower, so looking at the summary is helpful.

summary(Auto)

Then we create a sequence for plotting with a reasonable step size. These are the x values that lines will use to draw the curves.

n_data <- with(Auto, seq(min(horsepower), max(horsepower), by=1))
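As a side note, the grid can be built either with a fixed step or with a fixed number of points. The sketch below again uses mtcars$hp as a stand-in predictor (an assumption, not the Auto data); the name grid_len and the choice of 200 points are arbitrary.

```r
hp <- mtcars$hp  # stand-in predictor (assumption)

# Fixed step, as in the answer: one grid point per unit of horsepower.
grid_by <- seq(min(hp), max(hp), by = 1)

# Fixed number of points: useful when the predictor is not on an integer scale.
grid_len <- seq(min(hp), max(hp), length.out = 200)

range(grid_by)
range(grid_len)
```

With by=, the sequence can stop short of the maximum if the range is not a whole multiple of the step; length.out= always ends at the maximum.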

Now compute the predictions over that sequence,

conf_int <- predict(fit, newdata=list(horsepower=n_data), 
                    interval='confidence', level=.99)
pred_int <- predict(fit, newdata=list(horsepower=n_data), 
                    interval='prediction', level=.99)
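Before plotting, it can help to check what predict returns. The sketch below, again on mtcars as a stand-in (an assumption), confirms that each result is a matrix with one row per grid point and columns fit, lwr and upr, and that the prediction band is wider than the confidence band everywhere.

```r
# Stand-in data (assumption): mtcars instead of Auto.
fit  <- lm(qsec ~ hp, data = mtcars)
grid <- seq(min(mtcars$hp), max(mtcars$hp), by = 1)

ci <- predict(fit, newdata = list(hp = grid), interval = "confidence", level = 0.99)
pr <- predict(fit, newdata = list(hp = grid), interval = "prediction", level = 0.99)

colnames(ci)                 # fit, lwr, upr
nrow(ci) == length(grid)     # one row per grid point, so x and y lengths match

# The prediction interval also accounts for the residual variance,
# so it is wider than the confidence interval at every grid point.
ci_width <- ci[, "upr"] - ci[, "lwr"]
pr_width <- pr[, "upr"] - pr[, "lwr"]
all(pr_width > ci_width)
```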

Then plot everything.

plot(fo, data=Auto, pch=20, cex=1, col="blue", 
     xlab=" car horsepower", ylab="acceleration secs to 100km/h", 
     main="Confidence intervals and prediction intervals", xlim=range(n_data))
abline(fit, lwd=2, col="red")
matlines(n_data, conf_int[, 2:3], lty='dashed', col="red", lwd=2)
matlines(n_data, pred_int[, 2:3], lty='dashed', col="green", lwd=2)
legend('topright', legend=c('conf_int', 'pred_int'), col=c("red", "green"),
       lty=2, lwd=2)

Note that I used matlines here because it is more concise; you could equally use lines(n_data, conf_int[, 2], ..) and lines(n_data, conf_int[, 3], ..) if you prefer.
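For completeness, the lines variant might look like the sketch below. It uses mtcars as a stand-in for Auto (an assumption) and selects the bounds by column name instead of by position.

```r
# Stand-in data (assumption): mtcars instead of Auto.
fit  <- lm(qsec ~ hp, data = mtcars)
grid <- seq(min(mtcars$hp), max(mtcars$hp), by = 1)

ci <- predict(fit, newdata = list(hp = grid), interval = "confidence", level = 0.99)
pr <- predict(fit, newdata = list(hp = grid), interval = "prediction", level = 0.99)

plot(qsec ~ hp, data = mtcars, pch = 20, col = "blue")
abline(fit, lwd = 2, col = "red")

# One lines() call per bound; x and y have the same length because
# both come from the same grid.
lines(grid, ci[, "lwr"], lty = "dashed", col = "red",   lwd = 2)
lines(grid, ci[, "upr"], lty = "dashed", col = "red",   lwd = 2)
lines(grid, pr[, "lwr"], lty = "dashed", col = "green", lwd = 2)
lines(grid, pr[, "upr"], lty = "dashed", col = "green", lwd = 2)
```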

【Comments】:

I think this is great, thank you.

【Solution 2】:

You can use ggplot2 and attach different data to different plot layers, or use base R only:

library(tidyverse)

auto_df <- read_delim(
  file = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
  delim = " ",
  col_names = FALSE
)
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   X1 = col_double(),
#>   X2 = col_character(),
#>   X3 = col_character(),
#>   X4 = col_character(),
#>   X5 = col_character(),
#>   X6 = col_character(),
#>   X7 = col_character(),
#>   X8 = col_character(),
#>   X9 = col_character(),
#>   X10 = col_character()
#> )
#> Warning: 246 parsing failures.
#> row col   expected     actual                                                                               file
#>   3  -- 10 columns 9 columns  'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
#>   5  -- 10 columns 9 columns  'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
#>   7  -- 10 columns 9 columns  'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
#>   9  -- 10 columns 9 columns  'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
#>  14  -- 10 columns 11 columns 'https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
#> ... ... .......... .......... ..................................................................................
#> See problems(...) for more details.
auto_df
#> # A tibble: 398 x 10
#>       X1 X2    X3      X4       X5       X6       X7    X8        X9      X10   
#>    <dbl> <chr> <chr>   <chr>    <chr>    <chr>    <chr> <chr>     <chr>   <chr> 
#>  1    18 "  8" "  307… "     1… "     3… "     1… "  7… " 1\t\"c… "cheve… "mali…
#>  2    15 "  8" "  350… "     1… "     3… "     1… "  7… " 1\t\"b… "skyla… "320\…
#>  3    18 "  8" "  318… "     1… "     3… "     1… "  7… " 1\t\"p… "satel…  <NA> 
#>  4    16 "  8" "  304… "     1… "     3… "     1… "  7… " 1\t\"a… "rebel" "sst\…
#>  5    17 "  8" "  302… "     1… "     3… "     1… "  7… " 1\t\"f… "torin…  <NA> 
#>  6    15 "  8" "  429… "     1… "     4… "     1… "  7… " 1\t\"f… "galax… "500\…
#>  7    14 "  8" "  454… "     2… "     4… "      … "  7… " 1\t\"c… "impal…  <NA> 
#>  8    14 "  8" "  440… "     2… "     4… "      … "  7… " 1\t\"p… "fury"  "iii\…
#>  9    14 "  8" "  455… "     2… "     4… "     1… "  7… " 1\t\"p… "catal…  <NA> 
#> 10    15 "  8" "  390… "     1… "     3… "      … "  7… " 1\t\"a… "ambas… "dpl\…
#> # … with 388 more rows

mydata <- tibble(my_acc = auto_df$X6, my_horse = auto_df$X4) %>% mutate_all(as.numeric)
#> Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
mydata
#> # A tibble: 398 x 2
#>    my_acc my_horse
#>     <dbl>    <dbl>
#>  1   12        130
#>  2   11.5      165
#>  3   11        150
#>  4   12        150
#>  5   10.5      140
#>  6   10        198
#>  7    9        220
#>  8    8.5      215
#>  9   10        225
#> 10    8.5      190
#> # … with 388 more rows

car_linear_regression <- lm(my_acc ~ my_horse, mydata)

conf_int <- predict(car_linear_regression, newdata = data.frame(my_horse = 93.5), interval = "confidence")
conf_int
#>       fit     lwr      upr
#> 1 16.0832 15.8765 16.28989
pred_int <- predict(car_linear_regression, newdata = data.frame(my_horse = 93.5), interval = "prediction")
pred_int
#>       fit      lwr      upr
#> 1 16.0832 12.14256 20.02383

# ggplot way
ggplot() +
  geom_rect(
    data = pred_int %>% as.data.frame(),
    mapping = aes(xmin = lwr, xmax = upr, ymin = -Inf, ymax = Inf),
    fill = "BurlyWood"
  ) +
  geom_point(
    data = mydata,
    mapping = aes(my_acc, my_horse)
  ) +
  stat_smooth(
    data = mydata,
    mapping = aes(my_acc, my_horse),
    method = "lm"
  )
#> `geom_smooth()` using formula 'y ~ x'
#> Warning: Removed 6 rows containing non-finite values (stat_smooth).
#> Warning: Removed 6 rows containing missing values (geom_point).

# Base R way
plot(my_acc ~ my_horse   , data = mydata, pch = 20, cex  = 1.5, col="blue", xlab=" car horsepower", ylab = "acceleration secs to 100km/h", main = "Confidence intervals and prediction intervals")
rect(min(mydata$my_horse, na.rm = TRUE), pred_int[2], max(mydata$my_horse, na.rm = TRUE), pred_int[3])
abline(car_linear_regression, lwd = 5,  col="red" )

Created on 2021-11-08 by the reprex package (v2.0.1)
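The code above marks the interval for a single prediction. If bands over the whole predictor range are wanted in ggplot2, one possible sketch is shown below. It assumes ggplot2 is installed and uses mtcars as a stand-in for the Auto data; the data frame name bands and its column names are made up for illustration.

```r
library(ggplot2)

# Stand-in data (assumption): mtcars instead of Auto.
fit  <- lm(qsec ~ hp, data = mtcars)
grid <- data.frame(hp = seq(min(mtcars$hp), max(mtcars$hp), by = 1))

ci   <- predict(fit, newdata = grid, interval = "confidence", level = 0.99)
pred <- predict(fit, newdata = grid, interval = "prediction", level = 0.99)

# Collect the fitted line and both sets of bounds in one data frame.
bands <- data.frame(
  hp     = grid$hp,
  fit    = ci[, "fit"],
  ci_lwr = ci[, "lwr"],   ci_upr = ci[, "upr"],
  pi_lwr = pred[, "lwr"], pi_upr = pred[, "upr"]
)

p <- ggplot(bands, aes(x = hp)) +
  geom_ribbon(aes(ymin = pi_lwr, ymax = pi_upr), fill = "green", alpha = 0.2) +
  geom_ribbon(aes(ymin = ci_lwr, ymax = ci_upr), fill = "red", alpha = 0.3) +
  geom_line(aes(y = fit), colour = "red") +
  geom_point(data = mtcars, aes(x = hp, y = qsec), colour = "blue")
print(p)
```

Each band is its own layer with its own aesthetics, while the raw points come from a different data frame than the bands.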

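【Comments】:

This is too advanced for me and would make things more complicated; I just need to do it with plot.
plot is usually the more complicated option, because it lacks a grammar for describing the layers drawn on top. ggplot works in a more structured and consistent way.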