【问题标题】:How to plot vector of bootstrapped slopes in ggplot2?如何在 ggplot2 中绘制自举斜率的矢量?
【发布时间】:2018-10-12 21:48:57
【问题描述】:

我一直在使用 ggplot2 绘制引导各种统计输出(例如相关系数)的结果。最近,我引导了线性回归模型的斜率。下面是使用 graphics 包中的 plot() 函数的样子:

plot(main="Relationship Between Eruption Length at Wait Time at \n 
 Old Faithful With Bootstrapped Regression Lines", 
 xlab = "Eruption Length (minutes)", 
 ylab = "Wait Time (minutes)", 
 waiting ~ eruptions, 
 data = faithful, 
 col = spot_color, 
 pch = 19)

index <- 1:nrow(faithful)
for (i in 1:10000) {
    index_boot <- sample(index, replace = TRUE) #getting a boostrap sample (of indices) 
    faithful_boot <- faithful[index_boot, ]
    # Fitting the linear model to the bootstrapped data:
    fit.boot <- lm(waiting ~ eruptions, data = faithful_boot)
    abline(fit.boot, lwd = 0.1, col = rgb(0, 0.1, 0.25, alpha = 0.05)) # Add line to plot
}
fit <- lm(waiting ~ eruptions, data=faithful)
abline(fit, lwd = 2.5, col = "blue")

这可行,但这取决于我们首先创建绘图,然后在循环中添加线条的工作流程。我宁愿用函数创建一个斜坡列表,然后在 ggplot2 中绘制所有斜坡。

例如,函数可能如下所示:

set.seed(777) # included so the following output is reproducible
n_resample <- 10000 # set the number of times to resample the data

# First argument is the data; second is the number of resampled datasets
bootstrap <- function(df, n_resample) {
    slope_resample <- matrix(NA, nrow = n_resample) # initialize vector 
    index <- 1:nrow(df) # create an index for supplied table

    for (i in 1:n_resample) {
        index_boot <- sample(index, replace = TRUE) # sample row numbers, with replacement
        df_boot <- df[index_boot, ] # create a bootstrap sample from original data
        a <- lm(waiting ~ eruptions, data=df_boot) # compute linear model
        slope_resample[i] <- slope <- a$coefficients[2] # take the slope
    }
    return(slope_resample) # Return a vector of differences of proportion
}

bootstrapped_slopes <- bootstrap(faithful, 10000)

但是如何让geom_line()geom_smooth()bootstrapped_slopes 中获取数据呢?非常感谢任何帮助。

【问题讨论】:

    标签: r ggplot2 statistics


    【解决方案1】:

    编辑:从 OP 更直接的改编

    对于绘图,我假设您需要斜率和截距,所以这里有一个修改后的 bootstrap 函数:

    bootstrap <- function(df, n_resample) {
      # Note 2 dimensions here, for slope and intercept
      slope_resample <- matrix(NA, 2, nrow = n_resample) # initialize vector 
      index <- 1:nrow(df) # create an index for supplied table
      
      for (i in 1:n_resample) {
        index_boot <- sample(index, replace = TRUE) # sample row numbers, with replacement
        df_boot <- df[index_boot, ] # create a bootstrap sample from original data
        a <- lm(waiting ~ eruptions, data=df_boot) # compute linear model
        slope_resample[i, 1] <- slope <- a$coefficients[1] # take the slope
        slope_resample[i, 2] <- intercept <- a$coefficients[2] # take the intercept
      }
      # Return a data frame with all the slopes and intercepts 
      return(as.data.frame(slope_resample))      
    }
    

    然后运行它并从该数据框中绘制线条:

    bootstrapped_slopes <- bootstrap(faithful, 10000)
    
    library(dplyr); library(ggplot2)
    ggplot(faithful, aes(eruptions, waiting)) +
      geom_abline(data = bootstrapped_slopes %>% 
                    sample_n(1000), # 10k lines look about the same as 1k, just darker and slower
                  aes(slope =  V2, intercept = V1), #, group = id), 
                  alpha = 0.01) +
      geom_point(shape = 19, color = "red")
    

    替代解决方案

    这也可以使用modelrbroom 来简化一些引导。根据modelr::bootstrap 的主要帮助示例,我们可以执行以下操作:

    library(purrr); library(modelr); library(broom); library(dplyr)
    set.seed(777) 
    
    # Creates bootstrap object with 10k extracts from faithful
    boot <- modelr::bootstrap(faithful, 10000)
    # Applies the linear regression to each
    models <- map(boot$strap, ~ lm(waiting ~ eruptions, data = .))
    # Extracts the model results into a tidy format
    tidied <- map_df(models, broom::tidy, .id = "id")
    # We just need the slope and intercept here
    tidied_wide <- tidied %>% select(id, term, estimate) %>% spread(term, estimate) 
    
    ggplot(faithful, aes(eruptions, waiting)) +
      geom_abline(data = tidied_wide %>%
         sample_n(1000), # 10k lines look about the same as 1k, just darker and slower
         aes(slope =  eruptions, intercept = `(Intercept)`, group = id), alpha = 0.05) +
      geom_point(shape = 19, color = "red") # spot_color wasn't provided in OP
    

    【讨论】:

    • 谢谢你,乔恩!我没有看到在对 geom_abline() 的调用中包含整个管道的示例。这开辟了很多领域。期待在这些示例的基础上再接再厉。
    猜你喜欢
    • 1970-01-01
    • 2020-03-05
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    相关资源
    最近更新 更多