【Question title】: Make ggplot with regression line and normal distribution overlay
【Posted】: 2020-06-24 09:25:16
【Question】:

I am trying to draw a plot that shows the intuition behind logistic (or probit) regression. How can I make a plot that looks like this in ggplot?

(Wolf & Best, The Sage Handbook of Regression Analysis and Causal Inference, 2015, p. 155)

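Actually, what I would rather do is display a single normal distribution along the y axis, with mean = 0 and a specific variance, so that I can draw horizontal lines from the linear predictor to the y axis and the sideways normal distribution. Something like this: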

This is what the plot is supposed to show (assuming I have not misunderstood something). So far, I have not had much success...

library(ggplot2)

x <- seq(1, 11, 1)
y <- x*0.5

x <- x - mean(x)
y <- y - mean(y)

df <- data.frame(x, y)

# Probability density function of the standard logistic distribution
pdfDeltaFun <- function(x) {
  prob = (exp(x)/(1 + exp(x))^2)
  return(prob)
}

# Tried switching the x and y to be able to turn the 
# distribution overlay 90 degrees with coord_flip()
ggplot(df, aes(x = y, y = x)) + 
  geom_point() + 
  geom_line() + 
  stat_function(fun = pdfDeltaFun)+ 
  coord_flip() 

【Question comments】:

    Tags: r ggplot2 logistic-regression


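    【Solution 1】:

    I think this comes pretty close to the first illustration you gave. If this is something you don't need to repeat many times, it is probably best to compute the density curves before plotting and use a separate data frame to plot them.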

    library(ggplot2)
    
    x <- seq(1, 11, 1)
    y <- x*0.5
    
    x <- x - mean(x)
    y <- y - mean(y)
    
    df <- data.frame(x, y)
    
    # For every row in `df`, compute a rotated normal density centered at `y` and shifted by `x`
    curves <- lapply(seq_len(NROW(df)), function(i) {
      mu <- df$y[i]
      # Evaluate the density over mu +/- 3 standard deviations
      y_seq <- seq(mu - 3, mu + 3, length.out = 100)
      data.frame(
        x = -1 * dnorm(y_seq, mean = mu) + df$x[i],
        y = y_seq,
        grp = i
      )
    })
    # Combine above densities in one data.frame
    curves <- do.call(rbind, curves)
    
    
    ggplot(df, aes(x, y)) +
      geom_point() +
      geom_line() +
      # The path draws the curve
      geom_path(data = curves, aes(group = grp)) +
      # The polygon does the shading. We can use `oob_squish()` to set a range.
      geom_polygon(data = curves, aes(y = scales::oob_squish(y, c(0, Inf)), group = grp))
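
To make explicit what the shaded regions stand for, here is a minimal sketch using base R only. It assumes the shading is meant to be read as P(latent value > 0), and that the errors are standard normal as in the `dnorm()` call above; both are my reading of the figure, not something stated in the question. Under those assumptions the area of each curve above zero should match the probit probability `pnorm(mu)` (the plotted curves are cut off at 3 standard deviations, so the drawn polygons are only approximately this area).

```r
# Assumption: unit-variance normal errors, as in the dnorm() call above.
x <- seq(1, 11, 1)
y <- x * 0.5
mu <- y - mean(y)  # centred linear predictor, one value per point

# Area of each density above zero, by numerical integration
shaded <- vapply(
  mu,
  function(m) integrate(dnorm, lower = 0, upper = Inf, mean = m)$value,
  numeric(1)
)

# Closed form of the same area: P(mu + e > 0) = pnorm(mu)
probit <- pnorm(mu)

round(data.frame(mu, shaded, probit), 4)
```

The two columns should agree up to integration error, and the point with `mu = 0` should have exactly half of its curve shaded.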
    

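    The second illustration is pretty close to your code. I simplified your density function to the standard normal density and added a few extra arguments to `stat_function()`: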

    library(ggplot2)
    
    x <- seq(1, 11, 1)
    y <- x*0.5
    
    x <- x - mean(x)
    y <- y - mean(y)
    
    df <- data.frame(x, y)
    
    ggplot(df, aes(x, y)) +
      geom_point() +
      geom_line() +
      stat_function(fun = dnorm,
                    aes(x = after_stat(-y * 4 - 5), y = after_stat(x)),
                    xlim = range(df$y)) +
      # We fill with a polygon, squishing the y-range
      stat_function(fun = dnorm, geom = "polygon",
                    aes(x = after_stat(-y * 4 - 5), 
                        y = after_stat(scales::oob_squish(x, c(-Inf, -1)))),
                    xlim = range(df$y))
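
The question also asks for horizontal lines running from the linear predictor to the sideways density. One possible way to add them is sketched below. It is not part of the code above: the density is precomputed into a separate data frame instead of using `stat_function()`, and the names `baseline` and `dens` are hypothetical. The baseline at x = -5 and the scale factor 4 simply mirror the `-y * 4 - 5` expression used above.

```r
library(ggplot2)

x <- seq(1, 11, 1)
y <- x * 0.5
df <- data.frame(x = x - mean(x), y = y - mean(y))

# Assumed placement, mirroring the stat_function() call above:
# the density baseline sits at x = -5 and the curve is scaled by 4.
baseline <- -5
dens <- data.frame(y = seq(min(df$y), max(df$y), length.out = 200))
dens$x <- -4 * dnorm(dens$y) + baseline

p <- ggplot(df, aes(x, y)) +
  geom_point() +
  geom_line() +
  # Sideways standard normal density along the left edge
  geom_path(data = dens) +
  # Dashed horizontal guide from each point to the density baseline
  geom_segment(aes(xend = baseline, yend = y), linetype = "dashed")
p
```

`geom_segment()` inherits `x` and `y` from the main `aes()`, so only the end points need to be given.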
    

    【Comments】:

    • Thank you very much for the quick help! I will need some time to understand everything you did, but both plots look very good!