如何解释逻辑回归的系数答案

【问题标题】：How to interpret coefficients of logistic regression如何解释逻辑回归的系数
【发布时间】：2021-08-03 13:43:56
【问题描述】：

我正在尝试找出多项式项的逻辑回归系数与预测之间的关系。具体来说，我对 x 轴上预测最高的位置感兴趣。示例如下：

set.seed(42)

# Setup some dummy data
x <- 1:200
y <- rep(0, length(x))
y[51:150] <- rbinom(100, 1, 0.5)

# Fit a model
family <- binomial()
model  <- glm(y ~ poly(x, 2), family = family)

# Illustrate model
plot(x, y)
lines(x, family$linkinv(predict(model)), col = 2)

上面的模型给了我这些系数：

coef(model)
#> (Intercept) poly(x, 2)1 poly(x, 2)2 
#>   -1.990317   -3.867855  -33.299893

^{由reprex package (v1.0.0) 于 2021 年 8 月 3 日创建}

poly() 的手册页声明如下：

正交多项式由系数汇总，可用于通过 Kennedy & Gentle (1980, pp. 343–4) 中给出的三项递归对其进行评估，并用于代码的预测部分。

但是，我无法访问这本书，也无法从predict.glm S3 方法中辨别出这些系数是如何处理的。有没有办法仅从系数（即不使用predict() 来找到最大值）重建峰顶的位置（示例中约为 100）？

【问题讨论】：

我确实觉得这个问题很有趣。但是这怎么没有迁移到 Stackexchange 呢？我只是好奇
我不确定。我能问一下为什么您认为这个问题可能更适合其他网站吗？
嗯 - 标题和第一部分与逻辑回归本身有关，但最后一部分是“有没有办法从单独的系数（即不使用 predict() 来找到最大值）”与编码相关 - 所以我猜它是一个灰色区域，人们比我更有经验来判断它是否适合 SO。显然是这样。
我真的只是好奇——这绝不是被动的攻击性评论！
哦，如果我造成了混乱，我深表歉意，我并不想让人觉得生气或任何事情！我在这里发布它的原因是因为许多 R 用户具有数据分析或统计方面的背景，因此可能能够帮助我处理逻辑回归位。我一直在寻找 R 中的解决方案，因为我不想去例如matlab 或 python 来解决我知道应该可以在我的项目其余部分所在的 R 中解决的问题。

标签： r logistic-regression glm

【解决方案1】：

从正交多项式的理论表达式推导预测最大值的位置

我在poly 的文档中引用了 Kennedy 和 Gentle (1982) 的“统计计算”一书，现在分享我关于正交多项式计算的发现，以及我们如何使用它们来找到最大预测值的位置 x。

书中介绍的正交多项式（第 343-4 页）是一元的（即最高阶系数始终为 1），并通过以下递归过程获得：

其中 q 是所考虑的正交多项式的数量。

请注意以下上述术语的关系与poly的文档：

您问题的摘录中出现的 “三项递归” 是第三个表达式的 RHS，它正好包含三个项。
第三个表达式中的 rho(j+1) 系数称为“居中常数”。
第三个表达式中的 gamma(j) 系数在文档中没有名称，但它们与 “归一化常数” 直接相关，如下所示.

作为参考，这里我贴上poly文档的“价值”部分的相关摘录：

一个矩阵，其行对应于 x 中的点，列对应于度数，属性“degree”指定列的度数，（除非 raw = TRUE）“coefs”包含用于构造的居中和归一化常数正交多项式

回到递归，我们可以从第三个表达式，通过对 p(j+1) w.r.t 施加正交性条件。 p(j) 和 p(j-1)。
（需要注意的是，正交性条件不是积分，而是n个观察到的x个点的总和，因此多项式系数取决于在数据上！--例如，对于 Tchebyshev 正交多项式，情况并非如此）。

参数表达式变为：

对于回归中使用的 1 阶和 2 阶多项式，我们得到以下表达式，这些表达式已经用 R 代码编写：

# First we define the number of observations in the data
n = length(x)

# For p1(x):
# p1(x) = (x - rho1) p0(x)      (since p_{-1}(x) = 0)
rho1 = mean(x)

# For p2(x)
# p2(x) = (x - rho2) p1(x) - gamma1
gamma1 = var(x) * (n-1)/n
rho2 = sum( x * (x - mean(x))^2 ) / (n*gamma1)

我们得到的：

> c(rho1, rho2, gamma1)
[1]  100.50  100.50 3333.25

注意poly(x,2)的coefs属性是：

> attr(poly(x,2), "coefs")
$alpha
[1] 100.5 100.5

$norm2
[1]          1        200     666650 1777555560

其中$alpha 包含 居中常数，即 rho 值（与我们的一致——顺便说一句，当x 的分布是对称的 对于任何 q!（观察和证明），并且$norm2 包含 归一化常数（在这种情况下为p(-1,x)、p(0,x)、p(1,x)和p(2 ,x))，也就是将递归公式中的多项式归一化的常数c(j)——通过将它们除以sqrt(c(j)) em>--，使生成的多项式 r(j,x) 满足 sum_over_i{ r(j,x_i)^2 } = 1; 注意r(j,x)是存储在poly()返回的对象中的多项式。

从上面已经给出的表达式，我们观察到 gamma(j) 恰好是两个连续归一化常数的比率，即： gamma(j) = c(j) / c (j-1).

我们可以通过计算检查我们的gamma1值是否与这个比率一致：

gamma1 == attr(poly(x,2), "coefs")$norm2[3] / attr(poly(x,2), "coefs")$norm2[2]

返回TRUE。

回到找到模型预测值的最大值的问题，我们可以：

将预测值表示为 r(1,x) 和 r(2,x) 以及逻辑回归系数的函数，即：

pred(x) = beta0 + beta1 * r(1,x) + beta2 * r(2,x)
导出表达式 w.r.t. x，将其设置为 0 并求解 x。

在 R 代码中：

# Get the normalization constants alpha(j) to obtain r(j,x) from p(j,x) as
# r(j,x) = p(j,x) / sqrt( norm(j) ) = p(j,x) / alpha(j)
alpha1 = sqrt( attr(poly(x,2), "coefs")$norm2[3] )
alpha2 = sqrt( attr(poly(x,2), "coefs")$norm2[4] )

# Get the logistic regression coefficients (beta1 and beta2)
coef1 = as.numeric( model$coeff["poly(x, 2)1"] )
coef2 = as.numeric( model$coeff["poly(x, 2)2"] )

# Compute the x at which the maximum occurs from the expression that is obtained
# by deriving the predicted expression pred(x) = beta0 + beta1*r(1,x) + beta2*r(2,x)
# w.r.t. x and setting the derivative to 0.
xmax = ( alpha2^-1 * coef2 * (rho1 + rho2) - alpha1^-1 * coef1 ) / (2 * alpha2^-1 * coef2)

给出：

> xmax
[1] 97.501114

即与我的previous answer 中描述的其他“经验”方法获得的值相同。

从您提供的代码开始，获取预测值最大值的位置 x 的完整代码是：

# First we define the number of observations in the data
n = length(x)

# Parameters for p1(x):
# p1(x) = (x - rho1) p0(x)      (since p_{-1}(x) = 0)
rho1 = mean(x)

# Parameters for p2(x)
# p2(x) = (x - rho2) p1(x) - gamma1
gamma1 = var(x) * (n-1)/n
rho2 = mean( x * (x - mean(x))^2 ) / gamma1

# Get the normalization constants alpha(j) to obtain r(j,x) from p(j,x) as
# r(j,x) = p(j,x) / sqrt( norm(j) ) = p(j,x) / alpha(j)
alpha1 = sqrt( attr(poly(x,2), "coefs")$norm2[3] )
alpha2 = sqrt( attr(poly(x,2), "coefs")$norm2[4] )

# Get the logistic regression coefficients (beta1 and beta2)
coef1 = as.numeric( model$coeff["poly(x, 2)1"] )
coef2 = as.numeric( model$coeff["poly(x, 2)2"] )

# Compute the x at which the maximum occurs from the expression that is obtained
# by deriving the predicted expression pred(x) = beta0 + beta1*r(1,x) + beta2*r(2,x)
# w.r.t. x and setting the derivative to 0.
( xmax = ( alpha2^-1 * coef2 * (rho1 + rho2) - alpha1^-1 * coef1 ) / (2 * alpha2^-1 * coef2) )

【讨论】：

好东西，感谢您努力寻找这本书并弄清楚这与我的问题有何关系！我已经悬赏你的答案，但我必须等待一天才能奖励赏金。
太好了，我很高兴您发现这个答案很有用，感谢您的反馈和请求赏金！ :-)

【解决方案2】：

假设您想分析地找到正交多项式为 1 阶和 2 阶的特殊情况下的最大预测值，我建议采用以下方法：

总结

1) 推断多项式系数

这可以通过将线性模型拟合到模型矩阵中包含的相应多项式值来轻松完成。

2) 导出预测表达式 w.r.t. x 并将导数设置为 0

求解从 (1) 中的多项式拟合推断出的预测表达式中的x，得到预测最大值出现的x 的值。

详情

1) 多项式系数

根据您拟合 GLM 模型的行，我们估计 1 阶多项式 p1(x) = a0 + a1*x 的系数和 2 阶多项式 p2(x) = b0 + b1*x + b2*x^2 的系数：

X = model.matrix(model)
p1 = X[, "poly(x, 2)1"]
p2 = X[, "poly(x, 2)2"]

p1.lm = lm(p1 ~ x)
a0 = p1.lm$coeff["(Intercept)"]
a1 = p1.lm$coeff["x"]

p2.lm = lm(p2 ~ x + I(x^2))
b0 = p2.lm$coeff["(Intercept)"]
b1 = p2.lm$coeff["x"]
b2 = p2.lm$coeff["I(x^2)"]

这给出了：

> c(a0, a1, b0, b1, b2)
   (Intercept)              x    (Intercept)              x         I(x^2) 
-1.2308840e-01  1.2247602e-03  1.6050353e-01 -4.7674315e-03  2.3718565e-05

2) 预测的导数找到最大值

预测表达式z（在反向链接函数之前）为：

z = Intercept + coef1 * p1(x) + coef2 * p2(x)

我们推导出这个表达式并将其设置为0以获得：

coef1 * a1 + coef2 * (b1 + 2 * b2 * xmax) = 0

求解xmax 我们得到：

xmax = - (coef1 * a1 + coef2 * b1) / (2 * coef2 * b2)

在 R 代码中，计算如下：

coef1 = as.numeric( model$coeff["poly(x, 2)1"] )
coef2 = as.numeric( model$coeff["poly(x, 2)2"] )
(xmax = - ( coef1 * a1 + coef2 * b1 ) / (2 * coef2 * b2))

给出：

        x 
97.501114

检查

我们可以通过将其作为绿色十字添加到预测曲线来验证最大值：

# Prediction curve computed analytically
Intercept = model$coeff["(Intercept)"]
pred.analytical = family$linkinv( Intercept + coef1 * p1 + coef2 * p2 )

# Find the prediction's maximum analytically
pred.max = family$linkinv( Intercept + coef1 * (a0 + a1 * xmax) +
                                       coef2 * (b0 + b1 * xmax + b2 * xmax^2) )


# Plot
plot(x, y)
# The following two lines should coincide!
lines(x, pred.analytical, col = 3)
lines(x, family$linkinv(predict(model)), col = 2)
# Location of the maximum!
points(xmax, pred.max, pch="x", col="green")

给出：

【讨论】：