Deciding threshold for glm logistic regression model in R: answers

[Question title]: Deciding threshold for glm logistic regression model in R
[Posted]: 2014-06-08 01:02:16
[Question]:

I have some data with predictors and a binary target. For example:

df <- data.frame(a=sort(sample(1:100,30)), b= sort(sample(1:100,30)), 
                 target=c(rep(0,11),rep(1,4),rep(0,4),rep(1,11)))

I trained a logistic regression model using glm():

model1 <- glm(formula= target ~ a + b, data=df, family=binomial)

Now I am trying to predict the output (for the sake of example, the same data will do):

predict(model1, newdata=df, type="response")

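This produces a vector of probabilities, but I want to predict the actual class. I could use round() on the probabilities, but that assumes anything below 0.5 is class '0' and anything above 0.5 is class '1'. Is this a correct assumption, even when the two classes may not be equally (or nearly equally) populated? Or is there a way to estimate this threshold?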
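For reference, applying an explicit cutoff to the predicted probabilities is a one-liner. The sketch below reuses df and model1 from the question, with a seed added only for reproducibility. The helper name classify_at and the alternative cutoff of 0.3 are illustrative assumptions, not a recommendation.

```r
# Recreate the question's data; set.seed() is added here only for reproducibility.
set.seed(1)
df <- data.frame(a = sort(sample(1:100, 30)), b = sort(sample(1:100, 30)),
                 target = c(rep(0, 11), rep(1, 4), rep(0, 4), rep(1, 11)))
model1 <- glm(target ~ a + b, data = df, family = binomial)
probs <- predict(model1, newdata = df, type = "response")

# Hypothetical helper: predict class 1 when the probability exceeds the cutoff.
classify_at <- function(p, cutoff = 0.5) as.integer(p > cutoff)

pred_default <- classify_at(probs)        # matches round(probs) for probabilities in [0, 1]
pred_lower   <- classify_at(probs, 0.3)   # a lower cutoff predicts class 1 at least as often

# Cross-tabulate observed classes against the predictions at the default cutoff.
table(observed = df$target, predicted = pred_default)
```

Changing the cutoff only changes how the same probabilities are turned into classes; the fitted model itself is untouched.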

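[Comments]:

There are different criteria, for example the point where the sum of sensitivity and specificity is largest; see this question: stackoverflow.com/questions/23131897/…
@adibender Thanks! But it would certainly be incorrect to use the population proportion as the threshold, right? That is, if 30% of the cases in the population are '0' and 70% are '1', a naive estimate would be to use 0.3 as the threshold. But that would not be a logical way to approach this, would it?
You can find a great tutorial on this topic here: hopstat.wordpress.com/2014/12/19/…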

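Tags: r glm predict logistic-regression

[Answer 1]:

To get the threshold in your data at which sensitivity and specificity are closest (i.e. the crossover point of the two curves when plotted against the cutoff), you can use the following code, which gets pretty close: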

library(ROCR)
# PREDS: predicted probabilities; LABELS: true 0/1 classes
predictions = prediction(PREDS, LABELS)

sens = cbind(unlist(performance(predictions, "sens")@x.values), unlist(performance(predictions, "sens")@y.values))
spec = cbind(unlist(performance(predictions, "spec")@x.values), unlist(performance(predictions, "spec")@y.values))
# Cutoff whose (cutoff, sensitivity) point lies closest to the specificity curve
sens[which.min(apply(sens, 1, function(x) min(colSums(abs(t(spec) - x))))), 1]
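
For readers without ROCR installed, the same idea can be sketched in base R: evaluate sensitivity and specificity at each candidate cutoff and keep the cutoff where the two are closest. The function name sens_spec_table, the simulated scores, and the choice of the observed scores as candidate cutoffs are assumptions made for this illustration. It is not the ROCR implementation.

```r
# Sensitivity and specificity at each candidate cutoff (predict 1 when score >= cutoff).
sens_spec_table <- function(scores, labels, cutoffs = sort(unique(scores))) {
  res <- t(sapply(cutoffs, function(ct) {
    pred <- scores >= ct
    c(cutoff = ct,
      sens = sum(pred & labels == 1) / sum(labels == 1),
      spec = sum(!pred & labels == 0) / sum(labels == 0))
  }))
  as.data.frame(res)
}

# Simulated example data (assumption): noisy scores that tend to be higher for class 1.
set.seed(42)
labels <- rep(c(0, 1), times = c(60, 40))
scores <- plogis(rnorm(100, mean = ifelse(labels == 1, 1, -1)))

tab <- sens_spec_table(scores, labels)

# Cutoff where |sensitivity - specificity| is smallest, i.e. where the curves cross.
crossing <- tab[which.min(abs(tab$sens - tab$spec)), ]
crossing
```

Because sensitivity can only fall and specificity can only rise as the cutoff increases, the two curves cross at most once, so the minimum of their absolute difference locates the crossover.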

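[Comments]:

[Answer 2]:

The function PresenceAbsence::optimal.thresholds in the PresenceAbsence package implements 12 methods.

They are also covered in Freeman, E. A. and Moisen, G. G. (2008). A comparison of the performance of threshold criteria for binary classification in terms of predicted prevalence and kappa. Ecological Modelling, 217(1-2), 48-58.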
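
As a rough illustration of how that function is typically called, the sketch below builds the input layout it expects and requests three of the criteria by name. It assumes the PresenceAbsence package is installed. The simulated data and the column names id, observed and model1 are made up for this example, and the package documentation remains the authority on the argument details.

```r
# Assumes the PresenceAbsence package is installed: install.packages("PresenceAbsence")
library(PresenceAbsence)

# Simulated example data (assumption): 0/1 outcomes and noisy predicted probabilities.
set.seed(42)
observed  <- rep(c(0, 1), times = c(60, 40))
predicted <- plogis(rnorm(100, mean = ifelse(observed == 1, 1, -1)))

# Expected layout: column 1 = ID, column 2 = observed 0/1,
# column 3 onward = predicted probabilities (one column per model).
DATA <- data.frame(id = seq_along(observed), observed = observed, model1 = predicted)

# Request three of the available criteria by name.
thr <- optimal.thresholds(DATA, opt.methods = c("Sens=Spec", "MaxSens+Spec", "MaxPCC"))
thr
```

The result has one row per requested criterion and one threshold column per model, which makes it easy to see how much the choice of criterion moves the cutoff.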

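[Comments]:

[Answer 3]:

I tinkered with replicating the first graph (sensitivity and specificity against the cutoff). Given an object predictions <- prediction(pred, labels), then:

Base R approach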

library(ROCR)

# Sensitivity against the cutoff (left axis)
plot(unlist(performance(predictions, "sens")@x.values), unlist(performance(predictions, "sens")@y.values), 
     type="l", lwd=2, ylab="Sensitivity", xlab="Cutoff")
par(new=TRUE)
# Specificity on the same panel, in red (right axis)
plot(unlist(performance(predictions, "spec")@x.values), unlist(performance(predictions, "spec")@y.values), 
     type="l", lwd=2, col='red', ylab="", xlab="", axes=FALSE)
axis(4, at=seq(0,1,0.2), labels=seq(0,1,0.2))
mtext("Specificity",side=4, padj=-2, col='red')

ggplot2 approach

library(ROCR)
library(ggplot2)

sens <- data.frame(x=unlist(performance(predictions, "sens")@x.values), 
                   y=unlist(performance(predictions, "sens")@y.values))
spec <- data.frame(x=unlist(performance(predictions, "spec")@x.values), 
                   y=unlist(performance(predictions, "spec")@y.values))

# Sensitivity in black, specificity in red (colour set outside aes() so it is a literal colour)
ggplot(sens, aes(x, y)) + 
  geom_line() + 
  geom_line(data=spec, aes(x, y), colour="red") +
  scale_y_continuous(sec.axis = sec_axis(~., name = "Specificity")) +
  labs(x='Cutoff', y="Sensitivity") +
  theme(axis.title.y.right = element_text(colour = "red"), legend.position="none")

[Comments]:

[Answer 4]:

You can try the following:

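library(ROCR)

# pred is a ROCR prediction object, i.e. the result of prediction()
perfspec <- performance(prediction.obj = pred, measure="spec", x.measure="cutoff")

plot(perfspec)

perfsens <- performance(prediction.obj = pred, measure="sens", x.measure="cutoff")

# add=TRUE overlays the sensitivity curve on the existing plot
plot(perfsens, add=TRUE, col="red")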

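[Comments]:

[Answer 5]:

The best threshold (or cutoff) point to use in a glm model is the point that jointly maximizes specificity and sensitivity, for example where the two curves cross. This threshold may not give the highest prediction accuracy in your model, but it will not be biased toward positives or negatives. The ROCR package contains functions that can help you do this; check the performance() function in that package. It will get you what you are looking for. Here is the kind of picture you should expect to get (sensitivity and specificity plotted against the cutoff):

Once I have found the cutoff, I usually write a function myself that counts the data points whose predicted value is above the cutoff and matches them against the group they actually belong to.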
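
One common reading of "maximizes specificity and sensitivity" is to maximize their sum (Youden's J statistic), and the counting function described above amounts to a confusion table. Here is a minimal base R sketch of both on simulated data. The names youden_cutoff and confusion_at are hypothetical, and this is not the ROCR code behind the original figure.

```r
# Cutoff maximizing sensitivity + specificity (Youden's J = sens + spec - 1).
youden_cutoff <- function(scores, labels) {
  cutoffs <- sort(unique(scores))
  j <- sapply(cutoffs, function(ct) {
    pred <- scores >= ct
    mean(pred[labels == 1]) + mean(!pred[labels == 0]) - 1
  })
  cutoffs[which.max(j)]
}

# Cross-tabulate observed classes against predictions at a given cutoff.
confusion_at <- function(scores, labels, cutoff) {
  table(observed  = factor(labels, levels = c(0, 1)),
        predicted = factor(as.integer(scores >= cutoff), levels = c(0, 1)))
}

# Simulated example data (assumption): 70 negatives, 30 positives.
set.seed(7)
labels <- rep(c(0, 1), times = c(70, 30))
scores <- plogis(rnorm(100, mean = ifelse(labels == 1, 1.5, -1.5)))

best <- youden_cutoff(scores, labels)
cm <- confusion_at(scores, labels, best)
cm
```

The diagonal of cm holds the correctly classified cases, and the off-diagonal cells show whether the chosen cutoff errs more toward false positives or false negatives.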

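[Comments]:

Could you provide more specific code for generating the graph above? Also, how can the cutoff range between 0 and 14 for probabilities that take values between 0 and 1?
I added base R and ggplot2 approaches in a separate answer!

[Answer 6]:

The gold standard for determining good model parameters (including "what threshold should I set" for logistic regression) is cross-validation.

The general idea is to hold out one or more parts of your training set and choose the threshold that maximizes the number of correct classifications on that held-out set, but Wikipedia can give you more details.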
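
The hold-out idea can be sketched with k-fold cross-validation in base R: fit the model on the training folds, collect the out-of-fold probabilities, and choose the cutoff with the highest accuracy on those held-out predictions. The simulated data, the 5 folds, and the cutoff grid are all assumptions of this sketch.

```r
# Simulated example data (assumption): two predictors and a binary outcome.
set.seed(123)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-1 + 1.5 * dat$x1 - dat$x2))

k <- 5
fold <- sample(rep(seq_len(k), length.out = n))   # random fold assignment
oof <- numeric(n)                                 # out-of-fold probabilities

for (i in seq_len(k)) {
  held_out <- fold == i
  # Fit on the other k - 1 folds, predict on the held-out fold.
  fit <- glm(y ~ x1 + x2, data = dat[!held_out, ], family = binomial)
  oof[held_out] <- predict(fit, newdata = dat[held_out, ], type = "response")
}

# Accuracy of the held-out predictions at each candidate cutoff.
cutoffs <- seq(0.05, 0.95, by = 0.05)
accuracy <- sapply(cutoffs, function(ct) mean((oof > ct) == (dat$y == 1)))

best_cutoff <- cutoffs[which.max(accuracy)]
best_cutoff
```

Accuracy is used here because the answer speaks of correct classifications; any other criterion can be swapped into the sapply() call. The accuracy at best_cutoff is optimistic because the cutoff was tuned on the same held-out predictions.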

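[Comments]:

Since we would be tuning the threshold parameter on the cross-validation data, presumably this would require a third held-out set for evaluation in order to report an unbiased expected error?
@user2175594, yes, that is correct. Traditionally you would have at least three separate partitions of the data: training, validation, and test (evaluation). However, if you are doing something like k-fold cross-validation, then training and validation are essentially the same set re-partitioned in multiple ways.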