sklearn LogisticRegression and changing the default threshold for classification: answers

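【Question title】: sklearn LogisticRegression and changing the default threshold for classification
【Posted】: 2015-10-03 17:33:12
【Question description】:

I am using LogisticRegression from the sklearn package and have a quick question about classification. I built a ROC curve for my classifier, and it turns out that the optimal threshold for my training data is around 0.25. I assume the default threshold when making predictions is 0.5. How can I change this default to find out the accuracy of my model when doing 10-fold cross-validation? Basically, I want my model to predict "1" for anyone with a probability greater than 0.25 rather than 0.5. I have been looking through all the documentation and can't seem to find it anywhere.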

【Question comments】:

Tags: python scikit-learn classification regression

【Solution 1】:

I would like to give a practical answer.

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score, precision_score

# Imbalanced binary problem: roughly 90% class 0, 10% class 1
X, y = make_classification(
    n_classes=2, class_sep=1.5, weights=[0.9, 0.1],
    n_features=20, n_samples=1000, random_state=10
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

clf = LogisticRegression(class_weight="balanced")
clf.fit(X_train, y_train)

# Predict 1 whenever the estimated probability of class 1 exceeds THRESHOLD
THRESHOLD = 0.25
preds = np.where(clf.predict_proba(X_test)[:, 1] > THRESHOLD, 1, 0)

# roc_auc_score is computed on the hard labels here, so it changes with THRESHOLD
pd.DataFrame(data=[accuracy_score(y_test, preds), recall_score(y_test, preds),
                   precision_score(y_test, preds), roc_auc_score(y_test, preds)],
             index=["accuracy", "recall", "precision", "roc_auc_score"])

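Changing THRESHOLD to 0.25 shifts the recall and precision scores: lowering the threshold can only keep or raise recall, usually at the expense of precision. Removing the class_weight argument, on the other hand, increases accuracy but lowers recall. See the accepted answer.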
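Since the question asks about 10-fold cross-validation, here is a minimal sketch of one way to combine it with a custom cut-off: collect out-of-fold probabilities with cross_val_predict and threshold them yourself. The data set, the StratifiedKFold settings and max_iter=1000 are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

THRESHOLD = 0.25  # cut-off taken from the question

# Synthetic imbalanced data, same generator settings as the answer above
X, y = make_classification(
    n_classes=2, class_sep=1.5, weights=[0.9, 0.1],
    n_features=20, n_samples=1000, random_state=10
)

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Out-of-fold probabilities: each row is scored by a model that never saw it
proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]

preds_custom = (proba > THRESHOLD).astype(int)
preds_default = (proba > 0.5).astype(int)

print("accuracy at 0.25:", accuracy_score(y, preds_custom))
print("accuracy at 0.50:", accuracy_score(y, preds_default))
```

Note that this pools the out-of-fold predictions before scoring, which can differ slightly from averaging the accuracy of each fold.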

【Comments】:

When I try this I get the error "name 'np' is not defined". What is np?
np is numpy: import numpy as np

【Solution 2】:

That is not a built-in feature. You can "add" it by wrapping the LogisticRegression class in your own class and adding a threshold attribute that you use inside a custom predict() method.

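However, some caveats:

1. The default threshold is actually 0. LogisticRegression.decision_function() returns a signed score proportional to the distance from the selected separating hyperplane. If you are looking at predict_proba(), you are looking at the logistic sigmoid (the inverse of logit()) of that score, with a threshold of 0.5. But that is more expensive to compute.
2. By choosing the "optimal" threshold like this, you are using information obtained after learning, which spoils your test set (i.e., your test or validation set no longer gives an unbiased estimate of out-of-sample error). You may therefore introduce additional overfitting unless you choose the threshold inside a cross-validation loop on your training set only, and then use it, together with the trained classifier, on your test set.
3. If you have an imbalanced problem, consider using class_weight rather than setting the threshold manually. This should force the classifier to choose a hyperplane farther away from the class of serious interest.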
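A minimal sketch of the wrapping idea, using composition rather than inheritance. ThresholdClassifier is a hypothetical name and not part of scikit-learn, and the synthetic data is only there to make the example runnable. The last lines also check the first caveat: thresholding decision_function() at 0 and predict_proba() at 0.5 should select the same samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


class ThresholdClassifier:
    """Hypothetical wrapper: binary classifier with an adjustable cut-off."""

    def __init__(self, model, threshold=0.5):
        self.model = model
        self.threshold = threshold

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def predict_proba(self, X):
        return self.model.predict_proba(X)

    def predict(self, X):
        # Column 1 holds the probability of the second class in model.classes_
        positive = self.predict_proba(X)[:, 1] > self.threshold
        return self.model.classes_[positive.astype(int)]


X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=10)
base = LogisticRegression(max_iter=1000).fit(X, y)

default_like = ThresholdClassifier(base, threshold=0.5)
lowered = ThresholdClassifier(base, threshold=0.25)

print("positives at 0.50:", default_like.predict(X).sum())
print("positives at 0.25:", lowered.predict(X).sum())

# Caveat 1: a score threshold of 0 corresponds to a probability threshold of 0.5
same = np.array_equal(base.decision_function(X) > 0,
                      base.predict_proba(X)[:, 1] > 0.5)
print("decision_function > 0 matches predict_proba > 0.5:", same)
```

Recent scikit-learn releases (1.5 and later) also ship FixedThresholdClassifier in sklearn.model_selection, which may cover the same need without a hand-written wrapper.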

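【Comments】:

I have a similar problem, where my false negatives and true negatives are both very low. Is it possible to skew the z input of the logit function (sigmoid function) through a parameter, so that the probability is 0.5 when "z = 2" rather than when "z = 0"? Thanks.
So is there still no way to change the decision threshold?

【Solution 3】: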

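You can change the threshold, but it is 0.5 so that the calculations are correct. If you have an imbalanced set, the classification looks like the figure below.

You can see that class 1 is predicted very poorly. Class 1 accounts for 2% of the population. After balancing the outcome variable to 50%/50% (using oversampling), the 0.5 threshold moves to the center of the chart.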
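A rough sketch of the oversampling idea, assuming synthetic data with about 2% positives and using sklearn.utils.resample to duplicate minority rows. The data and variable names are illustrative only. In practice you would oversample the training split only, never the data you evaluate on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Synthetic data with roughly 2% of samples in class 1
X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)

X_maj, X_min = X[y == 0], X[y == 1]

# Oversample the minority class (with replacement) up to the majority size
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([np.zeros(len(X_maj), dtype=int),
                        np.ones(len(X_min_up), dtype=int)])

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

# Both models use the default 0.5 cut-off; only the training distribution differs
print("positives predicted, original fit:   ", plain.predict(X).sum())
print("positives predicted, oversampled fit:", balanced.predict(X).sum())
```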

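【Comments】:

【Solution 4】:

For the sake of completeness, I would like to mention another way to elegantly generate predictions from scikit's probability computations, using binarize: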

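import numpy as np
from sklearn.preprocessing import binarize

THRESHOLD = 0.25

# These probabilities would come from logistic_regression.predict_proba()[:, 1]
y_logistic_prob = np.random.uniform(size=10)

# binarize expects a 2-D array and maps values strictly greater than the threshold to 1;
# current scikit-learn releases require threshold to be passed by keyword
predictions = binarize(y_logistic_prob.reshape(-1, 1), threshold=THRESHOLD).ravel()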

Also, I agree with the considerations that Andreus makes, especially 2 and 3. Be sure to keep an eye on them.

【Comments】:

【Solution 5】:

import numpy as np

def find_best_threshold(thresholds, fpr, tpr):
    # tpr * (1 - fpr) is largest when fpr is very low and tpr is very high
    t = thresholds[np.argmax(tpr * (1 - fpr))]
    print("the maximum value of tpr*(1-fpr)", max(tpr * (1 - fpr)), "for threshold", np.round(t, 3))
    return t

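You can use this function if you want to find the threshold that gives the best combination of true positive rate and true negative rate (1 - fpr).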
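A short end-to-end sketch of how such a function might be used. The synthetic data, the split and max_iter=1000 are assumptions for illustration. The threshold is picked on the training scores only, in line with the caveat in Solution 2 about not tuning on the test set, and predictions use >= because roc_curve counts a sample as positive when its score is at least the threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split


def find_best_threshold(thresholds, fpr, tpr):
    # Pick the threshold that maximises tpr * (1 - fpr)
    return thresholds[np.argmax(tpr * (1 - fpr))]


X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=10)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Choose the threshold from training scores only
train_scores = clf.predict_proba(X_train)[:, 1]
fpr, tpr, thresholds = roc_curve(y_train, train_scores)
best = find_best_threshold(thresholds, fpr, tpr)

# Apply the chosen threshold to held-out data
preds = (clf.predict_proba(X_val)[:, 1] >= best).astype(int)
print("chosen threshold:", np.round(best, 3))
print("positives predicted on held-out data:", preds.sum())
```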

【Comments】:

To use it, first call fpr, tpr, threshold = sklearn.metrics.roc_curve(y_true, y_scores), then call find_best_threshold(threshold, fpr, tpr).

【Solution 6】:

OK, as far as my algorithm goes:

threshold = 0.1
# model, Xtest and ytest are the poster's fitted classifier and test data
LR_Grid_ytest_THR = (model.predict_proba(Xtest)[:, 1] >= threshold).astype(int)

and:

from sklearn.metrics import classification_report

print('Evaluation for test data only:')
print(classification_report(ytest, model.predict(Xtest)))
print("----------------------------------------------------------------------")
print('Evaluation for test data only (new_threshold):')
print(classification_report(ytest, LR_Grid_ytest_THR))

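【Comments】:

You should combine this answer with your other answer. On its own it doesn't make much sense!

【Solution 7】:

Special case: one-dimensional logistic regression

The value separating the region where a sample X is labeled 1 from the region where it is labeled 0 is computed with the following formula: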

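from scipy.special import logit
thresh = 0.1
# Solve coef * x + intercept = logit(thresh) for x
val = (logit(thresh) - clf.intercept_) / clf.coef_[0]

Thus, the predictions can be computed more directly (for a positive coefficient; the inequality flips if clf.coef_ is negative):

import numpy as np
preds = np.where(X > val, 1, 0)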
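As a sanity check, here is a small sketch on made-up one-dimensional data (the data-generating rule is an assumption for illustration). It compares the direct rule against thresholding predict_proba(). The two should agree except for samples that fall exactly on the boundary.

```python
import numpy as np
from scipy.special import logit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical 1-D data: larger x makes class 1 more likely
X = rng.normal(size=(500, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0.5).astype(int)

clf = LogisticRegression().fit(X, y)

thresh = 0.1
w = clf.coef_[0, 0]      # single slope
b = clf.intercept_[0]    # intercept

# sigmoid(w * x + b) > thresh  <=>  w * x + b > logit(thresh)
val = (logit(thresh) - b) / w

proba_preds = (clf.predict_proba(X)[:, 1] > thresh).astype(int)
if w > 0:
    direct_preds = (X[:, 0] > val).astype(int)
else:
    # Dividing by a negative slope flips the inequality
    direct_preds = (X[:, 0] < val).astype(int)

print("separating value:", val)
print("rules agree:", np.array_equal(proba_preds, direct_preds))
```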

【Comments】: