【Question title】:Which is the correct way to calculate AUC with scikit-learn?
【Posted】:2021-05-29 12:46:17
【Question】:

I noticed that the following two pieces of code give different results.

#1
metrics.plot_roc_curve(classifier, X_test, y_test, ax=plt.gca())


#2
metrics.plot_roc_curve(classifier, X_test, y_test, ax=plt.gca(), label=clsname + ' (AUC = %.2f)' % roc_auc_score(y_test, y_predicted))

So which approach is correct?

I have added a simple reproducible example:

from sklearn.metrics import roc_auc_score
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=12)

svclassifier = SVC(kernel='rbf')
svclassifier.fit(X_train, y_train)
y_predicted = svclassifier.predict(X_test)

print('AUC = %.2f' % roc_auc_score(y_test, y_predicted))  #1

metrics.plot_roc_curve(svclassifier, X_test, y_test, ax=plt.gca())  #2
plt.show()

Output (#1):

AUC = 0.86

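While (#2) (plot image not included):

【Comments】:

  • @Mr.T I hadn't seen that. Should I delete my question?
  • What is the difference between #1 and #2? You are only adding a label in #2; see the plot_roc_curve documentation and the label entry under **kwargs in matplotlib.pyplot.
  • @Shijith I added roc_auc_score manually as the label, instead of the automatic legend, to show the difference. Could you elaborate?

Tags: python scikit-learn classification metrics auc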


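【Solution 1】:

The difference is most likely that sklearn does not use hard class predictions internally: it takes continuous scores, such as the per-class probabilities from predict_proba() (or decision_function() when the estimator exposes no probabilities), and computes the AUC from those.

For example, when you use classifier.predict():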

import matplotlib.pyplot as plt
from sklearn import datasets, metrics, model_selection, svm
X, y = datasets.make_classification(random_state=0)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, random_state=0)
clf = svm.SVC(random_state=0,probability=False)
clf.fit(X_train, y_train)
clf.predict(X_test)

>> array([1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0,
       1, 0, 0])

# calculate auc
metrics.roc_auc_score(y_test, clf.predict(X_test))

>> 0.8782051282051283  # ~0.88

If you use classifier.predict_proba() instead:

X, y = datasets.make_classification(random_state=0)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, random_state=0)
# set probability=True
clf = svm.SVC(random_state=0,probability=True)
clf.fit(X_train, y_train)
clf.predict_proba(X_test)

>> array([[0.13625954, 0.86374046],
       [0.90517034, 0.09482966],
       [0.19754525, 0.80245475],
       [0.96741274, 0.03258726],
       [0.80850602, 0.19149398],
       ......................,
       [0.31927198, 0.68072802],
       [0.8454472 , 0.1545528 ],
       [0.75919018, 0.24080982]])

# calculate auc
# when computing the ROC AUC, estimator.classes_[1] is treated as the
# positive class by default, hence 'clf.predict_proba(X_test)[:, 1]'

metrics.roc_auc_score(y_test, clf.predict_proba(X_test)[:,1])
>> 0.9102564102564102

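So, for your question: metrics.plot_roc_curve(classifier, X_test, y_test, ax=plt.gca()) probably computes the AUC from continuous scores (predict_proba() by default), whereas with metrics.plot_roc_curve(classifier, X_test, y_test, ax=plt.gca(), label=clsname + ' (AUC = %.2f)' % roc_auc_score(y_test, y_predicted)) you compute roc_auc_score yourself from the hard predictions in y_predicted and pass that score as the label.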
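To check this on the data from the question, here is a minimal sketch. It assumes that, because SVC defaults to probability=False and so has no predict_proba(), the plotting helper falls back to decision_function(); the variable names auc_from_labels, auc_from_scores and auc_from_curve are only illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Same data and split as in the question.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=12
)

clf = SVC(kernel="rbf")  # probability=False by default
clf.fit(X_train, y_train)

# AUC from hard 0/1 labels, as in the manually built legend label.
auc_from_labels = roc_auc_score(y_test, clf.predict(X_test))

# AUC from continuous scores (signed decision values of the SVM).
# Assumption: this is the score the plotting helper uses for this model.
scores = clf.decision_function(X_test)
auc_from_scores = roc_auc_score(y_test, scores)

# The same number, obtained by integrating the ROC curve explicitly.
fpr, tpr, _ = roc_curve(y_test, scores)
auc_from_curve = auc(fpr, tpr)

print("AUC from predict():           %.2f" % auc_from_labels)
print("AUC from decision_function(): %.2f" % auc_from_scores)
print("AUC from roc_curve + auc:     %.2f" % auc_from_curve)
```

The last two values should agree with each other and with the AUC shown in the automatic legend, while the first one is the number that ended up in the hand-written label. Note that plot_roc_curve was deprecated in scikit-learn 1.0 and removed in 1.2; RocCurveDisplay.from_estimator is its replacement.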

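【Comments】:

  • Thanks. I think using .predict is more common in research work, right?
  • @DavidWs. No. A ROC curve needs predicted probabilities; using hard class predictions is incorrect.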
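One way to see why hard predictions are a poor input for a ROC curve is to count the thresholds that roc_curve can sweep over. The sketch below reuses the synthetic data from the answer; it uses decision_function() rather than predict_proba() purely as an assumption to avoid refitting with probability=True, and the names thr_hard and thr_soft are illustrative.

```python
from sklearn import datasets, metrics, model_selection, svm

X, y = datasets.make_classification(random_state=0)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, random_state=0
)
clf = svm.SVC(random_state=0).fit(X_train, y_train)

# Hard labels take only the values 0 and 1, so the "curve" has at most
# three points: the two corners plus a single operating point.
fpr_hard, tpr_hard, thr_hard = metrics.roc_curve(y_test, clf.predict(X_test))

# Continuous scores give one candidate threshold per distinct score value.
# drop_intermediate=False keeps every threshold so the count is comparable.
fpr_soft, tpr_soft, thr_soft = metrics.roc_curve(
    y_test, clf.decision_function(X_test), drop_intermediate=False
)

print("points on the curve from predict():          ", len(thr_hard))
print("points on the curve from decision_function():", len(thr_soft))
```

With hard labels the area is computed under a single straight-line interpolation through one operating point, which is why it generally differs from the AUC obtained from scores.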