Sklearn's roc_auc_score for multilabel binary classification (answers)

【Question title】: Sklearn's roc_auc_score for multilabel binary classification
【Posted】: 2018-08-01 17:38:11
【Question】:

Edited to include an MWE:

I am trying to compute roc_auc_score.

This is the error I get:

Traceback (most recent call last):
  File "Feb22so.py", line 58, in <module>
    test_roc_vals(od)   
  File "Feb22so.py", line 29, in test_roc_vals
    roc_values.append(roc_auc_score(target, pred))
  File "/user/pkgs/anaconda2/lib/python2.7/site-packages/sklearn/metrics/ranking.py", line 260, in roc_auc_score
    sample_weight=sample_weight)
  File "/user/pkgs/anaconda2/lib/python2.7/site-packages/sklearn/metrics/base.py", line 127, in _average_binary_score
    sample_weight=score_weight)
  File "/user/pkgs/anaconda2/lib/python2.7/site-packages/sklearn/metrics/ranking.py", line 251, in _binary_roc_auc_score
    raise ValueError("Only one class present in y_true. ROC AUC score "
ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.

Here is an MWE version of my code.

from scipy.sparse import csr_matrix
import numpy as np
from collections import OrderedDict
from sklearn.metrics import roc_auc_score

def test_roc_vals(od):
        #od will be an OrderedDict with integer keys and scipy.sparse.csr_matrix OR list values
        #if the value is a list, it will be empty.
        #a scipy.sparse.csr_matrix may have only 0s or only 1s
        roc_values = []
        for i in range(len(od.keys())-1):
                print "i is: ", i,
                target = od[od.keys()[i+1]]
                pred = od[od.keys()[i]]

                if isinstance(target, list) or isinstance(pred, list):
                        print 'one of them is a list: cannot compute roc_auc_score'
                        continue
                else:   
                        target = target.toarray()
                        pred = pred.toarray()
                        if len(np.unique(target)) != 2 or len(np.unique(pred)) !=2:
                                print 'either target or pred or both contain only one class: cannot compute roc_auc_score'
                                continue
                        else:   
                                roc_values.append(roc_auc_score(target, pred))
        return 0

if __name__ == '__main__':

        #Generate some fake data
        #This makes an OrderedDict of 20 scipy.sparse.csr_matrix objects, with 10 rows and 10 columns and binary values
        od = OrderedDict()
        for i in range(20):
                row = np.random.randint(10, size=10)
                col = np.random.randint(10, size=10)
                data = np.random.randint(2, size=10)
                sp_matrix = csr_matrix((data, (row, col)), shape=(10, 10))
                od[i] = sp_matrix

        #Now let's include some empty lists at the end of the Ordered Dict.

        for j in range(20, 23):
                od[j] = []

        #Calling the roc_auc_score function on all non-list values that have at least one instance of each 0/1 class
        test_roc_vals(od)

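I don't understand why my if/else doesn't catch the "only one class" case. Or is the error perhaps caused by something else?

Old:

I can't find this in the docs. Is there a minimum number of instances per class for roc_auc_score in sklearn?

I can't compute it even when the under-represented class has 10 examples.

【Comments】:

What do you mean by "having trouble"? Post the code and sample data along with the full error (if any).
@VivekKumar Please see the edit. I have included the error and an MWE.
OK, there were a few problems with my MWE. It turns out that, according to this (stackoverflow.com/questions/48931762/…), the way I initialized the rows and columns of the sparse matrix put some "2"s into the data. Also, I should be calling flatten() on the true and pred values.

Tags: python scikit-learn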

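【Solution 1】:

Two things were wrong:

1) In the multilabel setting, don't forget to call flatten() on both the target and the prediction arrays.

2) When generating the MWE data, the csr_matrix constructor goes through coo_matrix and sums all duplicate values that share the same row/column index (see sascha's answer), so two 1s at the same position become a 2.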
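To make the two points concrete, here is a minimal sketch. The small arrays are made up for illustration and are not the question's data. It shows that csr_matrix sums entries sharing a (row, col) index, and that a 2-D 0/1 target can pass a global np.unique check while one of its columns is constant. That matters because roc_auc_score treats a 2-D 0/1 array as multilabel input and scores each column separately.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics import roc_auc_score

# Point 2: duplicate (row, col) pairs are summed by the constructor,
# so "binary" data can silently end up containing a 2.
row = np.array([0, 0, 1])
col = np.array([0, 0, 1])
data = np.array([1, 1, 1])
dense = csr_matrix((data, (row, col)), shape=(2, 2)).toarray()
# dense[0, 0] is 1 + 1 = 2 because position (0, 0) appears twice

# Point 1: the array as a whole has two classes, but column 1 has only one.
target = np.array([[1, 0],
                   [0, 0],
                   [1, 0]])
pred = np.array([[1, 0],
                 [0, 1],
                 [0, 0]])
global_classes = len(np.unique(target))                       # 2
per_column_classes = [len(np.unique(c)) for c in target.T]    # [2, 1]

# After flattening, every entry is one sample of a single binary problem.
flat_auc = roc_auc_score(target.flatten(), pred.flatten())
```

Flattening changes what is being measured (one pooled score instead of a per-label average), so it only makes sense if pooling all entries is what you want.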

from scipy.sparse import csr_matrix
import numpy as np
from collections import OrderedDict
from sklearn.metrics import roc_auc_score

def test_roc_vals(od):
        #od will be an OrderedDict with integer keys and scipy.sparse.csr_matrix OR list values
        #if the value is a list, it will be empty.
        #a scipy.sparse.csr_matrix may have only 0s or only 1s
        roc_values = []
        for i in range(len(od.keys())-1):
                print "i is: ", i,
                target = od[od.keys()[i+1]]
                pred = od[od.keys()[i]]

                if isinstance(target, list) or isinstance(pred, list):
                        print 'one of them is a list: cannot compute roc_auc_score'
                        continue
                else:
                        target = target.toarray().flatten()
                        pred = pred.toarray().flatten()
                        if len(np.unique(target)) != 2 or len(np.unique(pred)) != 2:
                                print 'either target or pred or both contain only one class: cannot compute roc_auc_score'
                                continue
                        else:
                                roc_values.append(roc_auc_score(target, pred))

        return roc_values

if __name__ == '__main__':

        #Generate some fake data
        #This makes an OrderedDict of 20 scipy.sparse.csr_matrix objects, with 3 rows and 3 columns and binary values
        #Every (row, col) position appears exactly once, so no duplicates get summed
        od = OrderedDict()
        for i in range(20):
                row = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
                col = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
                data = np.random.randint(2, size=9)
                sp_matrix = csr_matrix((data, (row, col)), shape=(3, 3))
                od[i] = sp_matrix

        #Now let's include some empty lists at the end of the Ordered Dict.

        for j in range(20, 23):
                od[j] = []

        #Calling the roc_auc_score function on all non-list values that have at least one instance of each 0/1 class
        rocvals = test_roc_vals(od)
        print rocvals

Or, more concisely, you can use try/except instead of the multiple if statements, like so:

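        try:
                roc_values.append(roc_auc_score(target.toarray().flatten(), pred.toarray().flatten()))
        except (AttributeError, ValueError):
                # AttributeError: the value is a list (no toarray()); ValueError: ROC AUC is undefined
                continue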
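For anyone who wants the try/except idea as something reusable, here is a rough sketch written for Python 3. The helper names `_to_flat` and `safe_roc_auc` are made up for this example and are not part of scikit-learn. I believe some newer scikit-learn releases return NaN with a warning instead of raising when only one class is present, so the sketch treats NaN as "undefined" as well. Take that part as an assumption and check it against your installed version.

```python
import math

import numpy as np
from sklearn.metrics import roc_auc_score


def _to_flat(values):
    """Return a 1-D NumPy array from a sparse matrix, dense array, or list."""
    if hasattr(values, "toarray"):  # scipy.sparse matrices expose toarray()
        values = values.toarray()
    return np.asarray(values).ravel()


def safe_roc_auc(target, pred):
    """Return the pooled ROC AUC, or None when the score is undefined."""
    y_true = _to_flat(target)
    y_score = _to_flat(pred)
    if y_true.size == 0:  # e.g. the empty lists at the end of the OrderedDict
        return None
    try:
        score = roc_auc_score(y_true, y_score)
    except ValueError:  # raised when y_true is not a valid binary target
        return None
    # Assumption: a NaN result also means "undefined" (see note above).
    return None if math.isnan(score) else score
```

Inside the loop this would look like `score = safe_roc_auc(target, pred)`, followed by appending to `roc_values` only when `score is not None`.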

【Comments】:

【Solution 2】:

Your y_true set must contain more than one label. For example, it should be y_true = [1,1,0,0] rather than y_true = [1,1,1,1] or y_true = [0,0,0,0].
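A quick way to check this before calling roc_auc_score is sketched below. The helper `is_valid_binary_target` and the scores in `y_score` are invented for illustration. Comparing against the exact label set {0, 1} also rejects a target that accidentally contains a 2, which was the other problem in the question.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def is_valid_binary_target(y_true):
    """True only if y_true contains exactly the labels 0 and 1."""
    return set(np.unique(y_true).tolist()) == {0, 1}


y_score = [0.9, 0.4, 0.6, 0.1]  # made-up prediction scores
candidates = [
    ("mixed", [1, 1, 0, 0]),
    ("all ones", [1, 1, 1, 1]),
    ("all zeros", [0, 0, 0, 0]),
    ("contains a 2", [2, 1, 0, 0]),
]

results = {}
for name, y_true in candidates:
    if is_valid_binary_target(y_true):
        results[name] = roc_auc_score(y_true, y_score)
    else:
        results[name] = None  # ROC AUC is not defined for this target
```

Only the "mixed" target gets a score here; the other three are skipped without ever reaching roc_auc_score.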

【Comments】: