Faster AUC in sklearn or Python: answers

【Question title】: Faster AUC in sklearn or python
【Posted】: 2017-01-03 13:57:13
【Question】:

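I have more than half a million pairs of true labels and predicted scores (the 1d arrays vary in length, roughly 10,000-30,000 elements each), and I need to compute the AUC for each pair. Right now, I have a for loop that calls: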

# Simple Example with two pairs of true/predicted values instead of 500,000
from sklearn import metrics
import numpy as np

pred = [None] * 2
pred[0] = np.array([3,2,1])
pred[1] = np.array([15,12,14,11,13])

true = [None] * 2
true[0] = np.array([1,0,0])
true[1] = np.array([1,1,1,0,0])

for i in range(2):
    fpr, tpr, thresholds = metrics.roc_curve(true[i], pred[i])
    print(metrics.auc(fpr, tpr))

However, it takes about 1-1.5 hours to process the entire dataset and compute the AUC for each true/pred pair. Is there a faster/better way to do this?

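Update

Each of the 500k entries can have shape (1, 10k+). I know I could parallelize this, but I am stuck on a machine with only two processors, so my time can realistically only be cut to 30-45 minutes, which is still too long. I have found that the AUC computation itself is slow, and I am hoping to find a faster AUC algorithm than the one available in sklearn. Or, at the very least, a better way to vectorize the AUC computation so that it can be broadcast across multiple rows.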

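【Comments】:

So to be clear, your pred and true arrays are each 500k entries long, and each entry is an np.array of shape (10k, 1)? If so, you want to compute the AUC for each pred[i]/true[i] combination, i.e. each computation is independent?
How long does it take to compute the AUC for one pair? How much time t can each pair take to meet your needs (i.e. t * 0.5e6 < T_max)? I assume you have already weighed the cost/benefit of spending time optimizing this against buying a bigger machine (or renting one on AWS for the duration of the computation).

Tags: python scikit-learn data-science auc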

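【Solution 1】:

Is there a faster/better way to do this?

Since the computation for each true/pred pair is independent (if I understand your setup correctly), you should be able to reduce the total processing time by using multiprocessing, which effectively parallelizes the computation: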

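import multiprocessing as mp
from sklearn import metrics

def roc(v):
    """ calculate one pair, return (index, auc) """
    i, true, pred = v
    fpr, tpr, thresholds = metrics.roc_curve(true, pred, drop_intermediate=True)
    auc = metrics.auc(fpr, tpr)
    return i, auc

# true and pred are the lists of arrays defined in the question
if __name__ == '__main__':  # guard needed where worker processes are spawned (e.g. Windows)
    pool = mp.Pool(3)  # pool of 3 worker processes
    result = pool.map_async(roc, ((i, true[i], pred[i]) for i in range(2)))
    pool.close()  # no more tasks will be submitted
    pool.join()   # wait for the workers to finish
    print(result.get())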
=>
[(0, 1.0), (1, 0.83333333333333326)]

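Here Pool(3) creates a pool of 3 processes, and .map_async maps over all true/pred pairs and calls the roc function, passing one pair at a time. The index is sent along so that results can be mapped back to their inputs.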

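If the true/pred pairs are too large to serialize and send to the worker processes, you may need to write the data to some external data structure before calling roc, pass only the index i as a reference, and read each true[i]/pred[i] pair inside roc before processing it.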
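One way to read the suggestion above is sketched below for Python 3: each pair is written to its own file, and the worker receives only a directory and an index. The file layout (pair_<i>.npz) and the names save_pairs and roc_from_disk are hypothetical choices for illustration, not part of the original answer. The sketch uses the builtin map so that it runs anywhere; the same worker should be usable with pool.map or pool.map_async.

```python
import os
import tempfile

import numpy as np
from sklearn import metrics


def save_pairs(directory, true, pred):
    """Write each true/pred pair to its own file (hypothetical layout: pair_<i>.npz)."""
    for i, (t, p) in enumerate(zip(true, pred)):
        np.savez(os.path.join(directory, "pair_%d.npz" % i), true=t, pred=p)


def roc_from_disk(args):
    """Worker: receives only (directory, index), loads that pair, returns (index, auc)."""
    directory, i = args
    with np.load(os.path.join(directory, "pair_%d.npz" % i)) as data:
        auc = metrics.roc_auc_score(data["true"], data["pred"])
    return i, float(auc)


# The two example pairs from the question
true = [np.array([1, 0, 0]), np.array([1, 1, 1, 0, 0])]
pred = [np.array([3, 2, 1]), np.array([15, 12, 14, 11, 13])]

with tempfile.TemporaryDirectory() as tmp:
    save_pairs(tmp, true, pred)
    # Only small (directory, index) tuples are handed to the worker
    results = list(map(roc_from_disk, ((tmp, i) for i in range(len(true)))))

print(results)
```

Whether this helps depends on how the cost of disk I/O compares with the cost of pickling the arrays, so it is worth timing both variants on a sample of the real data.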

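Pool manages the scheduling of processes automatically. To reduce the risk of excessive memory use, you may want to pass the maxtasksperchild argument, as in Pool(...., maxtasksperchild=1), which starts a fresh process for each true/pred pair (choose any other number as you see fit).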

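Update

I am stuck on a machine with only two processors

This is naturally a limiting factor. However, given that cloud computing resources are available at very reasonable cost and you pay only for the time you actually use, you may want to consider alternative hardware before spending hours optimizing a computation that parallelizes so effectively. That is a luxury in itself, really.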

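【Comments】:

【Solution 2】:

find a better way to vectorize the AUC computation so that it can be broadcast across multiple rows

Probably not. sklearn already uses efficient numpy operations to compute the relevant parts: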

# -- calculate tps, fps, thresholds
# sklearn.metrics.ranking:_binary_clf_curve()
(...)
distinct_value_indices = np.where(np.logical_not(isclose(
        np.diff(y_score), 0)))[0]
threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]
# accumulate the true positives with decreasing threshold
tps = (y_true * weight).cumsum()[threshold_idxs]
if sample_weight is not None:
    fps = weight.cumsum()[threshold_idxs] - tps
else:
    fps = 1 + threshold_idxs - tps
return fps, tps, y_score[threshold_idxs]

# -- calculate auc
# sklearn.metrics.ranking:auc()
...
area = direction * np.trapz(y, x)
...
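To make the excerpt above concrete, here is a minimal NumPy-only sketch of the same idea: one sort, a cumulative sum, and a trapezoid-rule integral. It is not sklearn's code. The function name auc_numpy is hypothetical, and the sketch assumes binary 0/1 labels, no sample weights, and at least one positive and one negative label in each array.

```python
import numpy as np


def auc_numpy(y_true, y_score):
    """ROC AUC via sort + cumsum + trapezoid rule (assumes 0/1 labels, both classes present)."""
    y_true = np.asarray(y_true, dtype=float)
    y_score = np.asarray(y_score, dtype=float)

    # Sort by decreasing score; a stable sort keeps the result deterministic
    order = np.argsort(-y_score, kind="mergesort")
    y_true, y_score = y_true[order], y_score[order]

    # Keep only the last index of each run of equal scores, so ties share one threshold
    distinct = np.nonzero(np.diff(y_score))[0]
    idx = np.r_[distinct, y_true.size - 1]

    tps = np.cumsum(y_true)[idx]   # true positives at each threshold
    fps = 1 + idx - tps            # false positives at each threshold

    # Prepend the (0, 0) point and normalise to rates
    tpr = np.r_[0.0, tps] / tps[-1]
    fpr = np.r_[0.0, fps] / fps[-1]

    # Trapezoid rule written out by hand
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


print(auc_numpy([1, 0, 0], [3, 2, 1]))
print(auc_numpy([1, 1, 1, 0, 0], [15, 12, 14, 11, 13]))
```

The sort is still the dominant cost here, so any gain over sklearn would come from skipping input validation and the extra bookkeeping, not from a different algorithm.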

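You may be able to speed this up by profiling these functions and removing operations that you can apply more efficiently beforehand. A quick profile of your example input scaled up to 5M rows reveals a few potential bottlenecks (marked >>>):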

# your for ... loop wrapped in function roc()
%prun -s cumulative roc
722 function calls (718 primitive calls) in 5.005 seconds
Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    5.005    5.005 <string>:1(<module>)
        1    0.000    0.000    5.005    5.005 <ipython-input-51-27e30c04d997>:1(roc)
        2    0.050    0.025    5.004    2.502 ranking.py:417(roc_curve)
        2    0.694    0.347    4.954    2.477 ranking.py:256(_binary_clf_curve)
     >>>2    0.000    0.000    2.356    1.178 fromnumeric.py:823(argsort)
     >>>2    2.356    1.178    2.356    1.178 {method 'argsort' of 'numpy.ndarray' objects}
        6    0.062    0.010    0.961    0.160 arraysetops.py:96(unique)
     >>>6    0.750    0.125    0.750    0.125 {method 'sort' of 'numpy.ndarray' objects}
     >>>2    0.181    0.090    0.570    0.285 numeric.py:2281(isclose)
        2    0.244    0.122    0.386    0.193 numeric.py:2340(within_tol)
        2    0.214    0.107    0.214    0.107 {method 'cumsum' of 'numpy.ndarray' objects}
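The %prun magic used above is only available in IPython. A rough equivalent with the standard library's cProfile is sketched below for Python 3. The input is random data chosen purely for illustration and is much smaller than the 5M rows in the profile above, so the timings will differ.

```python
import cProfile
import io
import pstats

import numpy as np
from sklearn import metrics


def roc(true, pred):
    """The question's loop, wrapped in a function so it can be profiled."""
    return [metrics.auc(*metrics.roc_curve(t, p)[:2]) for t, p in zip(true, pred)]


# Illustrative random input: two pairs of 200,000 labels/scores each
rng = np.random.default_rng(0)
n = 200_000
true = [rng.integers(0, 2, size=n) for _ in range(2)]
pred = [rng.random(n) for _ in range(2)]

profiler = cProfile.Profile()
profiler.enable()
aucs = roc(true, pred)
profiler.disable()

# Collect the report sorted by cumulative time, like `%prun -s cumulative`
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(15)
report = stream.getvalue()
print(report)
```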

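【Comments】:

【Solution 3】:

I came up with a vectorized way to compute ROC AUC. It is faster than sklearn's implementation. Suppose you have N examples, N_pos of them positive (+1) and N_neg negative (-1).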

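So N = N_pos + N_neg

Compute an index vector (I): [1, 2, 3, 4, ..., N]

Compute a ranking vector (R): [1, 0, 1, ..., 1]

The ranking vector holds your 0 and 1 labels in the positions where they end up once sorted by ranking score (for example, the predicted probabilities from a model). First place means that label has the highest score.

Now we have a vectorized way to compute ROC AUC:

See here for the details and a comparison with sklearn: https://medium.com/building-ibotta/understanding-roc-auc-part-2-2-a1e418a3afdb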
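The formula itself is not reproduced in this answer. As a hedged illustration of how the index vector I and the ranking vector R can be combined, the sketch below uses the standard rank-sum (Mann-Whitney U) identity. This is an assumption about the general shape of such a method and may differ from the formula in the linked article. The function name auc_rank_sum is hypothetical, positives are assumed to be labelled 1 (negatives may be 0 or -1), and tied scores are not handled.

```python
import numpy as np


def auc_rank_sum(y_true, y_score):
    """ROC AUC from the index vector I and ranking vector R (assumes no tied scores)."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)

    # R: labels ordered by decreasing score, so position 1 holds the highest score
    order = np.argsort(-y_score, kind="mergesort")
    R = (y_true[order] == 1).astype(np.int64)
    # I: 1-based positions [1, 2, ..., N]
    I = np.arange(1, R.size + 1)

    n_pos = int(R.sum())
    n_neg = R.size - n_pos

    # Sum of the positions occupied by positives
    pos_rank_sum = int(np.dot(I, R))
    # Number of (positive, negative) pairs in which the negative is ranked first
    misordered = pos_rank_sum - n_pos * (n_pos + 1) // 2

    return 1.0 - misordered / (n_pos * n_neg)


print(auc_rank_sum([1, 0, 0], [3, 2, 1]))
print(auc_rank_sum([1, 1, 1, 0, 0], [15, 12, 14, 11, 13]))
```

For the second example pair from the question, the positives sit at positions 1, 2 and 4, so one of the six positive/negative pairs is misordered and the result is 5/6, which matches the value printed in Solution 1.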

【Comments】: