If you specifically need to generate ROC curves for different thresholds, one approach is to generate a list of threshold values you're interested in and fit/transform on your dataset for each threshold. Or you can manually calculate the ROC curve point for each threshold using the probability field in the response from model.transform(test), as sketched below.
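For example, here is a minimal sketch of the manual approach. The names predictions, scored, and roc_point are illustrative, not part of any Spark API; it assumes predictions came from model.transform(test) and has label and probability columns:

# Score each row with its positive-class probability
scored = predictions.select('probability', 'label').rdd.map(
    lambda row: (float(row['probability'][1]), float(row['label']))).cache()

pos = scored.filter(lambda st: st[1] == 1.0).count()
neg = scored.count() - pos

def roc_point(threshold):
    # Rows predicted positive at this threshold, split by true label
    tp = scored.filter(lambda st: st[0] >= threshold and st[1] == 1.0).count()
    fp = scored.filter(lambda st: st[0] >= threshold and st[1] == 0.0).count()
    # (false positive rate, true positive rate)
    return (fp / float(neg), tp / float(pos))

# One (FPR, TPR) point per threshold of interest; each call triggers
# Spark jobs, so keep the threshold list short
roc_points = [roc_point(t / 10.0) for t in range(11)]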
Alternatively, you can use BinaryClassificationMetrics to extract curves plotting various metrics (F1 score, precision, recall) by threshold.
Unfortunately, it looks like the PySpark version doesn't implement most of the methods the Scala version does, so you'd need to wrap the class to do it in Python.
For example:
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Scala version implements .roc() and .pr()
# Python: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/common.html
# Scala: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.html
class CurveMetrics(BinaryClassificationMetrics):
    def __init__(self, *args):
        super(CurveMetrics, self).__init__(*args)

    def _to_list(self, rdd):
        points = []
        # Note this collect could be inefficient for large datasets
        # considering there may be one probability per datapoint (at most)
        # The Scala version takes a numBins parameter,
        # but it doesn't seem possible to pass this from Python to Java
        for row in rdd.collect():
            # Results are returned as type scala.Tuple2,
            # which doesn't appear to have a py4j mapping
            points += [(float(row._1()), float(row._2()))]
        return points

    def get_curve(self, method):
        rdd = getattr(self._java_model, method)().toJavaRDD()
        return self._to_list(rdd)
Usage:
import matplotlib.pyplot as plt

# Map each prediction row to (positive-class probability, label)
preds = predictions.select('label', 'probability') \
    .rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))

# Returns a list of (false positive rate, true positive rate) points
points = CurveMetrics(preds).get_curve('roc')

plt.figure()
x_val = [x[0] for x in points]
y_val = [x[1] for x in points]
plt.title('ROC')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.plot(x_val, y_val)
Result: (ROC curve plot)
If you aren't married to ROC, here's an example F1 score by threshold curve:
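A sketch of how that might look with the same wrapper. The Scala BinaryClassificationMetrics exposes fMeasureByThreshold() (along with precisionByThreshold() and recallByThreshold()), which returns (threshold, F-measure) pairs, so assuming the same preds RDD as above:

# (threshold, F1 score) points via the Scala fMeasureByThreshold() method
points = CurveMetrics(preds).get_curve('fMeasureByThreshold')

plt.figure()
x_val = [x[0] for x in points]  # threshold
y_val = [x[1] for x in points]  # F1 score at that threshold
plt.title('F1 score by threshold')
plt.xlabel('Threshold')
plt.ylabel('F1 score')
plt.plot(x_val, y_val)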