PySpark & MLLib: Class Probabilities of Random Forest Predictions (Answers)

【Question title】: PySpark & MLLib: Class Probabilities of Random Forest Predictions
【Posted】: 2015-05-03 07:55:37
【Question】:

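I am trying to extract the class probabilities of a random forest object that I trained with PySpark. However, I do not see an example of it anywhere in the documentation, nor is it a method of RandomForestModel.

How can I extract class probabilities from a RandomForestModel classifier in PySpark?

Here is the sample code from the documentation, which only returns the final class (not the probability):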

from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils

# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
#  Note: Use larger numTrees in practice.
#  Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))

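I do not see any model.predict_proba() method. What should I do?

【Comments】:

Late, but there is a fork with a Scala solution: github.com/apache/spark/compare/master...mqk:master
This is now (mostly) resolved in the new Spark ML library: stackoverflow.com/questions/43631031/…

Tags: apache-spark pyspark random-forest apache-spark-mllib

【Answer 1】:

People have probably moved on from this post, but I ran into the same problem today while trying to compute the accuracy of a multi-class classifier against a training set. So I thought I would share my experience in case someone is still trying this with mllib ...

The probability (here, the percentage of correct predictions) can be computed fairly easily as follows: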

# Say you have a test set against which you want to run your classifier.
(trainingset, testset) = data.randomSplit([0.7, 0.3])
# I converted the Spark dataset containing the test data to pandas.
ptd = testset.toPandas()

# Now get a count of the number of labels matching the predictions.
# `predictions` must already be a local array or Series aligned with ptd.
# Here we had to change the labels to 0-9 as opposed to 1-10, since
# labels take the values 0 .. numClasses-1.
correct = ((ptd.label - 1) == predictions).sum()

m = ptd.shape[0]
# Multiply by 100.0 first so the division is never an integer division.
print(100.0 * correct / m)

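【Comments】:

【Answer 2】:

This is now available.

Spark ML provides:

a predictionCol that contains the predicted label,
a probabilityCol that contains a vector with the probability of each label, which is what you were looking for.
You can also access the raw counts.

For more details, see the Spark documentation: http://spark.apache.org/docs/latest/ml-classification-regression.html#output-columns-predictions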

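【Comments】:

Indeed - see the example here: stackoverflow.com/questions/43631031/…

【Answer 3】:

It will, however, be available with Spark 1.5.0 and the new Spark-ML API.

【Comments】:

【Answer 4】:

As far as I can tell, this is not supported in the current version (1.2.1). The Python wrapper (tree.py) around the native Scala code only defines "predict" functions, which in turn call their Scala counterparts (treeEnsembleModels.scala). The latter make their decision by taking a vote among the binary decisions of the individual trees. A much cleaner solution would be to provide a probabilistic prediction that can be thresholded arbitrarily or used for ROC computation, as in sklearn. This feature should be added in a future release!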
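
To make that idea concrete, here is a minimal, Spark-free sketch of what a probabilistic prediction could look like for a binary forest. The toy `tree_votes` data and the helper names `vote_fraction`, `threshold_predictions`, and `roc_points` are hypothetical and not part of MLlib. The sketch assumes every tree emits a hard 0/1 vote.

```python
# Illustrative only: plain Python, no Spark. Each inner list holds the hard
# 0/1 votes of every tree for one sample (hypothetical toy data).
tree_votes = [
    [1, 1, 1, 0],  # sample 0
    [0, 1, 0, 0],  # sample 1
    [1, 1, 0, 0],  # sample 2
    [0, 0, 0, 0],  # sample 3
]
labels = [1, 0, 1, 0]


def vote_fraction(votes):
    """Fraction of trees voting for the positive class, read as P(class = 1)."""
    return sum(votes) / float(len(votes))


def threshold_predictions(scores, threshold):
    """Turn scores into hard labels; scores >= threshold map to class 1."""
    return [1 if s >= threshold else 0 for s in scores]


def roc_points(scores, labels, thresholds):
    """Return one (false positive rate, true positive rate) pair per threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in thresholds:
        preds = threshold_predictions(scores, t)
        tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
        points.append((fp / float(neg), tp / float(pos)))
    return points


scores = [vote_fraction(v) for v in tree_votes]
print(scores)
print(roc_points(scores, labels, [0.25, 0.5, 0.75]))
```

Unlike a hard majority vote, the score lets the caller choose the operating point: sweeping the threshold traces out the ROC curve.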

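As a workaround, I implemented predict_proba as a pure Python function (see the example below). It is neither elegant nor very efficient, as it runs a loop over the individual decision trees in the forest. The trick - or rather a dirty hack - is to access the array of Java decision tree models and cast them into their Python counterparts. After that you can compute each individual model's predictions over the entire dataset and accumulate their sum in an RDD using 'zip'. Dividing by the number of trees gives the desired result. For large datasets, a loop over a small number of decision trees on the master node should be acceptable.

The code below is rather tricky because of the difficulties of integrating Python into Spark (which runs in Java). One should be very careful not to send any complex data to the worker nodes, which results in crashes due to serialization problems. No code that refers to the Spark context can run on a worker node. Also, no code that refers to any Java code can be serialized. For example, it may be tempting to use len(trees) instead of ntrees in the code below - bang! Writing such a wrapper in Java/Scala could be much more elegant, for example by running the loop over the decision trees on the worker nodes and thereby reducing communication costs.

The test function below demonstrates that predict_proba gives the same test error as the predict call used in the original example.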

from pyspark.mllib.tree import RandomForest, DecisionTreeModel


def predict_proba(rf_model, data):
    '''
    This wrapper overcomes the "binary" nature of predictions in the native
    RandomForestModel.
    '''

    # Collect the individual decision tree models by calling the underlying
    # Java model. These are returned as JavaArray defined by py4j.
    trees = rf_model._java_model.trees()
    # Keep the tree count as a plain Python number so that the lambdas sent
    # to the workers never reference a py4j object.
    ntrees = rf_model.numTrees()
    features = data.map(lambda x: x.features)
    scores = DecisionTreeModel(trees[0]).predict(features)

    # For each decision tree, apply its prediction to the entire dataset and
    # accumulate the results using 'zip'.
    for i in range(1, ntrees):
        dtm = DecisionTreeModel(trees[i])
        scores = scores.zip(dtm.predict(features))
        scores = scores.map(lambda x: x[0] + x[1])

    # Divide the accumulated scores by the number of trees
    return scores.map(lambda x: x / float(ntrees))


def testError(lap):
    # lap is an RDD of (label, prediction) pairs. Tuple-unpacking lambdas
    # such as "lambda (v, p): ..." are Python 2 only, so index the pair.
    testErr = lap.filter(lambda vp: vp[0] != vp[1]).count() / float(lap.count())
    print('Test Error = ' + str(testErr))


def testClassification(trainingData, testData):

    model = RandomForest.trainClassifier(trainingData, numClasses=2,
                                         categoricalFeaturesInfo={},
                                         numTrees=50, maxDepth=30)

    # Compute test error by thresholding probabilistic predictions
    threshold = 0.5
    scores = predict_proba(model, testData)
    pred = scores.map(lambda x: 0 if x < threshold else 1)
    lab_pred = testData.map(lambda lp: lp.label).zip(pred)
    testError(lab_pred)

    # Compute test error by comparing binary predictions
    predictions = model.predict(testData.map(lambda x: x.features))
    labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
    testError(labelsAndPredictions)

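All in all, this was a nice exercise for learning Spark!

【Comments】:

Thanks. It looks good, but your probabilities differ from what we get when we run RandomForest.trainRegressor() on the (binary) response feature and take the model's predictions as probabilities. Conceptually, how does your approach differ from just taking the regression output?
I did not consider or use random forests for regression. For classification, one can simply interpret the fraction of votes for the positive class as a probability, which is exactly what my code does. I do not know how to compute probabilistic predictions for regression.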
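
One possible reason for the discrepancy, sketched in plain Python with made-up numbers: it assumes a classification tree predicts the majority class of the leaf a sample reaches, while a regression tree trained on 0/1 labels predicts the mean label of that leaf. The list `leaf_positive_fraction` is hypothetical. In practice the two forests would also grow different trees, since they are trained with different impurity measures.

```python
# Hypothetical leaf statistics: for one test sample, the fraction of positive
# training labels in the leaf it reaches in each of five trees.
leaf_positive_fraction = [0.9, 0.6, 0.55, 0.4, 0.2]
n = float(len(leaf_positive_fraction))

# Classification: each tree casts a hard vote (majority class of its leaf),
# and the score is the fraction of trees voting for class 1.
hard_votes = [1 if f > 0.5 else 0 for f in leaf_positive_fraction]
vote_score = sum(hard_votes) / n

# Regression on 0/1 labels: each tree returns the leaf mean, i.e. the
# fraction itself, and the forest averages those means.
regression_score = sum(leaf_positive_fraction) / n

print(vote_score, regression_score)
```

Under these assumptions the vote fraction discards how confident each tree is, so the two scores generally differ even when they agree on which side of 0.5 the sample falls.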