Incorrect evaluations on Random Forest in Pyspark: answers

【Question title】: Incorrect evaluations on Random Forest in Pyspark
【Posted】: 2018-04-21 23:19:07
【Question】:

I am using logistic regression and a random forest to make predictions on a telecom churn dataset.

Here is the code snippet from my notebook:

# Raw string so the backslashes in the Windows path are not treated as escape sequences
data = spark.read.csv(r"D:\Shashank\CBA\Pyspark\Telecom_Churn_Data_SingTel.csv", header=True, inferSchema=True)
data.show(3)

This link shows, at a high level, the kind of data I am dealing with.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler

# Drop identifier-like columns that are not used as features
data = data.drop("State", "Area Code", "Phone Number")

# Index the categorical columns; the indexed churn flag becomes the label
intlPlanIndex = StringIndexer(inputCol="International Plan", outputCol="International Plan Index")
voiceMailPlanIndex = StringIndexer(inputCol="Voice mail Plan", outputCol="Voice mail Plan Index")
churnIndex = StringIndexer(inputCol="Churn", outputCol="label")

# Column names are kept exactly as in the original code, including trailing and double spaces
othercols = ["Account Length", "Num of Voice mail Messages", "Total Day Minutes", "Total Day Calls", "Total day Charge",
             "Total Eve Minutes", "Total Eve Calls", "Total Eve Charge", "Total Night Minutes", "Total Night Calls ",
             "Total Night Charge", "Total International Minutes", "Total Intl  Calls", "Total Intl Charge",
             "Number Customer Service calls "]
assembler = VectorAssembler(inputCols=["International Plan Index", "Voice mail Plan Index"] + othercols, outputCol="features")

(train, test) = data.randomSplit([0.8, 0.2])

# Fit the whole pipeline on the training split and score that same split
lrObj = LogisticRegression(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[intlPlanIndex, voiceMailPlanIndex, churnIndex, assembler, lrObj])
lrModel = pipeline.fit(train)
prediction_train = lrModel.transform(train)

# No metricName is given, so the evaluator uses its default metric (f1)
lr_Evaluator = MulticlassClassificationEvaluator()
lr_Evaluator.evaluate(prediction_train)

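This image shows the evaluation result for logistic regression.

I then repeated the same steps with a random forest classification model, and the evaluation came out at 94.4%. My result looks like this: Link to my Random Forest evaluation result

So far, everything looks fine. But I was curious to see how things were actually being predicted, so I printed my predicted values with the code below: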

selected = prediction_1.select("features", "Label", "Churn", "prediction")
for row in selected.collect():
    print(row)

The result looks something like the screenshot below: Link to image that shows the 2 results printed out for manual analysis
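
Scanning printed rows by eye gets unwieldy quickly. A compact alternative is to count each (label, prediction) pair, which is what `groupBy("label", "prediction").count()` would do on the prediction DataFrame. The sketch below shows the same idea in plain Python on a small made-up list of pairs; the `rows` data and the `confusion_counts` helper are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter


def confusion_counts(rows):
    """Count how often each (label, prediction) pair occurs."""
    return Counter((label, prediction) for label, prediction in rows)


# Hypothetical stand-in for pairs taken from the collected rows, e.g.
# [(r["label"], r["prediction"]) for r in selected.collect()]
rows = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]

counts = confusion_counts(rows)
for (label, prediction), n in sorted(counts.items()):
    print(f"label={label} prediction={prediction} count={n}")
```

Pairs where label and prediction differ are the misclassified rows, so two models with different scores should show different counts here.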

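I then copied the two cells shown in the link above into a comparison (diff) tool to see whether my predicted values differed. (I expected some differences, since the random forest evaluated better.)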

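But the comparison, in every tool I tried, shows that the predictions are identical. Yet the evaluation results differ: 83.6% for LogisticRegression against 94.4% for RandomForest.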

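Why is there no difference between the two sets of predictions generated by two different models, when the final evaluation with MulticlassClassificationEvaluator gives me different values?

【Comments】: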

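You should explain what the links you point to are. Something like "this is an image of the result" or "you can find the running code at the link below".
Hi @Jeremie, this is my first time on Stack Overflow, and I apologize for the lack of information. I have edited my post; I hope this helps clarify my question.
Welcome! No need to apologize! It happens to the best of us developers!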

Tags: pyspark random-forest pyspark-sql

【Solution 1】:

You seem to be interested in metricName="accuracy":

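from pyspark.ml.evaluation import MulticlassClassificationEvaluator
predictions = model.transform(test)
# labelCol must name the indexed label column produced by your own pipeline
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)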

For more information, see the official documentation.
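
One thing worth keeping in mind here: when no `metricName` is passed, `MulticlassClassificationEvaluator` reports a weighted F1 score rather than accuracy, and on imbalanced data the two can differ noticeably. The following plain-Python sketch computes both on a tiny made-up example; the data and the helper names `accuracy` and `weighted_f1` are assumptions for illustration, and the F1 here is weighted by true-label frequency.

```python
from collections import Counter


def accuracy(labels, preds):
    """Fraction of rows where the prediction equals the label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def weighted_f1(labels, preds):
    """Per-class F1, averaged with weights equal to each class's true-label count."""
    support = Counter(labels)
    total = 0.0
    for cls, n in support.items():
        tp = sum(l == cls and p == cls for l, p in zip(labels, preds))
        fp = sum(l != cls and p == cls for l, p in zip(labels, preds))
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n
    return total / len(labels)


# Made-up imbalanced data: 8 non-churners, 2 churners, and a model that always predicts 0
labels = [0.0] * 8 + [1.0] * 2
preds = [0.0] * 10

acc = accuracy(labels, preds)
f1 = weighted_f1(labels, preds)
print(f"accuracy={acc:.3f} weighted_f1={f1:.3f}")
```

When comparing two models, it helps to evaluate both with the same explicitly named metric so the numbers are directly comparable.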

【Comments】:

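Hi @Prem, from my limited understanding, I don't think the metric is what concerns me; rather, I am curious about the lack of difference between the predicted values of the two classification models: selected = prediction_1.select("features", "Label", "Churn", "prediction"); for row in selected.collect(): print(row)
@Shashank Can you run the suggested code and then confirm the accuracy of the two models?
Apologies, there is a difference in the data. The mistake was mine: I did not copy all of the data from the cells in Jupyter.
No problem :) I'd suggest not deleting this question, as it may help people looking for an LR/RF implementation.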

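【Solution 2】:

This question is no longer relevant: I can now see differences in the predictions, and they are consistent with the accuracy of the predictions under each model. The problem arose because the data I copied from the Jupyter notebook was incomplete.

Thank you for your time.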
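
Since the root cause was a copy-paste that dropped rows, a programmatic comparison may be more reliable than a diff tool. A minimal sketch, assuming both prediction lists come from the same rows in the same order (for example, collected from each model's `transform` output on the same DataFrame); the sample lists and the `differing_rows` helper are hypothetical:

```python
def differing_rows(preds_a, preds_b):
    """Return the positions at which two prediction lists disagree."""
    if len(preds_a) != len(preds_b):
        # A length mismatch usually means one side was truncated or filtered
        raise ValueError(f"length mismatch: {len(preds_a)} vs {len(preds_b)}")
    return [i for i, (a, b) in enumerate(zip(preds_a, preds_b)) if a != b]


# Made-up predictions standing in for the two models' collected output
lr_preds = [0.0, 0.0, 1.0, 0.0, 1.0, 0.0]
rf_preds = [0.0, 1.0, 1.0, 0.0, 0.0, 0.0]

differing = differing_rows(lr_preds, rf_preds)
print(f"{len(differing)} of {len(lr_preds)} rows differ: {differing}")
```

The explicit length check is the part that would have flagged an incomplete copy before any row-by-row comparison took place.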

【Comments】: