【Posted】:2018-04-21 23:19:07
【Question】:
I am using logistic regression and random forest to predict churn on a telecom churn dataset.
Here is the code snippet from my notebook:
# Raw string so the backslashes in the Windows path are not read as escape sequences
data=spark.read.csv(r"D:\Shashank\CBA\Pyspark\Telecom_Churn_Data_SingTel.csv", header=True, inferSchema=True)
data.show(3)
This link shows, at a high level, the kind of data I am dealing with.
data=data.drop("State").drop("Area Code").drop("Phone Number")
from pyspark.ml.feature import StringIndexer, VectorAssembler
intlPlanIndex = StringIndexer(inputCol="International Plan", outputCol="International Plan Index")
voiceMailPlanIndex = StringIndexer(inputCol="Voice mail Plan", outputCol="Voice mail Plan Index")
churnIndex = StringIndexer(inputCol="Churn", outputCol="label")
othercols=["Account Length", "Num of Voice mail Messages","Total Day Minutes", "Total Day Calls", "Total day Charge","Total Eve Minutes","Total Eve Calls","Total Eve Charge","Total Night Minutes","Total Night Calls ","Total Night Charge","Total International Minutes","Total Intl Calls","Total Intl Charge","Number Customer Service calls "]
assembler = VectorAssembler(inputCols= ['International Plan Index'] + ['Voice mail Plan Index'] + othercols, outputCol="features")
(train, test) = data.randomSplit([0.8,0.2])
from pyspark.ml.classification import LogisticRegression
lrObj = LogisticRegression(labelCol='label', featuresCol='features')
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[intlPlanIndex, voiceMailPlanIndex, churnIndex, assembler, lrObj])
lrModel = pipeline.fit(train)
prediction_train = lrModel.transform(train)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
lr_Evaluator = MulticlassClassificationEvaluator()
lr_Evaluator.evaluate(prediction_train)
This image shows the evaluation result for logistic regression.
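One thing worth checking when reading this number: as far as I know, `MulticlassClassificationEvaluator()` called with no arguments reports a weighted F1 score (`metricName="f1"`), not accuracy. The pure-Python sketch below is not Spark's implementation; it only illustrates how the two metrics can differ on the same predictions. The labels and predictions are made up, with the positive class rare as it usually is in churn data, and `accuracy` and `weighted_f1` are hypothetical helper names.

```python
from collections import Counter

def accuracy(labels, preds):
    """Fraction of rows where the prediction equals the label."""
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    return correct / len(labels)

def weighted_f1(labels, preds):
    """Per-class F1, averaged with each class weighted by its share of the labels."""
    support = Counter(labels)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        total += f1 * n / len(labels)
    return total

# Made-up data: 8 non-churners (0.0), 2 churners (1.0),
# and a model that predicts "no churn" for every row.
labels = [0.0] * 8 + [1.0] * 2
preds = [0.0] * 10

acc = accuracy(labels, preds)
f1 = weighted_f1(labels, preds)
print(f"accuracy={acc:.4f} weighted_f1={f1:.4f}")
```

On imbalanced data the two numbers can be far apart, so it helps to set `metricName` explicitly and use the same evaluator settings for both models before comparing their scores.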
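I then repeated the same steps with a Random Forest classification model, which evaluated to 94.4%. My result looks like this: Link to my Random Forest evaluation result
So far, everything looked fine. But I was curious how things were actually being predicted, so I printed my predicted values with the code below: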
selected = prediction_1.select("features", "Label", "Churn", "prediction")
for row in selected.collect():
    print(row)
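The result I get looks something like the screenshot below: Link to image that shows the 2 results printed out for manual analysis
I then copied the two cells shown in the link above into a comparison tool to see whether my predicted values differed. (I expected some differences, since the random forest evaluated better.)
But every comparison tool shows that the predictions are identical. Yet the evaluation results differ: 83.6% for LogisticRegression versus 94.4% for RandomForest.
Why is there no difference between the two sets of predictions generated by two different models, when the final evaluation with MulticlassClassificationEvaluator gives me different results?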
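Rather than diffing printed rows by eye, the two prediction columns can be compared directly. This is a minimal sketch, assuming the `prediction` values from each model have been collected into two plain Python lists in the same row order; `lr_preds`, `rf_preds`, and `mismatched_rows` are hypothetical names, and the values are made up.

```python
def mismatched_rows(preds_a, preds_b):
    """Return the indices where two equally long prediction lists disagree."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    return [i for i, (a, b) in enumerate(zip(preds_a, preds_b)) if a != b]

# Made-up values standing in for the two models' "prediction" columns.
lr_preds = [0.0, 0.0, 1.0, 0.0, 0.0]
rf_preds = [0.0, 1.0, 1.0, 0.0, 1.0]

diff_rows = mismatched_rows(lr_preds, rf_preds)
print(f"{len(diff_rows)} of {len(lr_preds)} rows differ: {diff_rows}")
```

If the mismatch list is empty and the labels are the same, any metric computed by the same evaluator has to come out identical. Differing scores alongside identical printed predictions would then suggest that both printouts came from the same predictions DataFrame rather than from the two different models.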
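【Comments】:
-
You should explain what the links you point to contain, for example "this is an image of the result" or "you can find the running code at the link below".
-
Hi @Jeremie, this is my first time on Stack, and I apologize for the lack of information. I have edited my post; I hope this helps clarify my question.
-
Welcome! No need to apologize! This is what works best for us developers!
Tags: pyspark random-forest pyspark-sql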