【Question Title】: pyspark-2.3 Spark ML LogisticRegression model load issue
【Posted】: 2018-12-12 17:46:06
【Question】:

I am working through a sample PySpark ML exercise in which I need to save a model and read it back. I can save the model successfully, but when I try to read/load it, an exception is thrown. I am new to Spark ML and Python, so please guide me.

Code:

from pyspark.sql import *
from pyspark.ml.feature import RFormula
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import TrainValidationSplit
from pyspark.ml.tuning import TrainValidationSplitModel


spark = SparkSession.builder.appName("LocalMLSparkSession").master("local").getOrCreate()

df = spark.read.json("/to_data/simpleml.json").orderBy("value2")

df.select(df.color).distinct().show(10, False)

train, test = df.randomSplit([0.7, 0.3])


rForm = RFormula()
ls = LogisticRegression().setLabelCol("label").setFeaturesCol("features")

# setting pipeline
stages = [rForm,ls]
pipeline = Pipeline().setStages(stages)


#setting param grid builder
params = ParamGridBuilder()\
     .addGrid(rForm.formula,["lab ~ . + color:value1", "lab ~ . + color:value1 + color:value2"])\
     .addGrid(ls.elasticNetParam, [0.0, 0.5, 1.0])\
     .addGrid(ls.regParam,[0.1, 0.2])\
     .build()

#setting evaluator
evaluator = BinaryClassificationEvaluator()\
            .setMetricName("areaUnderROC")\
            .setRawPredictionCol("prediction")\
            .setLabelCol("label")

#checking hyperparameters to train datasets
tvs = TrainValidationSplit()\
    .setTrainRatio(0.75)\
    .setEstimatorParamMaps(params)\
    .setEstimator(pipeline)\
    .setEvaluator(evaluator)


tvsFitted = tvs.fit(train)

evl = evaluator.evaluate(tvsFitted.transform(test))

tvsFitted.transform(test).select("features", "label", "prediction").show(10,False)

print(evl)
pip_model = tvsFitted.bestModel
pip_model.write().overwrite().save("/to_path/sparkml/model")

model = TrainValidationSplitModel().load("/to_path/sparkml/model")
model.transform(test)

Exception:

Traceback (most recent call last):
  File "/home/dd/dd/python-workspace/SparkMLPipelineDemo.py", line 59, in <module>
    model = TrainValidationSplitModel().load("/to_path/sparkml/model")
TypeError: __init__() missing 1 required positional argument: 'bestModel'

Process finished with exit code 1

【Question Comments】:

    Tags: python apache-spark apache-spark-ml


    【Solution 1】:

    Drop the parentheses when loading: `load` is a classmethod, so call it on the class itself rather than on a freshly constructed instance. Replace:

    model = TrainValidationSplitModel().load("/to_path/sparkml/model")

    with:

    model = TrainValidationSplitModel.load("/to_path/sparkml/model")
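    The parentheses matter because `load` is defined on the class: writing `TrainValidationSplitModel()` runs `__init__` first, and that constructor requires the fitted internals, which is exactly the `TypeError: __init__() missing 1 required positional argument` in the traceback. The same failure mode can be sketched in plain Python with a toy class (not Spark's actual implementation):

    ```python
    class ToyModel:
        """Mimics a Spark ML model whose __init__ needs the fitted internals."""
        def __init__(self, best_model):
            self.best_model = best_model

        @classmethod
        def load(cls, path):
            # In Spark ML, load() rebuilds the model from disk;
            # here a placeholder instance stands in for that.
            return cls(best_model=f"restored from {path}")

    # Correct: call load on the class itself.
    m = ToyModel.load("/to_path/sparkml/model")
    print(m.best_model)  # restored from /to_path/sparkml/model

    # Wrong: ToyModel() invokes __init__ before load() ever runs.
    try:
        ToyModel().load("/to_path/sparkml/model")
    except TypeError as err:
        print(err)  # same shape of error as in the traceback
    ```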

    【Comments】:

    • I tried the change you suggested and got: pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.tuning.TrainValidationSplitModel but found class name org.apache.spark.ml.PipelineModel'
    • Oh, I hadn't seen that. You are trying to load a saved PipelineModel as a TrainValidationSplitModel. Use PipelineModel.load("/to_path/sparkml/model") instead
    • I tried that and got this error: model = PipelineModel().load("/to_data/sparkml/model") TypeError: __init__() missing 1 required positional argument: 'stages'
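    The last error is the original mistake repeated: `PipelineModel()` calls the constructor (which needs `stages`) before `load` ever runs, so the fix is still to call the classmethod with no parentheses, `PipelineModel.load("/to_path/sparkml/model")`. When in doubt about which loader a saved directory expects, the metadata file under the model directory records the writer's class as one JSON object; a small hypothetical helper can extract it (the `class` field name follows what Spark 2.x's DefaultParamsWriter emits and should be treated as an assumption):

    ```python
    import json

    def saved_model_class(metadata_json: str) -> str:
        """Return the short class name recorded in a saved model's metadata.

        `metadata_json` is the text of <model_dir>/metadata/part-00000;
        the "class" key is an assumption based on Spark 2.x's writer output.
        """
        full_name = json.loads(metadata_json)["class"]
        # "org.apache.spark.ml.PipelineModel" -> "PipelineModel"
        return full_name.rsplit(".", 1)[-1]

    # Example metadata, shaped like the error message in the comments above.
    meta = '{"class": "org.apache.spark.ml.PipelineModel", "timestamp": 0}'
    print(saved_model_class(meta))  # PipelineModel
    ```

    Matching the short name against the Python class whose `.load()` you call avoids the "Expected class name ... but found class name ..." mismatch seen above.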