【Question Title】: Spark: access an estimator in a pipeline
【发布时间】:2016-11-12 16:57:49
【Question Description】:

Similar to Is it possible to access estimator attributes in spark.ml pipelines?, I want to access the estimator, e.g. the last element of a pipeline.

The approach mentioned there no longer seems to work with Spark 2.0.1. How does it work now?

EDIT

Perhaps I should explain in more detail. Here is my estimator plus vector assembler:

val numRound = 20
val numWorkers = 4
val xgbBaseParams = Map(
  "max_depth" -> 10,
  "eta" -> 0.1,
  "seed" -> 50,
  "silent" -> 1,
  "objective" -> "binary:logistic"
)

val xgbEstimator = new XGBoostEstimator(xgbBaseParams)
  .setFeaturesCol("features")
  .setLabelCol("label")

val vectorAssembler = new VectorAssembler()
  .setInputCols(train.columns
    .filter(!_.contains("label")))
  .setOutputCol("features")

val simplePipeParams = new ParamGridBuilder()
  .addGrid(xgbEstimator.round, Array(numRound))
  .addGrid(xgbEstimator.nWorkers, Array(numWorkers))
  .build()

val simplePipe = new Pipeline()
  .setStages(Array(vectorAssembler, xgbEstimator))

val numberOfFolds = 2
val cv = new CrossValidator()
  .setEstimator(simplePipe)
  .setEvaluator(new BinaryClassificationEvaluator()
    .setLabelCol("label")
    .setRawPredictionCol("prediction"))
  .setEstimatorParamMaps(simplePipeParams)
  .setNumFolds(numberOfFolds)
  .setSeed(gSeed)

val cvModel = cv.fit(train)
val trainPerformance = cvModel.transform(train)
val testPerformance = cvModel.transform(test)

Now I want to perform custom scoring, e.g. with a cutoff other than 0.5. This is possible if I can get hold of the model:

val realModel = cvModel.bestModel.asInstanceOf[XGBoostClassificationModel]

But this step does not compile. Thanks to the suggestions here, I can obtain the model like this:

val pipelineModel: Option[PipelineModel] = cvModel.bestModel match {
  case p: PipelineModel => Some(p)
  case _ => None
}

val realModel: Option[XGBoostClassificationModel] = pipelineModel
  .flatMap {
    _.stages.collect { case t: XGBoostClassificationModel => t }
      .headOption
  }

// TODO write it nicer
val measureResults = realModel.map { rm =>
  for (
    thresholds <- Array(Array(0.2, 0.8), Array(0.3, 0.7), Array(0.4, 0.6),
      Array(0.6, 0.4), Array(0.7, 0.3), Array(0.8, 0.2))
  ) {
    rm.setThresholds(thresholds)

    val predResult = rm.transform(test)
      .select("label", "probabilities", "prediction")
      .as[LabelledEvaluation]
    println("cutoff was ", thresholds)
    calculateEvaluation(R, predResult)
  }
}

But the problem is that

val predResult = rm.transform(test)

will fail, because test does not contain the features column produced by vectorAssembler. That column is only created when the full pipeline runs.

So I decided to create a second pipeline:

val scoringPipe = new Pipeline()
  .setStages(Array(vectorAssembler, rm))
val predResult = scoringPipe.fit(train).transform(test)

But this seems a bit clumsy. Do you have a better/nicer idea?
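One possible simplification: VectorAssembler is a pure Transformer (its behaviour does not depend on fitting), so the second `fit(train)` is unnecessary. A sketch, assuming the `realModel` extracted above and that XGBoostClassificationModel's `transform` only needs the assembled features column:

```scala
// VectorAssembler needs no fitting, so assemble the test set directly
// and score it with the model extracted from the cross-validated pipeline.
val assembledTest = vectorAssembler.transform(test)
val predResult = realModel.map(_.transform(assembledTest))
```

This avoids rebuilding and refitting a pipeline just to reuse a transformer that was never trained in the first place.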

【Comments】:

Tags: apache-spark pipeline


【Solution 1】:

Nothing has changed in Spark 2.0.x; the same approach still works. Example Pipeline:

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Prepare training documents from a list of (id, text, label) tuples.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

and the model:

val logRegModel = model.stages.last
  .asInstanceOf[org.apache.spark.ml.classification.LogisticRegressionModel]
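When the Pipeline is itself wrapped in a CrossValidator (as in the question), there is one extra level of nesting: `cvModel.bestModel` is typed as `Model[_]`, but for a Pipeline estimator it is a PipelineModel at runtime, so cast it first and then index into its stages. A sketch along the lines of the example above:

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.LogisticRegressionModel

// bestModel is a Model[_]; for a Pipeline estimator it is a PipelineModel,
// whose stages are the *fitted* transformers/models, last one being the classifier.
val bestPipeline = cvModel.bestModel.asInstanceOf[PipelineModel]
val logRegModel = bestPipeline.stages.last
  .asInstanceOf[LogisticRegressionModel]
```

The same pattern applies with XGBoostClassificationModel in place of LogisticRegressionModel.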

【Discussion】:

  • The problem is that I want to use the pipeline inside cross-validation, i.e. the estimator is nested twice. cvModel.bestModel.getStages does not work. So how do I get at the Pipeline inside the CrossValidator?