【Question Title】: Scala: how to modify the default metric for cross validation
【Posted】: 2019-05-15 14:31:31
【Question】:

I found the code below on this site: https://spark.apache.org/docs/2.3.1/ml-tuning.html

// Note that the evaluator here is a BinaryClassificationEvaluator and its default metric
// is areaUnderROC.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)  // Use 3+ in practice
  .setParallelism(2)  // Evaluate up to 2 parameter settings in parallel

As they say, the default metric of BinaryClassificationEvaluator is areaUnderROC. How can I change this default metric to the F1 score?

I tried:

// Note that the evaluator here is a BinaryClassificationEvaluator and its default metric
// is areaUnderROC.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator.setMetricName("f1"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)  // Use 3+ in practice
  .setParallelism(2)  // Evaluate up to 2 parameter settings in parallel

But I got some errors... I have searched many sites, but I could not find a solution...
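For reference, part of the error is purely syntactic: `.setMetricName` is called on the class name instead of an instance. A minimal corrected sketch (assuming `pipeline` and `paramGrid` are defined as in the linked guide); note that even with the syntax fixed, `BinaryClassificationEvaluator` only accepts "areaUnderROC" and "areaUnderPR", not "f1":

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidator

// Parentheses instantiate the evaluator before chaining setMetricName.
// Passing "f1" would throw here: only "areaUnderROC" and "areaUnderPR" are valid.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator().setMetricName("areaUnderPR"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)  // Use 3+ in practice
  .setParallelism(2)
```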

【Comments】:

    Tags: scala apache-spark cross-validation metrics evaluator


    【Solution 1】:

    setMetricName only accepts "areaUnderPR" or "areaUnderROC". You need to write your own Evaluator, like this:

    import org.apache.spark.ml.evaluation.Evaluator
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.ml.param.shared.{HasLabelCol, HasPredictionCol}
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.types.IntegerType
    import org.apache.spark.sql.{Dataset, functions => F}
    
    class FScoreEvaluator(override val uid: String) extends Evaluator with HasPredictionCol with HasLabelCol{
    
      def this() = this(Identifiable.randomUID("FScoreEvaluator"))
    
      // Computes the F1 score, assuming the label and prediction columns contain 0/1 values.
      def evaluate(dataset: Dataset[_]): Double = {
        // Aggregate expressions for the confusion-matrix counts
        val truePositive = F.sum(((F.col(getLabelCol) === 1) && (F.col(getPredictionCol) === 1)).cast(IntegerType))
        val predictedPositive = F.sum((F.col(getPredictionCol) === 1).cast(IntegerType))
        val actualPositive = F.sum((F.col(getLabelCol) === 1).cast(IntegerType))
    
        // Precision, recall and F1 built as column expressions, computed in a single pass
        val precision = truePositive / predictedPositive
        val recall = truePositive / actualPositive
        val fScore = F.lit(2) * (precision * recall) / (precision + recall)
    
        dataset.select(fScore).collect()(0)(0).asInstanceOf[Double]
      }
    
      override def copy(extra: ParamMap): Evaluator = defaultCopy(extra)
    }
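    A sketch of how this evaluator might be plugged into the cross-validation setup from the question (assuming `pipeline` and `paramGrid` are defined as there). The HasLabelCol and HasPredictionCol traits default the column names to "label" and "prediction", and Evaluator's default `isLargerBetter` is `true`, which is what you want for F1:

```scala
import org.apache.spark.ml.tuning.CrossValidator

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new FScoreEvaluator())  // reads "label" and "prediction" columns by default
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
```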
    

    【Comments】:

    • Thanks for your answer. However, it does not work. Is this Scala or PySpark...? I get some errors when running your code: "error: not found: type Evaluator", "error: not found: type HasPredictionCol", "error: not found: value F"...
    • @Anneso This is Scala. Did you run the import statements? Also, what version of Spark are you using? Sounds like
    • I am using Spark version 2.1.1.2.6.1.0-129 and Scala 2.11.8. Yes, I did run the import statements, and got no errors when doing so...
    • Yes, Spark
    • I am not sure I understand your last comment. Any thoughts on my Spark version?
    【Solution 2】:

    Based on @gmds' answer. Make sure your Spark version is >= 2.3.

    You can also follow the implementation of RegressionEvaluator in Spark to implement other custom evaluators.

    I also added isLargerBetter so that the instantiated evaluator can be used for model selection (e.g. CV).

    import org.apache.spark.ml.evaluation.Evaluator
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.ml.param.shared.{HasLabelCol, HasPredictionCol, HasWeightCol}
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.{Dataset, functions => F}
    
    class WRmseEvaluator(override val uid: String) extends Evaluator with HasPredictionCol with HasLabelCol with HasWeightCol {
    
        def this() = this(Identifiable.randomUID("wrmseEval"))
    
        def setPredictionCol(value: String): this.type = set(predictionCol, value)
        
        def setLabelCol(value: String): this.type = set(labelCol, value)
        
        def setWeightCol(value: String): this.type = set(weightCol, value)
        
        // Weighted RMSE: sqrt( sum(w * residual^2) / sum(w) )
        def evaluate(dataset: Dataset[_]): Double = {
            dataset
                .withColumn("residual", F.col(getLabelCol) - F.col(getPredictionCol))
                .select(
                    F.sqrt(F.sum(F.col(getWeightCol) * F.pow(F.col("residual"), 2)) / F.sum(F.col(getWeightCol)))
                )
                .collect()(0)(0).asInstanceOf[Double]
        }
    
        override def copy(extra: ParamMap): Evaluator = defaultCopy(extra)
    
        override def isLargerBetter: Boolean = false
    }
    
    

    Here is how to use it:

    val wrmseEvaluator = new WRmseEvaluator()
        .setLabelCol(labelColName)
        .setPredictionCol(predColName)
        .setWeightCol(weightColName)
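    As a hypothetical sanity check (assuming a SparkSession `spark` is in scope; the column names are just for this example), evaluating on a tiny hand-built DataFrame gives sqrt((2·0.2² + 1·0.2²)/3) = 0.2:

```scala
// Columns: label, prediction, weight (names are assumptions for this example)
val df = spark.createDataFrame(Seq(
  (1.0, 1.2, 2.0),
  (2.0, 1.8, 1.0)
)).toDF("label", "prediction", "weight")

val wrmse = new WRmseEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setWeightCol("weight")
  .evaluate(df)  // ≈ 0.2
```

    Since isLargerBetter is false, CrossValidator would select the parameter map that minimizes this value.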
    
    

    【Comments】:
