Cross validation using Pyspark: answers

【Question title】: Cross validation using Pyspark
【Posted】: 2021-01-19 04:47:27
【Question description】:

I am trying to use cross validation with Spark, but it throws an error:

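from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier, LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
gbtClassifier = GBTClassifier(featuresCol="features", labelCol="is_goal")
lr = LogisticRegression(featuresCol="features", labelCol="is_goal")
pipelineStages = stringIndexers + encoders + [featureAssembler]
pipeline = Pipeline(stages=pipelineStages)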

param_grid_lr = ParamGridBuilder().addGrid(lr.regParam, [0.1,0.01]).addGrid(lr.elasticNetParam, [0,0.5,1]).build()

crossval = CrossValidator(estimator=lr, estimatorParamMaps=param_grid_lr ,evaluator=BinaryClassificationEvaluator(), numFolds=3)

cross_model = crossval.fit(df_tr)

IllegalArgumentException: label does not exist. Available: event_type_str, event_team, shot_place_str, location_str, assist_method_str, situation_str, country_code, is_goal, event_type_str_idx, event_team_idx, shot_place_str_idx, location_str_idx, assist_method_str_idx, situation_str_idx, country_code_idx, event_type_str_vec, event_team_vec, shot_place_str_vec, location_str_vec, assist_method_str_vec, situation_str_vec, country_code_vec, features, CrossValidator_2fc516202d9d_rand, rawPrediction, probability, prediction

[This is what my features look like][1]

【Comments】:

Tags: apache-spark pyspark

【Solution 1】:

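By default, BinaryClassificationEvaluator expects the label column to be named label, as the documentation shows: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.BinaryClassificationEvaluator. You need to set rawPredictionCol and labelCol to match the columns in your DataFrame. Here that means passing labelCol="is_goal"; the default rawPredictionCol, rawPrediction, already appears in the list of available columns.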

【Discussion】:

Thanks! It worked: I renamed the "is_goal" column to label.