PySpark: Training a Random Forest Pipeline (Answers)

【Question title】: PySpark Training A Random Forest Pipeline
【Posted】: 2018-11-17 09:16:01
【Question】:

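I am trying to create a pipeline that takes my DataFrame of flight delay information and runs a random forest on it. I am fairly new to MLlib and cannot figure out where I am going wrong in the code below.

My DataFrame is read in from a parquet file in this format: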

Table before Encoding
+-----+-----+---+---+----+--------+-------+------+----+-----+-------+
|Delay|Month|Day|Dow|Hour|Distance|Carrier|Origin|Dest|HDays|Delayed|
+-----+-----+---+---+----+--------+-------+------+----+-----+-------+
|   -8|    8|  4|  2|  11|     224|     OO|   GEG| SEA|   31|      0|
|  -12|    8|  5|  3|  11|     224|     OO|   GEG| SEA|   32|      0|
|   -9|    8|  6|  4|  11|     224|     OO|   GEG| SEA|   32|      0|
+-----+-----+---+---+----+--------+-------+------+----+-----+-------+
only showing top 3 rows

I then OneHotEncode the categorical columns and combine all the features into a Features column (Delayed is what I want to predict). Here is the code:

import os
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier

spark = SparkSession.builder \
    .master('local[3]') \
    .appName('Flight Delay') \
    .getOrCreate()

# read in the pre-processed DataFrame from the parquet file
base_dir = '/home/william/Projects/flight-delay/data/parquet'
flights_df = spark.read.parquet(os.path.join(base_dir, 'flights.parquet'))

print('Table before Encoding')
flights_df.show(3)

# categorical columns that will be OneHotEncoded
cat_cols = ['Month', 'Day', 'Dow', 'Hour', 'Carrier', 'Dest']

# numeric columns that will be a part of features used for prediction
non_cat_cols = ['Delay', 'Distance', 'HDays']

# NOTE: StringIndexer does not have multiple col support yet (PR #9183 )
# Create StringIndexer for each categorical feature
cat_indexers = [ StringIndexer(inputCol=col, outputCol=col+'_Index')
                 for col in cat_cols ]

# OneHotEncode each categorical feature after being StringIndexed
encoders = [ OneHotEncoder(dropLast=False, inputCol=indexer.getOutputCol(),
                           outputCol=indexer.getOutputCol()+'_Encoded')
             for indexer in cat_indexers ]

# Assemble all feature columns (numeric + categorical) into `features` col
assembler = VectorAssembler(inputCols=[encoder.getOutputCol()
                                       for encoder in encoders] + non_cat_cols,
                            outputCol='Features')

# Train a random forest model
rf = RandomForestClassifier(labelCol='Delayed',featuresCol='Features', numTrees=10)

# Chain indexers, encoders, and forest into one pipeline
pipeline = Pipeline(stages=[ *cat_indexers, *encoders, assembler, rf ] )

# split the data into training and testing splits (70/30 rn)
(trainingData, testData) = flights_df.randomSplit([0.7, 0.3])

# Train the model -- which also runs the indexers and encoders
model = pipeline.fit(trainingData)

# use the fitted model to make predictions on the held-out split
predictions = model.transform(testData)

predictions.show(10)

When I run it, I get a Py4JJavaError: An error occurred while calling o46.fit. : java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double

Any help is greatly appreciated!

【Comments】:

Tags: python apache-spark pyspark apache-spark-mllib

【Solution 1】:

As explained in the comments, the label column should be a double, so you have to cast it:

from pyspark.sql.functions import col

flights_df = spark.read.parquet(os.path.join(base_dir, 'flights.parquet')) \
    .withColumn("Delayed", col("Delayed").cast("double"))

【Comments】:

Indeed (though I can't find the comments you are referring to). Spark ML/MLlib has some other similarly annoying and undocumented quirks; see here: nodalpoint.com/spark-classification