【Posted】: 2018-02-16 12:04:02
【Problem description】:
I'm fitting a linear regression model to my dataset. Before signing off on it, I'd like to try hyperparameter tuning to find the best model.
I've been running the data through a pipeline: first indexing the string column to numbers, then one-hot encoding it, then assembling all the columns into a feature vector, and finally scaling the features before applying linear regression. I'd like to know how to set up a grid to get the hyperparameter search rolling, so to speak.
import pyspark.ml.feature as ft
from pyspark.ml.feature import StandardScaler
from pyspark.ml.regression import LinearRegression

# Map the Wind_Direction strings to numeric indices
WD_indexer = ft.StringIndexer(inputCol="Wind_Direction", outputCol="WD-num")
# One-hot encode the indexed wind direction
WD_encoder = ft.OneHotEncoder(inputCol="WD-num", outputCol="WD-vec")
# Assemble all feature columns into a single vector
featuresCreator = ft.VectorAssembler(
    inputCols=["Dew_Point", "Temperature", "Pressure", "WD-vec",
               "Wind_Speed", "Hours_Snow", "Hours_Rain"],
    outputCol="features")
# Standardize the assembled feature vector
feature_scaler = StandardScaler(inputCol="features", outputCol="sfeatures")
lr = LinearRegression(featuresCol="sfeatures", labelCol="PM_Reading")
So the pipeline looks like this:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[WD_indexer, WD_encoder, featuresCreator, feature_scaler, lr])
How do I set up a parameter grid for this pipeline?
Thanks
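A minimal sketch of what I imagine the grid setup could look like, using PySpark's ParamGridBuilder and CrossValidator. The choice of hyperparameters to tune (regParam and elasticNetParam) and their candidate values are just assumptions for illustration; in my actual setup the estimator passed to CrossValidator would be the full pipeline above rather than the bare lr stage:

from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.master("local[1]").appName("grid-demo").getOrCreate()

lr = LinearRegression(featuresCol="sfeatures", labelCol="PM_Reading")

# Build the grid of candidate hyperparameter combinations:
# 2 regParam values x 3 elasticNetParam values = 6 candidate models
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
        .build())

# Evaluate each candidate with cross-validated RMSE
evaluator = RegressionEvaluator(labelCol="PM_Reading",
                                predictionCol="prediction",
                                metricName="rmse")

# In the real setup, estimator=pipeline would tune the whole pipeline
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=evaluator,
                    numFolds=3)
# best_model = cv.fit(train_df).bestModel  # train_df is the prepared data

Is this the right pattern, and is passing the whole pipeline as the estimator the recommended way?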
Tags: apache-spark pyspark apache-spark-mllib