How to execute both parallel and serial transformations with sklearn pipeline?

【Question title】: How to execute both parallel and serial transformations with sklearn pipeline?
【Posted】: 2022-01-17 17:20:31
【Question】:

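I would like to use sklearn's pipelines to perform some preprocessing like the one in this diagram.

If I drop the standardization step, I can do this without any problem. But I don't understand how to specify that the output of the imputation step should flow into the standardization step.

Here is the current code, without the standardization step: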

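from sklearn.compose import ColumnTransformer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# NumericImputation is a custom transformer; dq holds the lists of column names.
preprocessor = ColumnTransformer(
    transformers=[
        ("numeric_imputation", NumericImputation(), dq.numeric_variables),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), dq.categorical_variables),
    ],
    remainder="passthrough",
)

# Preprocessing followed by the regression model.
bp2 = make_pipeline(preprocessor, ElasticNet())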

【Comments】:

Tags: python scikit-learn

【Solution 1】:

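Indeed, ColumnTransformer applies its transformers in parallel to the dataset you pass to it. So if you add a transformer that standardizes the numeric data as a second entry in the transformers list, it will not be applied to the output of the imputation but to the original dataset.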
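The parallel behavior can be seen in a small sketch. It assumes scikit-learn's built-in SimpleImputer and StandardScaler as stand-ins for the custom NumericImputation and standardizer from the question, and a hypothetical one-column DataFrame; none of these appear in the original post.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one missing value.
X = pd.DataFrame({"x": [1.0, np.nan, 3.0]})

# Both transformers are listed side by side, so each one receives
# the original column "x", not the output of the other.
parallel = ColumnTransformer([
    ("impute", SimpleImputer(strategy="mean"), ["x"]),
    ("scale", StandardScaler(), ["x"]),
])
out = parallel.fit_transform(X)

# Column 0 is the imputed data; column 1 is the scaled ORIGINAL data,
# so the missing value is still missing there.
print(out)
```

The second output column still contains the missing value, which suggests the scaler never saw the imputed data.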

One way to solve this kind of problem is to wrap the transformations on the numeric columns in a Pipeline, so that they run in series.

from sklearn.pipeline import Pipeline

# Imputation, then standardization, run in series on the numeric columns.
num_pipe = Pipeline([("numeric_imputation", NumericImputation()),
                     ("standardizer", YourStandardizer())])
preprocessor = ColumnTransformer([
    ("num_pipe", num_pipe, dq.numeric_variables),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), dq.categorical_variables)],
    remainder="passthrough")
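For a version that runs end to end, here is a minimal sketch of the same layout. It assumes SimpleImputer and StandardScaler in place of NumericImputation and YourStandardizer, plain lists in place of dq.numeric_variables and dq.categorical_variables, and a hypothetical toy dataset; swap in your own classes and columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data (not from the question).
X = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "color": ["red", "blue", "red", "green"],
})
y = np.array([1.0, 2.0, 3.0, 4.0])

numeric_variables = ["age"]        # stand-in for dq.numeric_variables
categorical_variables = ["color"]  # stand-in for dq.categorical_variables

# Serial part: impute first, then standardize the imputed values.
num_pipe = Pipeline([
    ("numeric_imputation", SimpleImputer(strategy="mean")),
    ("standardizer", StandardScaler()),
])

# Parallel part: numeric and categorical branches run side by side.
preprocessor = ColumnTransformer(
    transformers=[
        ("num_pipe", num_pipe, numeric_variables),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_variables),
    ],
    remainder="passthrough",
)

model = make_pipeline(preprocessor, ElasticNet())
model.fit(X, y)

# Inspect the fitted numeric branch: no missing values remain after it.
fitted = model.named_steps["columntransformer"]
age_scaled = fitted.named_transformers_["num_pipe"].transform(X[numeric_variables])
print(age_scaled.ravel())
```

Because the nested Pipeline is itself a transformer, ColumnTransformer treats it like any other entry, and the standardizer only ever sees the imputed numeric columns.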

I'd suggest the following post on a similar topic:

(you will find some other links in it).

【Comments】: