How do I turn preprocessed data from pipelines into dataframes? Answers

【Question Title】: How do I turn preprocessed data from pipelines into dataframes?
【Posted】: 2021-12-16 12:58:14
【Question】:

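I have some code that preprocesses my data. Everything works cleanly until I have to pass the preprocessed data to a fit function that takes pandas DataFrames and arrays. How do I turn this training data into a DataFrame so I can feed it in? From the pipeline.fit() call onward, the data type is a ColumnTransformer, not a pandas df.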

Code:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# generate the data
data = pd.DataFrame({
    'y':  [1, 2, 3, 4, 5],
    'x1': [6, 7, 8, np.nan, np.nan],
    'x2': [9, 10, 11, np.nan, np.nan],
    'x3': ['a', 'b', 'c', np.nan, np.nan],
    'x4': [np.nan, np.nan, 'd', 'e', 'f']
})

# extract the features and target
x = data.drop(labels=['y'], axis=1)
y = data['y']

# split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

# map the features to the corresponding types (numerical or categorical)
numerical_features = x_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = x_train.select_dtypes(include=['object']).columns.tolist()

# define the numerical features pipeline
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# define the categorical features pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# define the overall pipeline
preprocessor_pipeline = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)
])

# fit the pipeline to the training data
preprocessor_pipeline.fit(x_train)

# apply the pipeline to the training and test data
x_train_ = preprocessor_pipeline.transform(x_train)
x_test_ = preprocessor_pipeline.transform(x_test)

Bonus: do I also need to preprocess my labels (y_train)?

【Question Comments】:

Tags: python pandas dataframe scikit-learn

【Solution 1】:

To turn your pipeline output into a DataFrame, you only need to do this:

x_train_df = pd.DataFrame(data=x_train_)
x_test_df = pd.DataFrame(data=x_test_)
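One plausible cause of the ValueError reported in the comments below is that transform() returned a SciPy sparse matrix, which ColumnTransformer can do when a OneHotEncoder is involved, and pd.DataFrame does not accept that directly. The following is a minimal sketch under that assumption, not a confirmed diagnosis. The helper name to_dataframe is hypothetical, the data is a cut-down version of the question's frame, and sparse_threshold=1.0 is set only to force the sparse case for demonstration.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def to_dataframe(transformed, fitted_transformer, index=None):
    """Hypothetical helper: wrap ColumnTransformer output in a DataFrame."""
    # transform() can return a scipy sparse matrix; densify it first,
    # because pd.DataFrame does not accept sparse matrices directly.
    if sparse.issparse(transformed):
        transformed = transformed.toarray()
    # Output column names need scikit-learn >= 1.0 (and every step must
    # support them); otherwise fall back to default integer column labels.
    try:
        columns = fitted_transformer.get_feature_names_out()
    except AttributeError:
        columns = None
    return pd.DataFrame(transformed, columns=columns, index=index)


# Small frame modelled on the question's data (object dtype forced for text).
x = pd.DataFrame({
    'x1': [6, 7, 8, np.nan, np.nan],
    'x3': pd.Series(['a', 'b', 'c', np.nan, np.nan], dtype=object),
})

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, ['x1']),
        ('cat', categorical_transformer, ['x3']),
    ],
    sparse_threshold=1.0,  # assumption: force sparse output to mimic the failing case
)

x_sparse = preprocessor.fit_transform(x)
x_df = to_dataframe(x_sparse, preprocessor, index=x.index)
print(x_df)
```

Passing the original index keeps the rows aligned with y_train after train_test_split. If the matrix is too large to densify, pd.DataFrame.sparse.from_spmatrix is an alternative that keeps the columns sparse.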

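Since your labels y are already numeric in most cases, no further preprocessing is needed. However, this also depends on the ML model you want to use in the next step.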
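As an illustration of a case where the labels would need preprocessing, the sketch below assumes a classification task with string class labels, which is not the situation in the question (its y is already numeric). The label values are made up for the example; LabelEncoder is scikit-learn's utility for encoding target labels as integers.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string class labels; the question's y is already numeric.
y_train = ['cat', 'dog', 'cat', 'bird']

encoder = LabelEncoder()
# Classes are sorted, so 'bird' -> 0, 'cat' -> 1, 'dog' -> 2.
y_train_encoded = encoder.fit_transform(y_train)
# inverse_transform maps the integer codes back to the original labels.
y_restored = encoder.inverse_transform(y_train_encoded)

print(list(y_train_encoded))
print(list(y_restored))
```

Fit the encoder on the training labels only and reuse it with transform() on the test labels so both share the same mapping.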

【Comments】:

When I do this, I get: raise ValueError("DataFrame constructor not properly called!") ValueError: DataFrame constructor not properly called!
@Luleo_Primoc I can't reproduce the error you reported; the suggested code works fine for me.