Create sklearn pipeline with column operations step

【Question title】: Create sklearn pipeline with column operations step
【Posted】: 2021-11-10 07:25:49
【Question】:

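I would like to know how to insert a step into an sklearn pipeline that multiplies the values of two columns and drops the original columns.

This is roughly what I am doing now: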

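After loading the dataframe, I multiply the two columns of interest and drop the originals.
Prepare X, Y, and the training and test sets.
Configure a pipeline with StandardScaler and some ML method (e.g. linear regression).
Fit and predict.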

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


# df is a pandas dataframe with columns A, B, C, Y
df['BC'] = df['B'] * df['C']  # product of columns B and C
df.drop(columns=['B','C'], inplace=True)

X = df.loc[:,['A','BC']]
Y = df['Y']

x_train, x_test, y_train, y_test = train_test_split(X,Y,train_size=0.8)

pipe = Pipeline([
    ('minmax',StandardScaler()),
    ('linear',LinearRegression())
])

pipe.fit(x_train,y_train)
y_pred = pipe.predict(x_test)

With this approach, whenever I want to predict on new data I have to do the multiplication by hand first. For example, for A=1, B=3, C=4:

print(pipe.predict(np.array([[1,12]])))

I would like to be able to pass the raw values instead:

print(pipe.predict(np.array([[1,3,4]])))

What I want is to change the pipeline into something like this:

pipe = Pipeline([
    ('product', CustomFunction(columns_to_multiply, result_name_column)),
    ('minmax',StandardScaler()),
    ('linear',LinearRegression())
])

Is this possible with scikit-learn or with a custom function? If so, how?

【Comments】:

Tags: python pandas dataframe scikit-learn pipeline

【Answer 1】:

Since the data is missing, I could not fully test your code. However, you can use FunctionTransformer, as shown below:

Code:

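import numpy as np
from sklearn.preprocessing import FunctionTransformer


def CustomMultiplier(arrs):
    # Keep the first column unchanged.
    a = arrs[:,0]
    # Multiply all remaining columns together, row by row.
    b = np.prod(arrs[:,1:], axis=1)
    return np.column_stack((a, b))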

if __name__ == '__main__':
    transformer = FunctionTransformer(CustomMultiplier)
    X = np.array([[1,3,4], [2,4,5]])
    result = transformer.transform(X)
    print(result)

Output:

[[ 1 12]
 [ 2 20]]
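
The answer only calls the transformer on its own. As a minimal sketch of how it might be used as the first step of the pipeline from the question, the example below fits on synthetic data that is invented purely for illustration (the relation Y = 2*A + 3*B*C, the random inputs, the function name `multiply_trailing_columns` and the step name `scaler` are all assumptions, not part of the original post). Like the answer's function, it assumes the column order is A, B, C.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def multiply_trailing_columns(arrs):
    """Keep column 0 and replace the remaining columns with their row-wise product."""
    arrs = np.asarray(arrs)
    return np.column_stack((arrs[:, 0], np.prod(arrs[:, 1:], axis=1)))


# Synthetic data, invented for illustration: columns are A, B, C
# and the target follows Y = 2*A + 3*B*C exactly.
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(200, 3))
y = 2 * X[:, 0] + 3 * X[:, 1] * X[:, 2]

pipe = Pipeline([
    ('product', FunctionTransformer(multiply_trailing_columns)),
    ('scaler', StandardScaler()),
    ('linear', LinearRegression()),
])
pipe.fit(X, y)

# Raw A, B, C values go in; the multiplication happens inside the pipeline.
prediction = pipe.predict(np.array([[1, 3, 4]]))
print(prediction)
```

Because the synthetic target is an exact linear function of A and B*C, the printed prediction should be very close to 2*1 + 3*12 = 38.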

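【Comments】:

Thanks for your answer! It is not exactly what I intended to do, but it inspired me. Later I found this article, which helped with my goal: towardsdatascience.com/…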
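
The linked article is truncated here, so its approach is not known. One possible way to get closer to the `CustomFunction(columns_to_multiply, result_name_column)` interface sketched in the question is a small custom transformer that operates on DataFrame columns by name. The class name `ColumnProduct` and the toy data below are hypothetical and only serve as an illustration; the sketch assumes the pipeline is always fed pandas DataFrames.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnProduct(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: multiply the given columns, store the
    product under a new name, and drop the original columns."""

    def __init__(self, columns_to_multiply, result_name_column):
        self.columns_to_multiply = columns_to_multiply
        self.result_name_column = result_name_column

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless.
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's DataFrame untouched
        X[self.result_name_column] = X[self.columns_to_multiply].prod(axis=1)
        return X.drop(columns=self.columns_to_multiply)


# Toy data invented for illustration, with Y = A + 2*B*C.
df = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0, 5.0],
    'B': [3.0, 1.0, 4.0, 1.0, 5.0],
    'C': [4.0, 2.0, 6.0, 5.0, 3.0],
})
y = df['A'] + 2 * df['B'] * df['C']

pipe = Pipeline([
    ('product', ColumnProduct(['B', 'C'], 'BC')),
    ('scaler', StandardScaler()),
    ('linear', LinearRegression()),
])
pipe.fit(df, y)

# New data is passed with the raw A, B, C columns.
new_data = pd.DataFrame({'A': [1.0], 'B': [3.0], 'C': [4.0]})
transformed = ColumnProduct(['B', 'C'], 'BC').fit_transform(new_data)
prediction = pipe.predict(new_data)
print(transformed)
print(prediction)
```

Selecting columns by name rather than by position means the step does not depend on column order, at the cost of requiring DataFrame input for both fitting and prediction.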