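Accessing the Kth fold per iteration in a pipeline

【Question title】: Accessing Kth fold per iteration in pipeline
【Posted】: 2022-01-10 16:05:06
【Question】:

I am building a pipeline that includes both a transformation and a classifier. However, the transformation I use is a custom function that needs to know the train/test split (Xtrain, Xtest) of any given iteration.

I want to use FunctionTransformer, because I believe it is what I need.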

def normalize1(data, mean, std):
   df = pd.DataFrame(data=data)

   if mean is None and std is None:
       mean = df.mean(axis=0)
       std = df.std(axis=0)
       normalizedDf = (df - mean)/std
       return normalizedDf.values, mean, std

   normalizedDf = (df - mean)/std
   return normalizedDf.values

From there, I defined the following pipeline:

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
clf = make_pipeline(FunctionTransformer(normalize1), 
                GridSearchCV(SVC(),
                             param_grid=paramGrid,
                             cv=cv,
                             refit=True))

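This obviously raises the following error: normalize1() missing 3 required positional arguments: 'data', 'mean', and 'std'

I would like my pipeline to take the train/test split of a given iteration, apply my normalization function, and pass the new X_train and X_test values to the grid search. Is there any way to do this?

FYI, the normalization function can be described as:

(variable - mean in training) / std in training

That is why the function computes the mean and standard deviation of the training cohort when mean is None and std is None, and applies the same normalization to the test set when it is given a mean and standard deviation.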
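
The formula above can be sketched on a single feature column as follows. This is only an illustration with made-up numbers; `fit_stats` and `apply_stats` are hypothetical helper names that do not appear in the question. `statistics.stdev` is used because it computes the sample standard deviation, which is also the default of pandas `df.std`.

```python
import statistics


def fit_stats(train_column):
    """Compute the mean and sample standard deviation from training values only."""
    return statistics.mean(train_column), statistics.stdev(train_column)


def apply_stats(column, mean, std):
    """Normalize any column with statistics that were computed on the training data."""
    return [(value - mean) / std for value in column]


train = [1.0, 2.0, 3.0]  # made-up training values
test = [4.0, 0.0]        # made-up test values

mean, std = fit_stats(train)
train_norm = apply_stats(train, mean, std)
test_norm = apply_stats(test, mean, std)  # reuses the training statistics
```

The point of the two-step design is that the test values never influence `mean` or `std`.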

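【Comments】:

Tags: python scikit-learn

【Solution 1】:

If your intent is "a custom function that needs to know the train/test split at any given iteration", then you need to place the transformer inside GridSearchCV. Your current code transforms the training data first and only then passes it to GridSearchCV.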

So first of all, you don't need to return mean and std:

def normalize1(data, mean, std):
    df = pd.DataFrame(data=data)

    if mean is None and std is None:
        mean = df.mean(axis=0)
        std = df.std(axis=0)
        normalizedDf = (df - mean)/std
        return normalizedDf.values

    normalizedDf = (df - mean)/std
    return normalizedDf.values

Then you supply the arguments for your function through the kw_args parameter of FunctionTransformer:

pipeline = make_pipeline(
    FunctionTransformer(normalize1, kw_args={'mean': None, 'std': None}),
    SVC(),
)
clf = GridSearchCV(pipeline, param_grid={}, cv=cv, refit=True)

We can fit it:

from sklearn.datasets import make_classification
X,y = make_classification()

clf.fit(X,y)
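
If the search is also meant to tune the SVC, as the question's `paramGrid` suggests (its contents are not shown), the grid keys need a step prefix once the classifier sits inside a pipeline. The following is a minimal sketch under that assumption: the grid values are illustrative, `n_repeats` is reduced to keep the run short, and `make_classification` stands in for the real data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC


def normalize1(data, mean, std):
    # Same logic as the function above, condensed
    df = pd.DataFrame(data=data)
    if mean is None and std is None:
        mean = df.mean(axis=0)
        std = df.std(axis=0)
    return ((df - mean) / std).values


X, y = make_classification(random_state=0)  # synthetic stand-in data
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)

pipeline = make_pipeline(
    FunctionTransformer(normalize1, kw_args={'mean': None, 'std': None}),
    SVC(),
)

# make_pipeline names each step after its lowercased class name,
# so SVC parameters are addressed as "svc__<parameter>".
param_grid = {'svc__C': [0.1, 1, 10], 'svc__kernel': ['linear', 'rbf']}  # illustrative values

clf = GridSearchCV(pipeline, param_grid=param_grid, cv=cv, refit=True)
clf.fit(X, y)
print(clf.best_params_)
```

With an unprefixed key such as `C`, GridSearchCV would try to set the parameter on the pipeline itself and fail, which is why the prefix matters.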

Alternatively, you can give the parameters default values:

def normalize1(data, mean = None , std = None):
    df = pd.DataFrame(data=data)

    if mean is None and std is None:
        mean = df.mean(axis=0)
        std = df.std(axis=0)
        normalizedDf = (df - mean)/std
        return normalizedDf.values
    
    normalizedDf = (df - mean)/std
    return normalizedDf.values

pipeline = make_pipeline(FunctionTransformer(normalize1), SVC())
clf = GridSearchCV(pipeline, param_grid={}, cv=cv, refit=True)
clf.fit(X, y)
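
One caveat applies to both variants: FunctionTransformer is stateless, so with mean and std left as None, normalize1 recomputes the statistics from whatever data it receives, including the held-out fold at prediction time. If the goal is to reuse the training-fold mean and std on the test fold, as the question describes, a transformer that learns them in fit is needed. Below is a minimal sketch under that assumption. `TrainStatsNormalizer` is a hypothetical name, not a scikit-learn class, and the grid values are illustrative. scikit-learn's own StandardScaler follows the same fit/transform pattern but divides by the population standard deviation (ddof=0) rather than the pandas default (ddof=1).

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


class TrainStatsNormalizer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: learns mean/std in fit, reuses them in transform."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        # ddof=1 matches the pandas default used by normalize1
        self.std_ = X.std(axis=0, ddof=1)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Always uses the statistics stored by fit, never those of X itself
        return (X - self.mean_) / self.std_


X, y = make_classification(random_state=0)  # synthetic stand-in data
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)

pipeline = make_pipeline(TrainStatsNormalizer(), SVC())
clf = GridSearchCV(pipeline, param_grid={'svc__C': [0.1, 1, 10]}, cv=cv, refit=True)
clf.fit(X, y)
```

During the search, GridSearchCV clones the pipeline for every split, calls fit on the training fold only, and then applies transform with the stored statistics when scoring the held-out fold.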

【Comments】: