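How to use a custom feature selection function in scikit-learn's "pipeline": answers

【Question title】: How can I use a custom feature selection function in scikit-learn's `pipeline`
【Posted】: 2021-06-08 15:23:26
【Question】:

Suppose I want to compare different dimensionality reduction approaches for a particular (supervised) dataset with n > 2 features, via cross-validation and using the pipeline class.

For example, if I want to experiment with PCA and LDA, I could do something like: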

from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.decomposition import PCA

# X_train and y_train are assumed to be defined already

clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classification', GaussianNB())
    ])

clf_pca = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA(n_components=2)),
    ('classification', GaussianNB())
    ])

clf_lda = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', LDA(n_components=2)),
    ('classification', GaussianNB())
    ])

# Constructing the k-fold cross validation iterator (k=10)

cv = KFold(n_splits=10,          # number of folds the dataset is divided into
           shuffle=True,
           random_state=123)

scores = [
    cross_val_score(clf, X_train, y_train, cv=cv, scoring='accuracy')
            for clf in [clf_all, clf_pca, clf_lda]
    ]

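But now, let's say that, based on some "domain knowledge", I have the hypothesis that features 3 and 4 might be "good features" (the third and fourth columns of the array X_train), and I want to compare them with the other approaches.

How would I include such a manual feature selection in the pipeline?

For example,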

def select_3_and_4(X_train):
    return X_train[:,2:4]

clf_all = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('feature_select', select_3_and_4),           
    ('classification', GaussianNB())   
    ])

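obviously doesn't work.

So I assume I have to create a feature selection class with a dummy fit method and a transform method that returns the two columns of the numpy array? Or is there a better way?

【Comments on the question】:

I know this is an old post, but anyone reading it should note that LDA is a classifier, not a transformer, so its use in this example is inappropriate.

Tags: python scikit-learn

【Answer 1】:

I just want to post my complete solution; maybe it is useful to someone: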

class ColumnExtractor(object):

    def transform(self, X):
        cols = X[:,2:4] # column 3 and 4 are "extracted"
        return cols

    def fit(self, X, y=None):
        return self

Then it can be used in the Pipeline like so:

clf = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('reduce_dim', ColumnExtractor()),           
    ('classification', GaussianNB())   
    ])
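For a version that can be run end to end, here is a minimal sketch. It assumes the iris dataset as a stand-in for X_train / y_train (it is not part of the original question) and, unlike the class above, inherits from TransformerMixin and BaseEstimator so that cloning during cross-validation works on recent scikit-learn releases.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnExtractor(TransformerMixin, BaseEstimator):
    """Keep only the third and fourth columns of a 2-D array."""

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        return X[:, 2:4]


# Stand-in data (assumption): iris has 150 samples and 4 features
X, y = load_iris(return_X_y=True)

clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("reduce_dim", ColumnExtractor()),
    ("classification", GaussianNB()),
])

cv = KFold(n_splits=10, shuffle=True, random_state=123)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(scores.mean())
```

Because the extractor is an ordinary pipeline step, it is re-applied inside every fold, exactly like PCA or LDA would be.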

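EDIT: general solution

And for a more general solution, if you want to select and stack multiple columns, you can basically use the following class: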

import numpy as np

class ColumnExtractor(object):

    def __init__(self, cols):
        self.cols = cols

    def transform(self, X):
        col_list = []
        for c in self.cols:
            col_list.append(X[:, c:c+1])
        return np.concatenate(col_list, axis=1)

    def fit(self, X, y=None):
        return self

clf = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('dim_red', ColumnExtractor(cols=(1,3))),   # selects the second and 4th column
    ('classification', GaussianNB())
    ])

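【Comments】:

So implementing fit and transform is enough to get a new feature transformation step that can be added to a pipeline?
Yes, that's all you need.
What if the extractor needs parameters? How would you add set_params?

【Answer 2】:

Adding to Sebastian Raschka's and eickenberg's answers, the requirements a transformer object should satisfy are specified in scikit-learn's documentation.

If you want the estimator to be usable in parameter estimation, there are several requirements beyond having fit and transform, such as implementing set_params.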
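As a rough illustration of those extra requirements, the sketch below inherits get_params and set_params from BaseEstimator instead of writing them by hand. The class name, the default value of cols, and the use of iris as data are assumptions made for this example.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline


class ColumnExtractor(TransformerMixin, BaseEstimator):
    def __init__(self, cols=(0,)):
        # Store the argument unchanged under its own name;
        # BaseEstimator.get_params() looks it up by that name.
        self.cols = cols

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X)[:, list(self.cols)]


extractor = ColumnExtractor(cols=(2, 3))
extractor.set_params(cols=(1, 3))      # inherited from BaseEstimator
print(extractor.get_params())
extractor_copy = clone(extractor)      # cloning relies on get_params()

X, y = load_iris(return_X_y=True)      # stand-in data (assumption)
pipe = Pipeline([("select", ColumnExtractor()), ("clf", GaussianNB())])

# Nested parameters are addressed as <step name>__<parameter name>
for cols in [(0, 1), (2, 3)]:
    pipe.set_params(select__cols=cols)
    print(cols, cross_val_score(pipe, X, y, cv=5).mean())
```

The same `select__cols` naming is what a parameter search over the pipeline would use.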

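【Comments】:

The recommended way to implement set_params is to inherit it from BaseEstimator, e.g. by declaring your class as class my_class(TransformerMixin, BaseEstimator). Don't write your own set_params method unless you are really sure you need to.
BaseEstimator is unknown! Can you edit the answer with your comment about set_params?
@user702846 I don't get what you mean. The documentation for BaseEstimator is available here - scikit-learn.org/stable/modules/generated/…

【Answer 3】:

If you want to use the Pipeline object, then yes, the clean way is to write a transformer object. The dirty way to do this is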

select_3_and_4.transform = select_3_and_4.__call__
# Pipeline calls fit(X, y), so the stub has to accept both arguments
select_3_and_4.fit = lambda X, y=None: select_3_and_4

and use select_3_and_4 as you had it in your pipeline. You can evidently also write a class.
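A less hacky way to reuse the plain function is to wrap it in sklearn.preprocessing.FunctionTransformer, assuming a scikit-learn version that ships it. The sketch below uses random toy data (made up for illustration) only to show that the wrapped function behaves like a regular pipeline step.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def select_3_and_4(X):
    # third and fourth columns (zero-based indices 2 and 3)
    return X[:, 2:4]


clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("feature_select", FunctionTransformer(select_3_and_4)),
    ("classification", GaussianNB()),
])

# Toy data, made up for illustration: 20 samples, 5 features, 2 classes
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(20, 5))
y_demo = np.array([0, 1] * 10)

clf.fit(X_demo, y_demo)

# Inspect what the classifier actually receives
scaled = clf.named_steps["scaler"].transform(X_demo)
selected = clf.named_steps["feature_select"].transform(scaled)
print(selected.shape)
```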

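Otherwise, if you know that the other features are irrelevant, you could also just pass X_train[:, 2:4] to your pipeline.

Data-driven feature selection tools may be off-topic here, but they are always useful: check e.g. sklearn.feature_selection.SelectKBest with sklearn.feature_selection.f_classif or sklearn.feature_selection.f_regression and, in your case, k=2.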
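A minimal sketch of that data-driven variant, assuming a classification task and using iris as stand-in data (not part of the original question):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)  # stand-in data (assumption)

clf_kbest = Pipeline(steps=[
    ("scaler", StandardScaler()),
    # keep the 2 features with the highest ANOVA F-score
    ("feature_select", SelectKBest(score_func=f_classif, k=2)),
    ("classification", GaussianNB()),
])

scores = cross_val_score(clf_kbest, X, y, cv=10, scoring="accuracy")
print(scores.mean())

# Which columns were chosen when fitting on the full data
clf_kbest.fit(X, y)
support = clf_kbest.named_steps["feature_select"].get_support(indices=True)
print(support)
```

Since the selector sits inside the pipeline, the features are re-selected on each training fold, which keeps the cross-validation estimate honest.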

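【Comments】:

Thanks. Meanwhile I tried writing a class with a transform method, and it seems to work.

【Answer 4】:

I didn't find the accepted answer very clear, so here is my solution for others. Basically, the idea is to create a new class based on BaseEstimator and TransformerMixin.

The following is a feature selector based on the fraction of NA values in a column. perc is the threshold on that fraction (0.1 means 10%); columns below it are kept.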

from sklearn.base import TransformerMixin, BaseEstimator

class NonNAselector(BaseEstimator, TransformerMixin):

    """Select columns whose fraction of NA values is below `perc`,
    so that they can be imputed further down the line.
    Class to use in the pipeline
    -----
    methods
    fit : identify those columns in the training set
    transform : keep only those columns
    """

    def __init__(self, perc=0.1):
        self.perc = perc
        self.columns_with_less_than_x_na_id = None

    def fit(self, X, y=None):
        na_fraction = X.isna().mean()  # fraction of NA values per column (X is a pandas DataFrame)
        self.columns_with_less_than_x_na_id = na_fraction[na_fraction < self.perc].index.tolist()
        return self

    def transform(self, X, y=None, **kwargs):
        return X[self.columns_with_less_than_x_na_id]

    def get_params(self, deep=False):
        return {"perc": self.perc}
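The selection rule itself (keep the columns whose NA fraction is below the threshold) can be checked on a toy DataFrame. The data and the 0.3 threshold below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: "a" has no NAs, "b" is half NA, "c" is a quarter NA
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 3.0, 4.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

na_fraction = df.isna().mean()   # per-column fraction of NA values
mask = na_fraction < 0.3         # boolean Series indexed by column name

# Apply the mask before taking the index ...
keep = na_fraction[mask].index.tolist()
# ... because the index of the mask itself still lists every column
all_columns = mask.index.tolist()

print(keep)
print(all_columns)
```

Here `keep` contains only "a" and "c", while `all_columns` contains all three names, so the mask has to be applied before reading off the column names.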

【Comments】:

【Answer 5】:

You can use the following custom transformer to select the specified columns:

# Custom transformer that extracts the columns passed as an argument to its constructor

from sklearn.base import BaseEstimator, TransformerMixin

class FeatureSelector( BaseEstimator, TransformerMixin ):

    #Class Constructor
    def __init__( self, feature_names ):
        # Stored under the same name as the argument so that get_params() and clone() work
        self.feature_names = feature_names

    #Return self nothing else to do here
    def fit( self, X, y = None ):
        return self

    #Method that describes what we need this transformer to do
    def transform( self, X, y = None ):
        return X[ self.feature_names ]

Here feature_names is the list of features you want to select. For more details, you can refer to this link [1]: https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65
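An end-to-end sketch of how such a name-based selector might be wired into a pipeline. The column names f1..f4 and the iris data are assumptions for this example, and the class is repeated so the snippet runs on its own.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class FeatureSelector(TransformerMixin, BaseEstimator):
    def __init__(self, feature_names):
        self.feature_names = feature_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.feature_names]  # X is expected to be a DataFrame


# Stand-in data (assumption): iris with hypothetical column names
X, y = load_iris(return_X_y=True)
X_df = pd.DataFrame(X, columns=["f1", "f2", "f3", "f4"])

clf = Pipeline(steps=[
    ("feature_select", FeatureSelector(["f3", "f4"])),
    ("scaler", StandardScaler()),
    ("classification", GaussianNB()),
])

scores = cross_val_score(clf, X_df, y, cv=5, scoring="accuracy")
print(scores.mean())
```

The selector is placed first on purpose: StandardScaler returns a NumPy array by default, which could no longer be indexed by column name.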

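【Comments】:

Thanks. This looks like an updated solution using recent packages, which matters because the question is from 2014 but still very relevant today. Could you add a complete end-to-end solution so that the answer is self-contained? E.g. add the pipeline call, the imports, etc.

【Answer 6】:

Another way is to simply use ColumnTransformer together with an "empty" FunctionTransformer: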

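from sklearn.compose import ColumnTransformer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# a FunctionTransformer with func=None yields the identity function / passthrough
empty_func = make_pipeline(FunctionTransformer(func=None))

clf_all = make_pipeline(StandardScaler(),
                        # zero-based indices of the third and fourth columns
                        ColumnTransformer([("select", empty_func, [2, 3])]),
                        GaussianNB(),
                        )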

This works because ColumnTransformer by default drops the remainder of the columns that aren't selected.
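The drop-by-default behaviour can be seen in isolation with the built-in "passthrough" option of ColumnTransformer, which avoids the identity transformer altogether. The toy array is made up for illustration, and the indices assume the third and fourth columns are wanted.

```python
import numpy as np
from sklearn.compose import ColumnTransformer

# Toy data: 4 samples, 5 features
X_demo = np.arange(20, dtype=float).reshape(4, 5)

# "passthrough" forwards the listed columns unchanged;
# all other columns are dropped because remainder="drop" is the default
select = ColumnTransformer([("select", "passthrough", [2, 3])])
X_sel = select.fit_transform(X_demo)

print(X_sel.shape)
```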

【Comments】: