RFECV with a pipeline containing ColumnTransformer: answers

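【Question title】: RFECV with a pipeline containing ColumnTransformer
【Posted】: 2021-09-21 11:53:49
【Question description】:

My question refers to the problem raised in the following similar, unanswered question: Using a Pipeline containing ColumnTransformer in SciKit's RFECV

I am trying to use RFECV to select the most relevant features, with a pipeline that contains a ColumnTransformer, using the following code: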

from sklearn.feature_selection import RFECV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PowerTransformer
from sklearn.compose import  ColumnTransformer
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.linear_model import HuberRegressor as hr

ind_nums_area=list(extra_enc_area_tr_data.columns.get_indexer(nums_area))

huber_prep_pipe = Pipeline([
       ('scaler',StandardScaler())
   ])
huber_col_transf= ColumnTransformer ([
    ('prep',huber_prep_pipe,ind_nums_area) 
],remainder='passthrough')

huber_pipe = Pipeline([
    ('transf',huber_col_transf),
    ('est',hr())
])
huber_ttr=TransformedTargetRegressor(regressor=huber_pipe,transformer=MinMaxScaler())
min_features=100
huber_pipe_rfecv=RFECV(huber_ttr,min_features_to_select=min_features,
cv=5,scoring='neg_root_mean_squared_error',n_jobs=-1,verbose=3,
importance_getter='regressor_.named_steps.est.coef_')
huber_pipe_rfecv.fit(extra_enc_area_tr_data,log_tr_target)

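ind_nums_area is the list of indices of the features to be transformed by the ColumnTransformer (the feature names are held in the nums_area variable). I am using indices because RFECV converts the training dataset, which is a pandas DataFrame, into a numpy array, so column names cannot be used. Unfortunately, indices are not the way to go either, because RFECV reduces the number of features and the listed indices then become incorrect. One of the features to be transformed by the ColumnTransformer is the last feature of the training data, with index 255, and in that case I get the errors IndexError: index 255 is out of bounds for axis 0 with size 255 and ValueError: all features must be in [0, 254] or [-255, 0].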
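The failure described above can be reproduced on a small scale. The following is a minimal sketch, not the original setup: the 4-column random array and the index list [1, 3] are made up for illustration, and np.delete stands in for one elimination step of RFECV.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_full = rng.normal(size=(20, 4))  # 4 features, valid positions 0..3

# Integer positions fixed up front, like ind_nums_area in the question.
ct = ColumnTransformer([("prep", StandardScaler(), [1, 3])],
                       remainder="passthrough")
ct.fit(X_full)  # works: position 3 exists

# Stand-in for one elimination step: the array now has only 3 features.
X_reduced = np.delete(X_full, 0, axis=1)

try:
    ct.fit(X_reduced)  # position 3 no longer exists
    stale_index_error = None
except (ValueError, IndexError) as exc:
    stale_index_error = exc

print(type(stale_index_error).__name__, stale_index_error)
```

Even when no exception is raised, a fixed position can silently point at a different feature after an elimination step. In this sketch, position 1 of X_reduced holds what was column 2 of X_full.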

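Do you have any suggestions on how to solve this? Is there a simple way to find out which features were removed in each RFECV iteration and adjust the indices accordingly? Or perhaps you can suggest how to prevent RFECV from converting the DataFrame into a numpy array? Otherwise, do you know of an alternative to sklearn's RFECV that does not convert the training DataFrame into a numpy array?

I do not want to transform all the data before passing it to the estimator, because that would leak scaler information into the test folds.

How can this be handled without leaking data?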
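For context on the leakage concern: when the scaler is a step of the Pipeline, cross-validation refits it on the training part of every fold. The sketch below illustrates this under assumed conditions. The data is synthetic and the step names "scaler" and "est" are illustrative. It compares the scaler means learned per fold with the mean of the full dataset.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

pipe = Pipeline([("scaler", StandardScaler()), ("est", HuberRegressor())])

# return_estimator=True keeps the pipeline fitted on each training fold.
cv_result = cross_validate(pipe, X, y, cv=KFold(n_splits=5),
                           return_estimator=True)

# Each fold's scaler only saw that fold's training rows.
fold_means = np.array([est.named_steps["scaler"].mean_
                       for est in cv_result["estimator"]])
global_mean = X.mean(axis=0)

print(fold_means)
print(global_mean)
```

The per-fold means differ from the global mean, which shows that the held-out rows of each fold were not used to fit the scaler.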

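【Question discussion】:

Tags: python pandas numpy scikit-learn feature-selection

【Solution 1】:

I tried to solve this problem in a similar way for some time, until I got tired of scikit-learn's complicated internal conversions and decided to write my own rfecv that works with pipelines and transformers.

Basically, I implemented the algorithm from Guyon, Isabelle, et al. "Gene selection for cancer classification using support vector machines." Machine learning 46.1 (2002): 389-422, which is also the basis of the scikit-learn implementation.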

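from time import time

import numpy as np
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline


def rfecv(X, y, estimator,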
          min_features_to_select=3, 
          splits=3,
          step=3,
          scoring_metric="f1",
          scoring_decimals=3,
          random_state=None):
    """
    This method is an implementation of the recursive feature elimination algorithm,
    which eliminates unnecessary features. As scikit-learn only provides an RFECV 
    version [1] that makes using Pipelines very difficult, we have implemented our 
    own version based on the original paper [2].
    
    [1] https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html
    [2] Guyon, Isabelle, et al. "Gene selection for cancer classification using support vector machines." 
        Machine learning 46.1 (2002): 389-422.

    :X: a DataFrame containing the features.
    :y: a Series containing the labels.
    :estimator: a scikit-learn estimator or a Pipeline. If a pipeline is passed,
        the last element of the pipeline is assumed to be a classifier providing
        a feature_importances_ attribute.
    :min_features_to_select: the minimum number of features to evaluate.
    :splits: number of splits to use for cross validation.
    :step: the number of features to be removed during each step.
    :scoring_metric: the scoring metric to use for evaluation (e.g., "f1" or 
        a callable implementing the sklearn scoring interface).
    :scoring_decimals: the scoring metric can be rounded to N decimals to avoid 
        the reduction from getting stuck with a larger number of features with
        very small score gains. Defaults to 3 digits. If None is passed, full
        scoring precision is used.
    :random_state: if not None, this is the seed for all RNGs used in this function.
        
    :returns: best_features, best_score, ranks, scores; best_features is a list
        of features, best_score is the mean score achieved with these features over the
        folds, ranks is the order of eliminated features (from most relevant to most irrelevant),
        scores is the list of mean scores for each step achieved during the feature 
        elimination across all folds.
    """
    # Initialize survivors and ranked list
    survivors = list(X.columns)
    ranks = []
    scores = []
    
    # While the survivor list is longer than min_features_to_select
    while len(survivors) >= min_features_to_select:
                
        # Get only the surviving features
        X_tmp = X[survivors]
        
        # Train and get the scores, cross_validate clones 
        # the model internally, so this does not modify
        # the estimator passed to this function
        print("[%.2f] evaluating %i features ..." % (time(), len(X_tmp.columns)))
        cv_result = cross_validate(estimator, X_tmp, y,
                                   cv=KFold(n_splits=splits, 
                                            shuffle=True, 
                                            random_state=random_state),
                                   scoring=scoring_metric,
                                   return_estimator=True)
        
        # Append the mean performance to the list of scores
        score = np.mean(cv_result["test_score"])
        if scoring_decimals is None:
            scores.append(score)
        else:
            scores.append(round(score, scoring_decimals))            
        print("[%.2f] ... score %f." % (time(), scores[-1]))
        
        # Get feature weights from the model fitted 
        # on the best fold and square the weights as described 
        # in the paper. If the estimator is a Pipeline,
        # we get the weights from the last element.
        best_estimator = cv_result["estimator"][np.argmax(cv_result["test_score"])]
        if isinstance(best_estimator, Pipeline):
            weights = best_estimator[-1].feature_importances_
        else:
            weights = best_estimator.feature_importances_
        weights = list(np.power(weights, 2))
                
        # Remove step features (but respect min_features_to_select)
        for _ in range(max(min(step, len(survivors) - min_features_to_select), 1)):
            
            # Find the feature with the smallest ranking criterion
            # and update the ranks and survivors
            idx = np.argmin(weights)
            ranks.insert(0, survivors.pop(idx))
            weights.pop(idx)
            
    # Calculate the best set of surviving features
    ranks_reverse = list(reversed(ranks))
    last_max_idx = len(scores) - np.argmax(list(reversed(scores))) - 1
    removed_features = set(ranks_reverse[0:last_max_idx * step])
    best_features = [f for f in X.columns if f not in removed_features]
    
    # Return ranks and scores
    return best_features, max(scores), ranks, scores

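Everything you need to know is documented in the docstring. The only exception is how to interpret the returned ranks and scores lists. With a step of 1, the score scores[i] is obtained by removing all features in list(reversed(ranks))[0:i] (because ranks lists the eliminated features ordered from most relevant to most irrelevant).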
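The relationship between ranks and scores for a step of 1 can be sketched with a small helper. The function name features_at_step and the toy feature names are hypothetical and are not part of the answer's code.

```python
def features_at_step(all_features, ranks, i):
    """Return the features still in use when scores[i] was computed (step=1).

    ranks is ordered from most relevant to most irrelevant, so reversing it
    gives the order in which features were eliminated.
    """
    removed = set(list(reversed(ranks))[0:i])
    return [f for f in all_features if f not in removed]


all_features = ["a", "b", "c", "d"]
# Hypothetical result: "c" was eliminated first, then "a", then "d", then "b".
ranks = ["b", "d", "a", "c"]

for i in range(len(all_features)):
    print(i, features_at_step(all_features, ranks, i))
```

So scores[0] belongs to the full feature set, scores[1] to the set without the first eliminated feature, and so on.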

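A minimal working example with a DecisionTree looks like this (but it of course also works with pipelines and transformers, as long as the classifier is the last element of the pipeline):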

Python 3.9.1 (default, Dec 11 2020, 14:32:07) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.tree import DecisionTreeClassifier
>>> test_data = load_breast_cancer(as_frame=True)
>>> clf = DecisionTreeClassifier(random_state=0)
>>> clf.fit(test_data.data, test_data.target)
DecisionTreeClassifier(random_state=0)
>>> best_features, best_score, _, _ = rfecv(test_data.data, test_data.target, clf, step=1, min_features_to_select=1, random_state=0)
[1626774242.35] evaluating 30 features ...
[1626774242.38] ... score 0.944000.
[1626774242.38] evaluating 29 features ...
[1626774242.42] ... score 0.938000.
[1626774242.42] evaluating 28 features ...
[1626774242.47] ... score 0.948000.
[1626774242.47] evaluating 27 features ...
[1626774242.50] ... score 0.934000.
[1626774242.51] evaluating 26 features ...
[1626774242.54] ... score 0.938000.
[1626774242.54] evaluating 25 features ...
[1626774242.58] ... score 0.939000.
[1626774242.58] evaluating 24 features ...
[1626774242.62] ... score 0.941000.
[1626774242.62] evaluating 23 features ...
[1626774242.65] ... score 0.944000.
[1626774242.65] evaluating 22 features ...
[1626774242.68] ... score 0.953000.
[1626774242.68] evaluating 21 features ...
[1626774242.70] ... score 0.940000.
[1626774242.70] evaluating 20 features ...
[1626774242.72] ... score 0.941000.
[1626774242.72] evaluating 19 features ...
[1626774242.75] ... score 0.943000.
[1626774242.75] evaluating 18 features ...
[1626774242.77] ... score 0.942000.
[1626774242.77] evaluating 17 features ...
[1626774242.79] ... score 0.944000.
[1626774242.79] evaluating 16 features ...
[1626774242.80] ... score 0.945000.
[1626774242.80] evaluating 15 features ...
[1626774242.82] ... score 0.935000.
[1626774242.82] evaluating 14 features ...
[1626774242.84] ... score 0.935000.
[1626774242.84] evaluating 13 features ...
[1626774242.86] ... score 0.947000.
[1626774242.86] evaluating 12 features ...
[1626774242.87] ... score 0.950000.
[1626774242.87] evaluating 11 features ...
[1626774242.89] ... score 0.950000.
[1626774242.89] evaluating 10 features ...
[1626774242.91] ... score 0.944000.
[1626774242.91] evaluating 9 features ...
[1626774242.92] ... score 0.948000.
[1626774242.92] evaluating 8 features ...
[1626774242.94] ... score 0.953000.
[1626774242.94] evaluating 7 features ...
[1626774242.95] ... score 0.953000.
[1626774242.95] evaluating 6 features ...
[1626774242.97] ... score 0.949000.
[1626774242.97] evaluating 5 features ...
[1626774242.98] ... score 0.951000.
[1626774242.98] evaluating 4 features ...
[1626774243.00] ... score 0.947000.
[1626774243.00] evaluating 3 features ...
[1626774243.01] ... score 0.950000.
[1626774243.01] evaluating 2 features ...
[1626774243.02] ... score 0.942000.
[1626774243.02] evaluating 1 features ...
[1626774243.03] ... score 0.911000.
>>> print(best_features, best_score)
['area error', 'smoothness error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst concavity', 'worst concave points'] 0.953
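For the pipeline case mentioned above, a single elimination step might look like the following sketch. It repeats the core of the loop (cross_validate with return_estimator=True, then reading feature_importances_ from the last pipeline element) outside the function. The step names "scaler" and "clf" are illustrative. Mapping weights back to column names is only valid here because StandardScaler keeps the number and order of columns.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", DecisionTreeClassifier(random_state=0))])

# The DataFrame is passed as-is; the scaler is refit on every training fold.
cv_result = cross_validate(pipe, X, y,
                           cv=KFold(n_splits=3, shuffle=True, random_state=0),
                           scoring="f1", return_estimator=True)

# Take the pipeline fitted on the best fold and read the importances
# from its last element, then square them as in the answer's code.
best = cv_result["estimator"][np.argmax(cv_result["test_score"])]
weights = np.power(best[-1].feature_importances_, 2)

# Candidate for elimination: the column with the smallest squared weight.
weakest = X.columns[np.argmin(weights)]
print(weakest)
```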

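Regards,

Matthias

【Discussion】:

Matthias, thanks for pointing me towards a solution! You helped me a lot. Could you show us how the importance_getter callable is defined? I can't figure that out. Please see the code I came up with, posted below as an answer. If you have any comments, please let me know.
Hi Wlodek, I am currently building a production version of the code above (I found that it still has problems). I will post it as soon as it is finished.
Awesome! It would be cool to see it. You can take a look at my code. I have corrected some errors in your code.
I just updated the post with my current state as of today. I tested it on a large ML model I am currently working on, and it seems to do the job. However, the code is not yet covered by tests. There may still be bugs ;) Regarding your question about importance_getter, I switched to a different approach.
Added my final tested version.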

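【Solution 2】:

This is the code I came up with based on Matthias's original answer, before it was edited. Keep in mind that, since the ColumnTransformer changes the order of the columns, the columns of the training data must be in the same order as the ColumnTransformer output; otherwise this code does not work. I am still trying to figure out how to make importance_getter a parameter of the function. If you have any suggestions on how to do that, please let me know.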

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_validate


def df_rfecv(X, y, estimator,
          min_features_to_select=240, 
          splits=5, 
          scoring="neg_root_mean_squared_error"):
    
    # Initialize survivors, ranks, scores, indexes
    survivors = list(X.columns)
    ranks = ['None removed']
    scores = []
    indexes = []
    
    for i in range (len (X.columns),min_features_to_select-1,-1 ):    
        # Get only the surviving features
        X_tmp = X[survivors]
        
                
        # Train and get the scores. The estimator passed in is expected to be
        # a TransformedTargetRegressor wrapping a Pipeline whose final step
        # is named 'est' and exposes coef_.
        cr_val = cross_validate(estimator, X_tmp, y,
                               cv=KFold(n_splits=splits),
                               scoring=scoring, return_estimator=True)
        scores.append(np.mean(cr_val["test_score"]))
        
        # Get squared feature weights, averaged over the fold estimators
        all_coefs = [fitted.regressor_.named_steps['est'].coef_
                     for fitted in cr_val['estimator']]
        mean_coefs = np.mean(all_coefs, axis=0)
        weights = np.power(mean_coefs, 2)
        
        # Find the feature with the smallest weight
        idx = np.argmin(weights)
        indexes.append(i)
        # Update ranks and survivors
        if i < len(X.columns):   
            try:
                # Try to remove the deleted feature from the module-level list
                # of columns handled by the ColumnTransformer (prep_trans_cols)
                global prep_trans_cols
                prep_trans_cols.remove(survivors[idx])
            except (NameError, ValueError):
                pass
        if i > min_features_to_select:    
            ranks.append(survivors.pop(idx))
        
    # Return ranks, scores in dataframe with indices indicating the number of features
    return pd.DataFrame({'Feature removed':ranks,'Scores':scores},index=indexes)
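One possible alternative to maintaining the global prep_trans_cols list is to give the ColumnTransformer a callable as its column specification, since a callable is evaluated against the data at fit time. The sketch below assumes the data stays a DataFrame, which holds for the cross_validate-based loop above. The class name SurvivingColumns and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler


class SurvivingColumns:
    """Hypothetical helper: select the wanted columns still present in X."""

    def __init__(self, wanted):
        self.wanted = list(wanted)

    def __call__(self, X):
        return [c for c in self.wanted if c in X.columns]


rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(loc=3.0, size=(10, 4)),
                 columns=["a", "b", "c", "d"])

# Scale "a" and "d" when present; pass everything else through unchanged.
ct = ColumnTransformer(
    [("prep", StandardScaler(), SurvivingColumns(["a", "d"]))],
    remainder="passthrough")

out_full = ct.fit_transform(X)                          # "a" and "d" scaled
out_reduced = ct.fit_transform(X.drop(columns=["d"]))   # only "a" scaled

print(out_full.shape, out_reduced.shape)
```

The output still places the transformed columns first and the passthrough columns after them, so the ordering caveat from this answer continues to apply when coefficients are mapped back to feature names.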

【Discussion】: