【Question title】:Scikit-learn - feature reduction using RFECV and GridSearch. Where are the coefficients stored?
【Posted】:2015-09-12 13:52:15
【Question】:

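I am using Scikit-learn's RFECV to select the most significant features for a logistic regression via cross-validation. Assume X is an [n, x] dataframe of features and y is the response variable: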

# Note: sklearn.grid_search and sklearn.cross_validation are the module names
# from the time of posting; scikit-learn 0.18 moved them to sklearn.model_selection.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold
from sklearn import preprocessing
from sklearn.feature_selection import RFECV
import sklearn.linear_model as lm

#  Create a logistic regression estimator 
logreg = lm.LogisticRegression()

# Use RFECV to pick best features, using Stratified Kfold
rfecv = RFECV(estimator=logreg, cv=StratifiedKFold(y, 3), scoring='roc_auc')

# Fit the features to the response variable
rfecv.fit(X, y)

# Put the best features into new df X_new
X_new = rfecv.transform(X)

# Build a pipeline: standardize the selected features, then fit logistic regression
pipe = make_pipeline(preprocessing.StandardScaler(), lm.LogisticRegression())

# Define a range of hyper parameters for grid search
C_range = 10.**np.arange(-5, 1)
penalty_options = ['l1', 'l2']

skf = StratifiedKFold(y, 3)
param_grid = dict(logisticregression__C=C_range, logisticregression__penalty=penalty_options)

grid = GridSearchCV(pipe, param_grid, cv=skf, scoring='roc_auc')

grid.fit(X_new, y) 

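Two questions:

a) Is this the correct process for feature selection, hyperparameter selection, and fitting?

b) Where can I find the fitted coefficients for the selected features?

【Question discussion】:

    Tags: python scikit-learn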


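    【Solution 1】:

    Is this the correct process for feature selection? It is one of many ways to do feature selection. Recursive feature elimination is an automated approach; others are listed in the scikit-learn documentation. They have different pros and cons, and feature selection usually works best when it combines common sense with trying models on different feature sets. RFE is a quick way to select a good set of features, but it will not necessarily give you the best possible one. By the way, you don't need to build the StratifiedKFold separately. If you just set cv=3, both RFECV and GridSearchCV will automatically use StratifiedKFold when the estimator is a classifier and the y values are binary or multiclass, which I assume is the case here since you are fitting a logistic regression. You can also combine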

    # Fit the features to the response variable
    rfecv.fit(X, y)
    
    # Put the best features into new df X_new
    X_new = rfecv.transform(X)
    

    into

    X_new = rfecv.fit_transform(X, y)
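    As a minimal sketch of what the combined call does, the snippet below runs RFECV with cv=3 on synthetic data (the data and the max_iter setting are assumptions for illustration, not part of the question) and uses the current sklearn.model_selection-era API. It also shows that "transform" here simply keeps the columns flagged in the fitted selector's support_ mask.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the asker's X and y (assumption: binary target).
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           n_redundant=0, random_state=0)

# An integer cv with a classifier gives a stratified 3-fold split.
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000), cv=3, scoring='roc_auc')

# Fit the selector and reduce X to the selected columns in one call.
X_new = rfecv.fit_transform(X, y)

print(rfecv.n_features_)  # number of features kept
print(rfecv.support_)     # boolean mask over the original columns
print(X_new.shape)        # (n_samples, n_features_)

# transform() is plain column selection with that mask.
same = (X_new == X[:, rfecv.support_]).all()
print(same)
```

    The number of features kept is whichever count gave the best cross-validated score, so there is no fixed threshold to set by hand.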
    

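    Is this the correct process for choosing hyperparameters? GridSearchCV is basically an automated way of systematically trying a whole set of model parameter combinations and picking the best one according to a performance metric. It is a good way of finding well-suited parameters, yes.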
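    To make the "try every combination, keep the best" behaviour concrete, here is a small sketch on synthetic data. The dataset, the C range, and max_iter are illustrative assumptions; only C is searched to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the selected features (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {'logisticregression__C': 10.0 ** np.arange(-3, 2)}

grid = GridSearchCV(pipe, param_grid, cv=3, scoring='roc_auc')
grid.fit(X, y)

# One cross-validated mean score per candidate parameter set.
results = grid.cv_results_
for params, score in zip(results['params'], results['mean_test_score']):
    print(params, round(float(score), 4))

# The winner is the candidate with the highest mean score.
print(grid.best_params_, grid.best_score_)
```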

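    Is this the correct process for fitting? Yes, this is a valid way of fitting the model. When you call grid.fit(X_new, y), it builds a grid of LogisticRegression estimators (one per parameter combination tried) and fits each of them. It keeps the best-performing one under grid.best_estimator_, that estimator's parameters under grid.best_params_, and its performance score under grid.best_score_. Note that fit returns the GridSearchCV object itself, not the best estimator. Remember that for any new incoming X values you want to predict on, you must first apply the transform of the fitted RFECV model. You can therefore add this step to the pipeline as well.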
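    One way to fold the RFECV step into the pipeline, so that new rows are scaled and reduced automatically at prediction time, might look like the sketch below. The step names ('scale', 'select', 'clf'), the synthetic data, and the small C grid are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
# X_incoming plays the role of new rows that arrive after training.
X_train, X_incoming, y_train, _ = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('select', RFECV(LogisticRegression(max_iter=1000), cv=3, scoring='roc_auc')),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Only the final classifier's C is tuned here; 'clf__C' addresses that step.
grid = GridSearchCV(pipe, {'clf__C': [0.01, 0.1, 1.0]}, cv=3, scoring='roc_auc')
grid.fit(X_train, y_train)

# No manual rfecv.transform needed: the fitted pipeline applies every step.
proba = grid.predict_proba(X_incoming)
n_kept = grid.best_estimator_.named_steps['select'].n_features_
print(proba.shape, n_kept)
```

    A side effect of this layout is that feature selection is refit inside every grid-search fold, so the selection never sees the fold's validation rows.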

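    Where can I find the fitted coefficients for the selected features? The grid.best_estimator_ attribute holds the refit best model. When the grid search wraps a bare LogisticRegression, grid.best_estimator_.coef_ has all the coefficients (and grid.best_estimator_.intercept_ is the intercept). When it wraps a pipeline, as in the question, best_estimator_ is the Pipeline, so the coefficients live on its final step, e.g. grid.best_estimator_.named_steps['logisticregression'].coef_. Note that for grid.best_estimator_ to be available, the refit parameter on GridSearchCV needs to be True, but that is the default anyway.

    【Comments】: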
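    As a hedged sketch of reading the coefficients back out when the grid search wraps a pipeline, the snippet below reproduces the question's two-stage flow on synthetic data (the data, C range, and max_iter are assumptions) and pairs each coefficient with the original column index it belongs to.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           n_redundant=0, random_state=0)

# Stage 1: feature selection, as in the question.
rfecv = RFECV(LogisticRegression(max_iter=1000), cv=3, scoring='roc_auc')
X_new = rfecv.fit_transform(X, y)

# Stage 2: grid search over a scaler + logistic regression pipeline.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = GridSearchCV(pipe, {'logisticregression__C': 10.0 ** np.arange(-3, 1)},
                    cv=3, scoring='roc_auc')
grid.fit(X_new, y)

# best_estimator_ is the whole Pipeline; the coefficients sit on its last step.
best_logreg = grid.best_estimator_.named_steps['logisticregression']
coefs = best_logreg.coef_.ravel()

# Map each coefficient back to its column index in the original X.
selected_columns = np.flatnonzero(rfecv.support_)
for col, coef in zip(selected_columns, coefs):
    print(f"original column {col}: coefficient {coef:+.4f}")
print("intercept:", best_logreg.intercept_[0])
```

    Because the pipeline standardizes the inputs, these coefficients are on the scaled feature space, not the raw units of X.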

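    • Thank you very much, this is very helpful. One thing I don't understand is the need for the transform: if it selects n features, what exactly is being "transformed"? (By the way, I'm not sure how it determines n; there must be some threshold.) The heuristic I'm working with is that RFECV selects the n best features and discards the rest...
    • Regarding my question above, I get an error when I try to look at coef_ as you describe: 'Pipeline' object has no attribute 'coef_'. I would also love to know why you say that Stratified K-Fold is chosen for any classification problem (which does seem to be the case): I thought KFold was the default and that stratified K-Fold was for imbalanced classes (which I have).
    【Solution 2】: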
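    On the stratification point raised in the comment above, a quick check with the current API is sketched below. It uses sklearn.model_selection.check_cv, the helper that resolves an integer cv into a splitter; the toy label arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import check_cv

# Imbalanced binary labels versus a continuous target (toy data).
y_class = np.array([0] * 90 + [1] * 10)
y_cont = np.linspace(0.0, 1.0, 100)

cv_clf = check_cv(3, y_class, classifier=True)
cv_reg = check_cv(3, y_cont, classifier=False)

print(type(cv_clf).__name__)  # StratifiedKFold
print(type(cv_reg).__name__)  # KFold

# Each stratified test fold keeps roughly the overall 10% minority share.
dummy_X = np.zeros((len(y_class), 1))
minority_share = [float(y_class[test].mean()) for _, test in cv_clf.split(dummy_X, y_class)]
print(minority_share)
```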

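    Basically, you need a train-validation-test split of your sample data. The training set is used to fit the ordinary model parameters, the validation set is used to tune the hyperparameters in the grid search, and the test set is used for performance evaluation. Here is one way to do this.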

    from sklearn.datasets import make_classification
    from sklearn.pipeline import make_pipeline
    from sklearn.grid_search import GridSearchCV
    from sklearn.cross_validation import StratifiedKFold
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    import numpy as np
    import pandas as pd
    
    
    # simulate some artificial data so that I can show you the result of each intermediate step
    # 1000 obs, X dim 1000-by-100, 2 different y labels with unbalanced weights
    X, y = make_classification(n_samples=1000, n_features=100, n_informative=5, n_classes=2, weights=[0.1, 0.9])
    
    X.shape
    
    Out[78]: (1000, 100)
    
    y.shape
    
    Out[79]: (1000,)
    
    # Nested Cross-Validation, this returns a train/test index iterator
    split = StratifiedKFold(y, n_folds=5, shuffle=True, random_state=1)
    # to take a look at the split, you will see it has 5 tuples
    list(split)
    # the 1st fold
    train_index = list(split)[0][0]
    
    Out[80]: array([  0,   1,   2, ..., 997, 998, 999])
    
    test_index = list(split)[0][1]
    
    Out[81]: array([  5,  12,  17, ..., 979, 982, 984])
    
    # let's play with just one iteration for now
    # your pipe
    pipe = make_pipeline(StandardScaler(), LogisticRegression())
    
    # set up params
    params_space = dict(logisticregression__C=10.0**np.arange(-5,1),
                        logisticregression__penalty=['l1', 'l2'],
                        logisticregression__class_weight=[None, 'auto'])
    
    # apply your grid search only in train data but with a further cv step
    # so original train set has [gscv_train, gscv_validation] where the latter is used to tune hyperparameters
    # all performance is still evaluated in a separated held-out 'test' set
    grid = GridSearchCV(pipe, params_space, cv=StratifiedKFold(y[train_index], n_folds=3), scoring='roc_auc')
    # fit the data on train set
    grid.fit(X[train_index], y[train_index])
    
    # to get the params of your estimator, call your gscv
    grid.best_estimator_
    Out[82]: 
    Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=0.10000000000000001, class_weight=None, dual=False,
              fit_intercept=True, intercept_scaling=1, max_iter=100,
              multi_class='ovr', penalty='l1', random_state=None,
              solver='liblinear', tol=0.0001, verbose=0))])
    
    
    # the performance in validation set
    grid.grid_scores_
    Out[83]: 
    [mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
     mean: 0.87975, std: 0.01753, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
     mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
     mean: 0.87985, std: 0.01746, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
     mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
     mean: 0.88033, std: 0.01707, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
     mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
     mean: 0.87975, std: 0.01732, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
     mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
     mean: 0.88245, std: 0.01732, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
     mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
     mean: 0.87955, std: 0.01686, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
     mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
     mean: 0.88746, std: 0.02318, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
     mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
     mean: 0.87990, std: 0.01634, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
     mean: 0.94002, std: 0.02959, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
     mean: 0.87419, std: 0.02174, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
     mean: 0.93508, std: 0.03101, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
     mean: 0.87091, std: 0.01860, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
     mean: 0.88013, std: 0.03246, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
     mean: 0.85247, std: 0.02712, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
     mean: 0.88904, std: 0.02906, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
     mean: 0.85197, std: 0.02097, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'}]
    
    
    # or the best score among them
    grid.best_score_
    Out[84]: 0.94002188482393367
    
    # now after finishing training the estimator, we now predict in test set
    y_pred = grid.predict(X[test_index])
    # since LogisticRegression is a probability-based model, we have the luxury of getting the probability for each obs
    y_pred_probs = grid.predict_proba(X[test_index])
    
    Out[87]: 
    array([[ 0.0632,  0.9368],
           [ 0.0236,  0.9764],
           [ 0.0227,  0.9773],
           ..., 
           [ 0.0108,  0.9892],
           [ 0.2903,  0.7097],
           [ 0.0113,  0.9887]])
    
    # to get evaluation result, 
    print(classification_report(y[test_index], y_pred))
    
                 precision    recall  f1-score   support
    
              0       0.93      0.59      0.72        22
              1       0.95      0.99      0.97       179
    
    avg / total       0.95      0.95      0.95       201
    
    
    
    # to put all things together with the nested cross-validation
    # generate a pandas dataframe to store prediction probability
    kfold_df = pd.DataFrame(0.0, index=np.arange(len(y)), columns=np.unique(y))
    report = []  # to store classification reports
    
    split = StratifiedKFold(y, n_folds=5, shuffle=True, random_state=1)
    
    for train_index, test_index in split:
    
        grid = GridSearchCV(pipe, params_space, cv=StratifiedKFold(y[train_index], n_folds=3), scoring='roc_auc')
    
        grid.fit(X[train_index], y[train_index])
    
        y_pred_probs = grid.predict_proba(X[test_index])
        kfold_df.iloc[test_index, :] = y_pred_probs
    
        y_pred = grid.predict(X[test_index])
        report.append(classification_report(y[test_index], y_pred))
    
    # your result
    print(kfold_df)
    
    Out[88]: 
              0       1
    0    0.1710  0.8290
    1    0.0083  0.9917
    2    0.2049  0.7951
    3    0.0038  0.9962
    4    0.0536  0.9464
    5    0.0632  0.9368
    6    0.1243  0.8757
    7    0.1150  0.8850
    8    0.0796  0.9204
    9    0.4096  0.5904
    ..      ...     ...
    990  0.0505  0.9495
    991  0.2128  0.7872
    992  0.0270  0.9730
    993  0.0434  0.9566
    994  0.8078  0.1922
    995  0.1452  0.8548
    996  0.1372  0.8628
    997  0.0127  0.9873
    998  0.0935  0.9065
    999  0.0065  0.9935
    
    [1000 rows x 2 columns]
    
    
    for r in report:
        print(r)
    
                 precision    recall  f1-score   support
    
              0       0.93      0.59      0.72        22
              1       0.95      0.99      0.97       179
    
    avg / total       0.95      0.95      0.95       201
    
                 precision    recall  f1-score   support
    
              0       0.86      0.55      0.67        22
              1       0.95      0.99      0.97       179
    
    avg / total       0.94      0.94      0.93       201
    
                 precision    recall  f1-score   support
    
              0       0.89      0.38      0.53        21
              1       0.93      0.99      0.96       179
    
    avg / total       0.93      0.93      0.92       200
    
                 precision    recall  f1-score   support
    
              0       0.88      0.33      0.48        21
              1       0.93      0.99      0.96       178
    
    avg / total       0.92      0.92      0.91       199
    
                 precision    recall  f1-score   support
    
              0       0.88      0.33      0.48        21
              1       0.93      0.99      0.96       178
    
    avg / total       0.92      0.92      0.91       199
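    For readers on a current scikit-learn release, the same nested cross-validation idea can be sketched more compactly with sklearn.model_selection. This is not the code above but an assumed modern equivalent: splitters take n_splits instead of y, class_weight='balanced' replaces the old 'auto', and only C and class_weight are searched to keep it short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data, similar in spirit to the example above.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_classes=2, weights=[0.1, 0.9], random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
params_space = {
    'logisticregression__C': 10.0 ** np.arange(-3, 1),
    'logisticregression__class_weight': [None, 'balanced'],
}

# Inner loop tunes hyperparameters; outer loop estimates performance.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

grid = GridSearchCV(pipe, params_space, cv=inner_cv, scoring='roc_auc')

# Each outer fold refits the whole grid search on its training part only,
# then scores the tuned model on the held-out part.
outer_scores = cross_val_score(grid, X, y, cv=outer_cv, scoring='roc_auc')
print(outer_scores)
print(outer_scores.mean())
```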
    

    【Comments】:

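    • This is very helpful, thank you for the code! I'm not sure I understand why some data needs to be held out when using CV: I thought the whole point of CV was to avoid holding out data the way train_test_split does.
    • Interesting observation: your code works fine with the generated data, but I needed to use '.iloc' to index my df correctly.
    • @GlennBlasius Thanks for pointing out this potentially inconsistent behavior. I see what you mean: if you are using a pandas.DataFrame, the default integer index attached to the df can lead to unexpected selections when splitting. As you said, to avoid this you can pass your df in but change .loc to .iloc.
    • @GlennBlasius The purpose of CV is to select the best algorithm. It is still an optimization (not a standard convex optimization; here we use an exhaustive grid search instead).
    • The general rule is that if you are optimizing your algorithm, there must be a held-out dataset to evaluate performance (no model selection is done on it, only evaluation, so you should apply just one algorithm to the test set rather than many). The reason for nested CV here is that the model fit optimizes the ordinary parameters on the training set, the validation set is used to choose the best hyperparameters, and the final test set is used for evaluation.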