【Question title】: Feature selection and prediction
【Posted】: 2018-12-27 04:44:05
【Question】:
from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris

I have X and Y data.

data = load_iris()    
X = data.data
Y = data.target 

I want to use k-fold validation to perform RFECV feature selection and prediction.

Code corrected following the answer from @https://stackoverflow.com/users/3374996/vivek-kumar

clf = RandomForestClassifier()

kf = KFold(n_splits=2, shuffle=True, random_state=0)  

estimators = [('standardize' , StandardScaler()),
              ('clf', clf)]

class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_ 

pipeline = Mypipeline(estimators)

rfecv = RFECV(estimator=pipeline, cv=kf, scoring='accuracy', verbose=10)
rfecv_data = rfecv.fit(X, Y)

print ('no. of selected features =', rfecv_data.n_features_) 

Edit (the small remaining part):

X_new = rfecv.transform(X)
print(X_new.shape)

y_predicts = cross_val_predict(clf, X_new, Y, cv=kf)
accuracy = accuracy_score(Y, y_predicts)
print ('accuracy =', accuracy)

【Comments】:

    Tags: scikit-learn


    【Solution 1】:

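    Instead of wrapping StandardScaler and RFECV in the same pipeline, wrap StandardScaler and RandomForestClassifier in a pipeline and pass that pipeline to RFECV as the estimator. This way the scaler is refitted inside each fold, so no training information is leaked.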

    estimators = [('standardize' , StandardScaler()),
                  ('clf', RandomForestClassifier())]
    
    pipeline = Pipeline(estimators)
    
    
    rfecv = RFECV(estimator=pipeline, scoring='accuracy')
    rfecv_data = rfecv.fit(X, Y)
    

    Update: regarding the error 'RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes'

    Yes, this is a known issue with scikit-learn pipelines. You can look at my other answer here for more details and use the new pipeline I created there.

    Define a custom pipeline like this:

    class Mypipeline(Pipeline):
        @property
        def coef_(self):
            return self._final_estimator.coef_
        @property
        def feature_importances_(self):
            return self._final_estimator.feature_importances_ 
    

    Then use it:

    pipeline = Mypipeline(estimators)
    
    rfecv = RFECV(estimator=pipeline, scoring='accuracy')
    rfecv_data = rfecv.fit(X, Y)
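
As a possible alternative to the subclass, scikit-learn 0.24 and later accept an `importance_getter` argument on RFECV, which can point at an attribute of a pipeline step. The following is only a minimal sketch, assuming such a version is installed and using the iris data and the 2-fold KFold from the question; the `random_state` on the forest is my own addition for reproducibility and is not part of the original code.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, Y = load_iris(return_X_y=True)
kf = KFold(n_splits=2, shuffle=True, random_state=0)

# Plain Pipeline, no subclass needed.
pipeline = Pipeline([
    ('standardize', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=0)),  # random_state is an assumption
])

# Tell RFECV where to read the importances: the 'clf' step of the pipeline.
rfecv = RFECV(
    estimator=pipeline,
    cv=kf,
    scoring='accuracy',
    importance_getter='named_steps.clf.feature_importances_',
)
rfecv.fit(X, Y)

print('no. of selected features =', rfecv.n_features_)
print('selected mask =', rfecv.support_)
```

The step name in the getter string has to match the name used in the pipeline ('clf' here). On older scikit-learn versions this argument does not exist, and the Mypipeline subclass above is still needed.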
    

    Update 2

    @brute, with your data and code, the algorithm finishes within a minute on my PC. Here is the complete code I used:

    import numpy as np
    import glob
    from sklearn.utils import resample
    files = glob.glob('/home/Downloads/Untitled Folder/*') 
    outs = [] 
    for fi in files: 
        data = np.genfromtxt(fi, delimiter='|', dtype=float) 
        data = data[~np.isnan(data).any(axis=1)] 
        data = resample(data, replace=False, n_samples=1800, random_state=0) 
        outs.append(data) 
    
    X = np.vstack(outs) 
    print(X.shape)
    Y = np.repeat([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 1800) 
    print(Y.shape)
    
    #from sklearn.utils import shuffle
    #X, Y = shuffle(X, Y, random_state=0)
    
    from sklearn.feature_selection import RFECV
    from sklearn.model_selection import KFold
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline
    
    clf = RandomForestClassifier()
    
    kf = KFold(n_splits=10, shuffle=True, random_state=0)  
    
    estimators = [('standardize' , StandardScaler()),
                  ('clf', RandomForestClassifier())]
    
    class Mypipeline(Pipeline):
        @property
        def coef_(self):
            return self._final_estimator.coef_
        @property
        def feature_importances_(self):
            return self._final_estimator.feature_importances_ 
    
    pipeline = Mypipeline(estimators)
    
    rfecv = RFECV(estimator=pipeline, scoring='accuracy', verbose=10)
    rfecv_data = rfecv.fit(X, Y)
    
    print ('no. of selected features =', rfecv_data.n_features_) 
    

    Update 3: for cross_val_predict

    X_new = rfecv.transform(X)
    print(X_new.shape)
    
    # Here change clf to pipeline, 
    # because RFECV has found features according to scaled data,
    # which is not present when you pass clf 
    y_predicts = cross_val_predict(pipeline, X_new, Y, cv=kf)
    accuracy = accuracy_score(Y, y_predicts)
    print ('accuracy =', accuracy)
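
The snippet above depends on variables defined earlier in the thread. Below is a self-contained sketch of the same steps, assuming the iris data and the 2-fold KFold from the question and the Mypipeline class from this answer; the `random_state` on the forest is an assumption added only to make runs repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class Mypipeline(Pipeline):
    # Expose the final estimator's attributes so RFECV can rank features.
    @property
    def coef_(self):
        return self._final_estimator.coef_

    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_


X, Y = load_iris(return_X_y=True)
kf = KFold(n_splits=2, shuffle=True, random_state=0)

pipeline = Mypipeline([
    ('standardize', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=0)),  # random_state is an assumption
])

# Select features with the scaled pipeline as the estimator.
rfecv = RFECV(estimator=pipeline, cv=kf, scoring='accuracy')
rfecv.fit(X, Y)

# Keep only the selected columns.
X_new = rfecv.transform(X)
print(X_new.shape)

# Predict with the pipeline (not the bare clf) so scaling is applied in each fold.
y_predicts = cross_val_predict(pipeline, X_new, Y, cv=kf)
accuracy = accuracy_score(Y, y_predicts)
print('accuracy =', accuracy)
```

Keep in mind that the features here were selected using all of X, so the cross-validated accuracy on X_new can be somewhat optimistic.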
    

    【Discussion】:

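    • This is an overly complicated and unnecessary hack, IMHO.
    • @EkabaBisong It may be a bit complicated, but it is not unnecessary. It is done to prevent data leakage.
    【Solution 2】: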

    We will do it like this:

    Fit on the training set

    from sklearn.feature_selection import RFECV
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import cross_val_predict, KFold
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    
    data = load_iris()    
    X, Y = data.data, data.target
    
    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, Y, shuffle=True)
    
    # create model
    clf = RandomForestClassifier()    
    # instantiate K-Fold
    kf = KFold(n_splits=10, shuffle=True, random_state=0)
    
    # pipeline estimators
    estimators = [('standardize' , StandardScaler()),
                 ('rfecv', RFECV(estimator=clf, cv=kf, scoring='accuracy'))]
    
    # instantiate pipeline
    pipeline = Pipeline(estimators)    
    # fit rfecv to train model
    rfecv_model = pipeline.fit(X_train, y_train)
    
    # print number of selected features
    print ('no. of selected features =', pipeline.named_steps['rfecv'].n_features_)
    # print feature ranking
    print ('ranking =', pipeline.named_steps['rfecv'].ranking_)
    
    'Output':
    no. of selected features = 3
    ranking = [1 2 1 1]
    

    Predict on the test set

    # make predictions on the test set
    predictions = rfecv_model.predict(X_test)
    
    # evaluate the model performance using accuracy metric
    print("Accuracy on test set: ", accuracy_score(y_test, predictions))
    
    'Output':
    Accuracy on test set:  0.9736842105263158
    

    【Discussion】:

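    • @brute No. This code does not use StandardScaler anywhere. You only define it inside the pipeline, but it is never used (never fitted anywhere). When you call pipeline.named_steps['rfecv'].fit(X_train, y_train), you are running RFECV directly on the original data, not on the scaled data.
    • @brute. The code has been updated. It now correctly uses the pipeline to scale the data and run RFE.
    • @VivekKumar. Please define data leakage. You are wrong. I really don't care what rfecv does with the training data x_train. What matters here is that we first split the dataset into a training set and a test set with train_test_split. We fit rfecv on the train set and predict on the test set. No data from the test set leaks into the train set. Don't confuse the OP.
    • That is exactly the problem. RFECV will split X_train again into train and test parts (using the cv folds), but the data was scaled before that split. The model trained inside RFECV therefore already knows something about the inner test data, because the scaling was computed with it (I am talking about the inner train and test). You then conclude that a certain number of features is important, and that conclusion is biased.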
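
To make the last comment concrete, here is a rough sketch (not taken from either answer) that compares a StandardScaler fitted once on all rows with one fitted only on the inner training rows of each fold, which is what a pipeline inside RFECV would do. It assumes the iris data and the 10-fold KFold used in Solution 2; the names `global_scaler`, `fold_scaler` and `diffs` are made up for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

# Scaler fitted once on everything, before any CV split (the pattern being criticised).
global_scaler = StandardScaler().fit(X)

diffs = []
for fold, (inner_train, inner_test) in enumerate(kf.split(X)):
    # Scaler fitted only on the inner training rows of this fold.
    fold_scaler = StandardScaler().fit(X[inner_train])
    # Any gap between the two means comes from the held-out rows.
    diff = np.abs(global_scaler.mean_ - fold_scaler.mean_).max()
    diffs.append(diff)
    print(f"fold {fold}: max |mean difference| = {diff:.4f}")
```

A non-zero difference shows that the globally fitted scaler carries information from the rows that each fold is supposed to hold out. How much this changes the selected features depends on the data; on a small, clean set like iris the effect is likely minor.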