Grid Search with Recursive Feature Elimination in a scikit-learn pipeline returns an error

【Question Title】: Grid Search with Recursive Feature Elimination in scikit-learn pipeline returns an error
【Posted】: 2016-08-09 13:21:21
【Question】:

I am trying to chain Grid Search and Recursive Feature Elimination in a Pipeline using scikit-learn.

GridSearchCV and RFE with a "bare" classifier work fine:

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

est = SVR(kernel="linear")

selector = feature_selection.RFE(est)
param_grid = dict(estimator__C=[0.1, 1, 10])
clf = GridSearchCV(selector, param_grid=param_grid, cv=10)
clf.fit(X, y)

Putting the classifier in a pipeline returns an error: RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import preprocessing
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

est = SVR(kernel="linear")

std_scaler = preprocessing.StandardScaler()
pipe_params = [('std_scaler', std_scaler), ('clf', est)]
pipe = pipeline.Pipeline(pipe_params)

selector = feature_selection.RFE(pipe)
param_grid = dict(estimator__clf__C=[0.1, 1, 10])
clf = GridSearchCV(selector, param_grid=param_grid, cv=10)
clf.fit(X, y)

EDIT:

I realized that I did not describe the problem clearly. Here is a clearer snippet:

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# This will work
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__C': [1, 10]})
clf.fit(X, y)

# This will not work
est = pipeline.make_pipeline(SVR(kernel="linear"))
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__svr__C': [1, 10]})
clf.fit(X, y)

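As you can see, the only difference is that the estimator is placed in a pipeline. The pipeline, however, hides the "coef_" and "feature_importances_" attributes. The questions are:

Is there a good way to deal with this in scikit-learn?
If not, is this behavior intended for some reason?

EDIT2:

Updated, working snippet based on the answer provided by @Chris: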

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR


class MyPipe(pipeline.Pipeline):

    def fit(self, X, y=None, **fit_params):
        """Calls last elements .coef_ method.
        Based on the sourcecode for decision_function(X).
        Link: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/pipeline.py
        ----------
        """
        super(MyPipe, self).fit(X, y, **fit_params)
        self.coef_ = self.steps[-1][-1].coef_
        return self


X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# Without Pipeline
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__C': [1, 10, 100]})
clf.fit(X, y)
print(clf.grid_scores_)

# With Pipeline
est = MyPipe([('svr', SVR(kernel="linear"))])
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__svr__C': [1, 10, 100]})
clf.fit(X, y)
print(clf.grid_scores_)

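【Comments】:

I would inspect the source code to check the chain of events that leads to the RuntimeError. There is a good chance you can override the attributes of the relevant returned object and simply add the variables back, for example if they are the same as when returned from SVR(). In any case, make_pipeline() probably does not return the same type of object as SVR().

Tags: python scikit-learn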

【Solution 1】:

I think you are constructing the pipeline slightly differently from what is outlined in the pipeline documentation.

Are you looking for this?

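from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import preprocessing
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)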

est = SVR(kernel="linear")

std_scaler = preprocessing.StandardScaler()
selector = feature_selection.RFE(est)
pipe_params = [('feat_selection',selector),('std_scaler', std_scaler), ('clf', est)]
pipe = pipeline.Pipeline(pipe_params)

param_grid = dict(clf__C=[0.1, 1, 10])
clf = GridSearchCV(pipe, param_grid=param_grid, cv=2)
clf.fit(X, y)
print(clf.grid_scores_)

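See also this useful example of how to combine things in a pipeline. For the RFE object, I simply followed the official documentation to construct it with your SVR estimator, and then put the RFE object into the pipeline the same way you did with the scaler and estimator objects.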

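【Comments】:

Thanks for your answer, but your solution differs from what I expected. Your workflow: 1. During GridSearchCV, features are selected using RFE(SVR()) with the default value of C. 2. Then, the selected features are scaled. 3. SVR() is fitted with one of the parameters from param_grid. The workflow I want is: 1. Within GridSearchCV, the features are scaled. 2. SVR() is fitted with one of the parameters from param_grid. 3. Then, the features with the smallest weights are pruned from the model. 4. Steps 1-3 are repeated until the desired number of features to select is reached.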
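For illustration only, the desired workflow in the comment above can be written out by hand as a loop. This is a minimal sketch rather than scikit-learn's own RFE implementation: the function name eliminate_features and its arguments are hypothetical, one feature is dropped per round, and features are ranked by the absolute value of the linear SVR weight.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def eliminate_features(X, y, C=1.0, n_features_to_select=5):
    """Scale, fit a linear SVR, drop the weakest feature, repeat (hypothetical helper)."""
    remaining = list(range(X.shape[1]))  # column indices still in play
    while len(remaining) > n_features_to_select:
        # Steps 1-2: scaling and fitting both happen inside the pipeline.
        pipe = make_pipeline(StandardScaler(), SVR(kernel="linear", C=C))
        pipe.fit(X[:, remaining], y)
        # Step 3: the fitted SVR is the last step; its coef_ has shape (1, n_features).
        weights = np.abs(pipe.steps[-1][1].coef_).ravel()
        remaining.pop(int(np.argmin(weights)))
    return remaining


X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
selected = eliminate_features(X, y, C=1.0, n_features_to_select=5)
print(selected)
```

Wrapping a loop like this in a grid search over C would give the behavior described, which is what RFE around a pipeline is meant to automate.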

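【Solution 2】:

You have a problem with the way you are using the pipeline.

A pipeline works as follows:

When you call .fit(x,y) etc., the first object is applied to the data. If that object exposes a .transform() method, it is applied and its output is used as the input for the next stage.

A pipeline can have any valid model as its final object, but all earlier objects must expose a .transform() method.

It works just like a pipe: you feed in data, and each object in the pipeline takes the previous output and applies another transform to it.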
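As a rough sketch of the description above, fitting a two-step pipeline can be compared with the same two stages written out by hand. The step names std_scaler and clf are borrowed from the question; the hasattr check at the end shows the attribute the question is about.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# Fitting the pipeline: the scaler is fitted and applied, then the SVR is fitted.
pipe = Pipeline([("std_scaler", StandardScaler()), ("clf", SVR(kernel="linear"))])
pipe.fit(X, y)

# The same two stages written out by hand.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
svr = SVR(kernel="linear").fit(X_scaled, y)

# Both routes should give the same predictions.
same = np.allclose(pipe.predict(X), svr.predict(scaler.transform(X)))
print(same)

# The weights live on the last step, not on the pipeline object itself.
print(hasattr(pipe, "coef_"))
print(pipe.named_steps["clf"].coef_.shape)
```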

As we can see at

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE.fit_transform

RFE exposes a transform method, and so it should be included in the pipeline itself. For example:

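from sklearn.ensemble import RandomForestClassifier

# std_scaler and est are the scaler and estimator objects from the question
some_sklearn_model = RandomForestClassifier()
selector = feature_selection.RFE(some_sklearn_model)
pipe_params = [('std_scaler', std_scaler), ('RFE', selector), ('clf', est)]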

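There are a few problems with what you are trying to do. First, you are trying to scale a part of your data. Imagine I have two partitions, [1,1] and [10,10]. If I normalize each one by its own mean, I lose the information that my second partition lies well above the average. You should scale at the start, not in the middle.

Second, SVR does not implement a transform method, so you cannot include it as a non-final element in a pipeline.

RFE takes a model, fits it to the data, and then evaluates the weight of each feature.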
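The two-partition example above can be checked with a few lines of arithmetic. This sketch assumes that "normalize by the mean" means subtracting the mean; the helper name center is made up for the illustration.

```python
from statistics import mean

part_a = [1.0, 1.0]
part_b = [10.0, 10.0]


def center(values, reference):
    """Subtract the mean of `reference` from every entry of `values`."""
    m = mean(reference)
    return [v - m for v in values]


# Each partition centered on its own mean: the two become indistinguishable.
separately = center(part_a, part_a) + center(part_b, part_b)

# All data centered on the overall mean (5.5): the gap between partitions survives.
together = center(part_a + part_b, part_a + part_b)

print(separately)  # [0.0, 0.0, 0.0, 0.0]
print(together)    # [-4.5, -4.5, 4.5, 4.5]
```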

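EDIT:

If you wish, you can get this behavior by wrapping the sklearn pipeline in your own class. What we want to do is, when we fit the data, retrieve the last estimator's .coef_ attribute and store it locally in our derived class under the correct name. I suggest you look at the source code on GitHub, as this is only a first attempt and will probably need more error checking etc. Sklearn uses a function decorator called @if_delegate_has_method, which would be a handy thing to add to make sure the approach generalizes. I have run this code to make sure it runs, but nothing more.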

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import preprocessing
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR

class myPipe(pipeline.Pipeline):

    def fit(self, X,y):
        """Calls last elements .coef_ method.
        Based on the sourcecode for decision_function(X).
        Link: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/pipeline.py
        ----------
        """

        super(myPipe, self).fit(X,y)

        self.coef_=self.steps[-1][-1].coef_
        return self

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

est = SVR(kernel="linear")

selector = feature_selection.RFE(est)
std_scaler = preprocessing.StandardScaler()
pipe_params = [('std_scaler', std_scaler),('select', selector), ('clf', est)]

pipe = myPipe(pipe_params)



selector = feature_selection.RFE(pipe)
clf = GridSearchCV(selector, param_grid={'estimator__clf__C': [2, 10]})
clf.fit(X, y)

print(clf.best_params_)
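A possible variant of the subclass above, sketched under the assumption that a read-only property is acceptable: instead of copying coef_ during fit, the lookup is forwarded to the last step whenever it is requested. The class name CoefPipeline is hypothetical, and the imports use the current module layout (sklearn.pipeline, sklearn.feature_selection) rather than the older one used in this thread.

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


class CoefPipeline(Pipeline):
    """Hypothetical Pipeline subclass that forwards coef_ from its last step."""

    @property
    def coef_(self):
        # Looked up at access time, so it always reflects the current last step.
        return self.steps[-1][1].coef_


X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
pipe = CoefPipeline([("std_scaler", StandardScaler()), ("clf", SVR(kernel="linear"))])

# RFE clones the pipeline, refits it on shrinking feature sets, and reads coef_.
selector = RFE(pipe).fit(X, y)
print(selector.support_)
```

Because nothing is stored on the pipeline itself, there is no stale copy of the weights to keep in sync after a refit.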

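If anything is unclear, please ask.

【Comments】:

Thanks for the answer, @Chris. I did not describe the problem clearly. I specifically want to know how to access the "coef_" or "feature_importances_" attributes of the estimator, which seem to be hidden by the pipeline.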
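For readers on a newer scikit-learn, here is one possible way to reach the hidden attribute without subclassing. It assumes a release in which RFE accepts the importance_getter parameter (0.24 or later, as far as I know) and in which GridSearchCV is imported from sklearn.model_selection; neither applies to the version used in this thread. The step names std_scaler and clf come from the question.

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

pipe = Pipeline([("std_scaler", StandardScaler()), ("clf", SVR(kernel="linear"))])

# importance_getter is an attribute path, resolved on the fitted pipeline,
# that tells RFE where to find the weights.
selector = RFE(pipe, importance_getter="named_steps.clf.coef_")

# Scaling, fitting and pruning are all repeated for each value of C.
clf = GridSearchCV(selector, param_grid={"estimator__clf__C": [0.1, 1, 10]}, cv=5)
clf.fit(X, y)

print(clf.best_params_)
print(clf.best_estimator_.support_)
```

With the default n_features_to_select, RFE keeps half of the features, so five of the ten columns remain selected here.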