Answers: recursive feature elimination with cross-validation for regression in scikit-learn

【Question title】: Recursive feature elimination with cross validation for regression in scikit-learn
【Posted】: 2017-03-30 09:38:38
【Question】:

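I want to apply a wrapper method such as recursive feature elimination to my regression problem using scikit-learn. Recursive feature elimination with cross-validation gives a good overview of how to tune the number of features automatically.

I tried this: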

from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

modelX = LogisticRegression()
rfecv = RFECV(estimator=modelX, step=1, scoring='mean_absolute_error')
rfecv.fit(df_normdf, y_train)
print("Optimal number of features : %d" % rfecv.n_features_)

# Plot number of features VS. cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()

But I get an error message like this:

The least populated class in y has only 1 members, which is too few.
The minimum number of labels for any class cannot be less than n_folds=3. % (min_labels, self.n_folds)), Warning)

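The warning sounds as if I had a classification problem, but my task is a regression problem. What can I do to get a result, and what went wrong?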

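【Comments】:

Would you mind showing us your y_train?
My y_train has 1 column and ~10.000 rows, with values between 1 and 200.
Are the values integers? If so, I think it is being treated as a multiclass classification problem. Try converting the values to floats.
That makes sense. I converted the values to floats, but I get the same warning.
I see it now; I will try to write up an answer.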

Tags: python matplotlib scikit-learn wrapper rfe

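【Solution 1】:

Here is what is happening:

By default, when the user does not specify the number of folds, the cross-validation in RFECV uses 3-fold cross-validation (hence the n_folds=3 in the warning). So far so good.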

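However, if you look at the documentation, it also uses StratifiedKFold, which builds the folds so that the percentage of samples of each class is preserved. Since (according to the error) some values in your output y occur only once, they cannot be in 3 different folds at the same time. That raises the error!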
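To see why stratification kicks in, and why casting y to float (as tried in the comments) changes nothing, here is a small sketch. The array y_int is made up for illustration, and the snippet assumes that RFECV picks its splitter through sklearn.model_selection.check_cv with classifier=is_classifier(estimator), which is how I understand the implementation; check it against your installed version.

```python
import numpy as np
from sklearn.model_selection import check_cv
from sklearn.utils.multiclass import type_of_target

# Hypothetical target with integer values, loosely modelled on the question
y_int = np.array([1, 5, 5, 17, 42, 42, 42, 200])

# Integer values are classified as a multiclass target
kind_int = type_of_target(y_int)
# Floats that are all whole numbers are still treated as multiclass
kind_float = type_of_target(y_int.astype(float))
# Only genuinely non-integer floats count as a continuous target
kind_cont = type_of_target(y_int + 0.5)
print(kind_int, kind_float, kind_cont)

# With a classifier and a multiclass y, the stratified splitter is chosen;
# with a non-classifier, plain KFold is chosen for the same y
cv_for_classifier = check_cv(3, y_int, classifier=True)
cv_for_regressor = check_cv(3, y_int, classifier=False)
print(type(cv_for_classifier).__name__, type(cv_for_regressor).__name__)
```

So the deciding factor is the estimator as much as y: LogisticRegression is a classifier, which is what sends an integer-valued y down the stratified path.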

The error comes from here.

You then need to use an unstratified K-fold: KFold.

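The documentation of RFECV says: "If the estimator is a classifier or if y is neither binary nor multiclass, sklearn.model_selection.KFold is used." (Read "is a classifier" as "is not a classifier": the stratified splitter is only chosen when the estimator is a classifier and y is binary or multiclass.)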
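A minimal sketch of the fix, under a few assumptions: the data is synthetic (make_regression stands in for df_normdf and y_train, which are not available here), LinearRegression is swapped in for LogisticRegression because the latter is a classifier, and the scorer name 'neg_mean_absolute_error' is used since the plain 'mean_absolute_error' string is not accepted by newer scikit-learn releases.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic stand-in for the asker's data (an assumption for illustration)
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=4, noise=5.0, random_state=0
)

rfecv = RFECV(
    estimator=LinearRegression(),  # a regressor, not a classifier
    step=1,                        # drop one feature per iteration
    cv=KFold(n_splits=3, shuffle=True, random_state=0),  # unstratified folds
    scoring="neg_mean_absolute_error",
)
rfecv.fit(X, y)

print("Optimal number of features: %d" % rfecv.n_features_)
print("Selected feature mask:", rfecv.support_)
```

Passing KFold explicitly makes the choice visible, although with a regressor as estimator an integer cv should already resolve to KFold. The plotting code from the question may also need adjusting: in recent releases grid_scores_ has been replaced by cv_results_.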

【Comments】: