How to decide the threshold value in SelectFromModel() for selecting features?

【Question title】: How to decide threshold value in SelectFromModel() for selecting features?
【Posted】: 2018-08-26 23:40:56
【Question】:

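I am using a random forest classifier for feature selection. I have 70 features in total and want to select the most important ones. The code below fits the classifier and prints the features from most to least important.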

Code:

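import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Feature names: every column of the DataFrame except the first
feat_labels = data.columns[1:]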
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# Train the classifier
clf.fit(X_train, y_train)

importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))

Now I am trying to use SelectFromModel from sklearn.feature_selection, but how do I decide the threshold value for a given dataset?

from sklearn.feature_selection import SelectFromModel

# Create a selector object that will use the random forest classifier to identify
# features that have an importance of at least 0.15
sfm = SelectFromModel(clf, threshold=0.15)

# Train the selector
sfm.fit(X_train, y_train)

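When I try threshold=0.15 and then train my model, I get an error saying that the data is too noisy or the selection is too strict.

However, with threshold=0.015 I can train my model on the newly selected features. So how should I decide this threshold value?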
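
A likely reason 0.15 selects nothing: a random forest's feature_importances_ are normalized to sum to 1, so with 70 features the average importance is only about 1/70 ≈ 0.014, and it is quite possible that no single feature reaches 0.15. The sketch below is illustrative only. It uses synthetic data from make_classification as a stand-in for the real dataset, and count_selected is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real 70-feature dataset (assumption).
X, y = make_classification(n_samples=500, n_features=70,
                           n_informative=10, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
importances = clf.feature_importances_

# Importances are normalized, so their mean is 1 / n_features.
print("sum: %.3f, mean: %.4f" % (importances.sum(), importances.mean()))


def count_selected(importances, threshold):
    """Hypothetical helper: number of features with importance >= threshold,
    which is the rule SelectFromModel applies to a numeric threshold."""
    return int(np.sum(importances >= threshold))


# Compare the two thresholds from the question with data-driven ones.
for t in (0.15, 0.015, importances.mean(), np.median(importances)):
    print("threshold=%.4f keeps %d features" % (t, count_selected(importances, t)))
```

SelectFromModel also accepts string thresholds such as "mean", "median", or a scaled form like "1.25*mean", which adapt to the number of features instead of relying on a fixed number.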

【Comments】:

Tags: python pandas numpy machine-learning scikit-learn

【Solution 1】:

I would try the following approach:

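1. Start with a low threshold, for example 1e-4.
2. Reduce your features using SelectFromModel (fit and transform).
3. Compute a metric (accuracy, etc.) for your estimator (RandomForestClassifier in your case) on the selected features.
4. Increase the threshold and repeat from step 2.

With this approach you can estimate the threshold that works best for your particular data and estimator.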
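
The procedure above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the data is synthetic (make_classification), the threshold grid and the 3-fold cross-validated accuracy are arbitrary choices, and sweep_thresholds is a hypothetical helper name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the real 70-feature dataset (assumption).
X, y = make_classification(n_samples=400, n_features=70,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


def sweep_thresholds(X_train, y_train, thresholds, cv=3):
    """Hypothetical helper: for each threshold (in increasing order), select
    features with SelectFromModel and score a fresh forest on them."""
    results = []
    for t in thresholds:
        # Step 2: fit the selector and find which features it keeps.
        sfm = SelectFromModel(
            RandomForestClassifier(n_estimators=100, random_state=0),
            threshold=t,
        ).fit(X_train, y_train)
        mask = sfm.get_support()
        if not mask.any():
            # Nothing passes this threshold, so larger ones will not either.
            break
        # Step 3: cross-validated accuracy on the selected features only.
        est = RandomForestClassifier(n_estimators=100, random_state=0)
        score = cross_val_score(est, X_train[:, mask], y_train, cv=cv).mean()
        results.append((t, int(mask.sum()), score))
    return results


# Steps 1 and 4: start low and increase the threshold.
thresholds = [1e-4, 0.005, 0.01, 0.02, 0.05, 0.15]
results = sweep_thresholds(X_train, y_train, thresholds)
for t, n_features, score in results:
    print("threshold=%.4f  features=%2d  cv accuracy=%.3f" % (t, n_features, score))

best = max(results, key=lambda r: r[2])
print("best threshold by cv accuracy:", best[0])
```

Because the selector here is fitted on the full training split before cross-validation, the scores are somewhat optimistic. Wrapping the selector and the classifier in a Pipeline would avoid that, at the cost of handling the case where a threshold selects no features.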

【Comments】: