【Question title】: Feature selection for multi-step regression with SelectFromModel and MultiOutputRegressor. How to get selected features and their feature importance?
【Posted】: 2021-08-06 16:17:05
【Question】:

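I want to use sklearn.feature_selection.SelectFromModel to select features in a multi-step regression problem. The regression problem uses a combination of MultiOutputRegressor and RandomForestRegressor to predict several values. When I try to get the selected features with SelectFromModel.get_support(), it raises an error indicating that I need to make some feature_importances_ accessible for the method to work. It is possible to access the feature_importances_ of a MultiOutputRegressor, as shown in this question. However, I am not sure how to pass these feature_importances_ to the SelectFromModel class correctly.

Here is what I have done so far: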

# make sample data
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
 
X, y = make_regression(n_samples=100, n_features=100, n_targets=5)
print(X.shape, y.shape)
 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, shuffle=True)
 
# get important features for prediction problem:
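import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.multioutput import MultiOutputRegressor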
 
regr_multirf = MultiOutputRegressor(RandomForestRegressor(n_estimators = 100))
regr_multirf = regr_multirf.fit(X_train, y_train)
sel = SelectFromModel(regr_multirf, max_features= int(np.floor(X_train.shape[1] / 3.)))
sel.fit(X_train, y_train)
sel.get_support()
 
# to get feature_importances_ of Multioutputregressor use line:
regr_multirf.estimators_[1].feature_importances_

Output:

---------------------------------------------------------------------------
 
ValueError                                Traceback (most recent call last)
 
<ipython-input-72-a1d635ad4a34> in <module>()
      5 sel = SelectFromModel(regr_multirf, max_features= int(np.floor(X_train.shape[1] / 3.)))
      6 sel.fit(X_train, y_train)
----> 7 sel.get_support()
 
2 frames
 
/usr/local/lib/python3.7/dist-packages/sklearn/feature_selection/_from_model.py in _get_feature_importances(estimator, norm_order)
     30             "`feature_importances_` attribute. Either pass a fitted estimator"
     31             " to SelectFromModel or call fit before calling transform."
---> 32             % estimator.__class__.__name__)
     33 
     34     return importances
 
ValueError: The underlying estimator MultiOutputRegressor has no `coef_` or `feature_importances_` attribute. Either pass a fitted estimator to SelectFromModel or call fit before calling transform.
 

Any help and hints would be greatly appreciated.

【Comments】:

Tags: python scikit-learn


【Answer 1】:

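In sklearn's MultiOutputRegressor, each target gets its own model, as stated in the documentation: "This strategy consists of fitting one regressor per target." This means you need to compute the feature importances for each random forest regressor inside the MultiOutputRegressor. The feature importances of the individual regressors are not stored directly on the MultiOutputRegressor. Instead, you can extract each regressor (also called an estimator) from the fitted MultiOutputRegressor object with regr_multirf.estimators_[# of regressor you want], where regr_multirf is your fitted MultiOutputRegressor.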
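To make the one-regressor-per-target structure concrete, here is a minimal sketch. The data sizes, n_estimators and random_state values are assumptions chosen only to keep the example small and repeatable; they are not taken from the question.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Small illustrative data set (sizes are assumptions): 10 features, 3 targets.
X, y = make_regression(n_samples=60, n_features=10, n_targets=3, random_state=0)

multi = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=10, random_state=0)
).fit(X, y)

# One fitted sub-model per target column of y.
n_sub_models = len(multi.estimators_)

# The wrapper itself does not expose feature_importances_ ...
has_own_importances = hasattr(multi, "feature_importances_")

# ... but every sub-model does, with one value per input feature.
importance_shapes = [est.feature_importances_.shape for est in multi.estimators_]

print(n_sub_models, has_own_importances, importance_shapes)
```

This is why SelectFromModel cannot find importances on the wrapper: they only exist on the entries of estimators_.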

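So you do not need SelectFromModel to retrieve the feature importances of a multi-output sklearn regression model; use each estimator directly, as explained in this question, on which this answer relies heavily. Your approach only works for methods that can natively predict a multivariate target and do not train a separate model for each target.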
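As a sketch of the natively multi-output case: RandomForestRegressor itself accepts a 2D y, so it can be handed to SelectFromModel without the MultiOutputRegressor wrapper. Note that this fits one forest for all targets, which is a different model from one forest per target. The data sizes, random_state values and the use of threshold=-np.inf (so that max_features is the only selection criterion) are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Illustrative data (sizes are assumptions): 30 features, 3 targets.
X, y = make_regression(n_samples=100, n_features=30, n_targets=3, random_state=0)

max_features = X.shape[1] // 3

# A single random forest handles the 2D target natively and exposes
# one feature_importances_ vector that SelectFromModel can read.
native_sel = SelectFromModel(
    RandomForestRegressor(n_estimators=20, random_state=0),
    threshold=-np.inf,          # select purely by rank, capped by max_features
    max_features=max_features,
)
native_sel.fit(X, y)

native_support = native_sel.get_support()                        # boolean mask
native_importances = native_sel.estimator_.feature_importances_  # one value per feature

print(np.flatnonzero(native_support))
```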

In your case, you can retrieve the feature importances directly from the fitted regressor regr_multirf via:
# make sample data
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.feature_selection import SelectFromModel
import numpy as np
import pandas as pd
 
X, y = make_regression(n_samples=100, n_features=100, n_targets=5)
print(X.shape, y.shape)
 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, shuffle=True)

regr_multirf = MultiOutputRegressor(RandomForestRegressor(n_estimators = 100))
regr_multirf = regr_multirf.fit(X_train, y_train)

# now extract the estimator from your regression model
# this estimator carries the feature importances
# you're interested in
# You can also loop the following code
# over all your targets

no_est = 0 # index of target you want feature importance for
# get estimator
est = regr_multirf.estimators_[no_est]
# get feature importances
feature_importances = pd.DataFrame(est.feature_importances_,
                                   columns=['importance']).sort_values('importance')
print(feature_importances)
feature_importances.plot(kind = 'barh')

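Output: (the printed feature_importances DataFrame and a horizontal bar chart of the sorted importances)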
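If the goal is still a boolean mask from SelectFromModel.get_support(), one possible route is to tell SelectFromModel how to read importances from the wrapper. The sketch below assumes scikit-learn 0.24 or later, where SelectFromModel accepts a callable importance_getter. The helper mean_importances is a hypothetical name introduced here, and averaging across targets is only one possible aggregation (a maximum or a weighted mean would be equally defensible). Data sizes and random_state values are also assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.multioutput import MultiOutputRegressor


def mean_importances(multi_output_model):
    """Hypothetical helper: average feature_importances_ over the per-target estimators."""
    per_target = np.array(
        [est.feature_importances_ for est in multi_output_model.estimators_]
    )
    return per_target.mean(axis=0)


# Illustrative data (sizes are assumptions): 30 features, 3 targets.
X, y = make_regression(n_samples=100, n_features=30, n_targets=3, random_state=0)

model = MultiOutputRegressor(RandomForestRegressor(n_estimators=20, random_state=0))
max_features = X.shape[1] // 3

sel = SelectFromModel(
    model,
    importance_getter=mean_importances,  # callable receives the fitted estimator
    threshold=-np.inf,                   # select purely by rank, capped by max_features
    max_features=max_features,
)
sel.fit(X, y)

support = sel.get_support()                       # boolean mask over features
selected_idx = np.flatnonzero(support)            # indices of the kept features
all_importances = mean_importances(sel.estimator_)
selected_importances = all_importances[selected_idx]

for idx, imp in zip(selected_idx, selected_importances):
    print(f"feature {idx}: mean importance {imp:.4f}")
```

Without threshold=-np.inf, SelectFromModel also applies its default mean-importance threshold, so fewer than max_features features may be kept.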

【Comments】:
