How to select the top n variables using RFE in R? Answers

[Question title]: How to select top n number of variables using RFE in r?
[Posted]: 2021-05-21 09:06:41
[Question description]:

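After some preprocessing (including OneHotEncoding), the training data has 63 variables.

Problem: when I run rfe with sizes = c(15), it reports results for both the 15-variable and the 63-variable subsets. Because the 63-variable result has slightly higher accuracy, it is the one selected by default.

Instead of 63, I want to get the top 15 variables, because the difference in results is very small while the computational cost would be lower.

After reading the post below, I realized I could use optVariables[1:15]: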

retrieve selected variables from caret recursive feature elimination (rfe) results

Question: if I use RFE_single_size$optVariables[1:15], does it pick the top 15 variables out of the 63-variable result, or does it return the variables of the 15-variable result?


control <- rfeControl(functions = rfFuncs, method = "cv", verbose = FALSE)

system.time(
  RFE_single_size <- rfe(x = train_both_sample,    #  selected_vars[, 1:44]
                 y = pull(Y_train), 
                 sizes = c(15),
                 rfeControl = control
                 )
)

RFE_single_size

RFE results

Recursive feature selection

Outer resampling method: Cross-Validated (10 fold) 

Resampling performance over subset size:

 Variables Accuracy  Kappa AccuracySD KappaSD Selected
        15   0.9646 0.9293   0.007279 0.01451         
        63   0.9702 0.9404   0.006592 0.01315        *

The top 5 variables (out of 63):
   duration, age, campaign, euribor3m, nr.employed

I want to change the selection from 63 to 15 variables, so that I can be sure the top 15 I pick come from the 15-variable result that was returned.
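One way to express "take the smaller subset when the accuracy loss is tiny" is a tolerance rule: pick the smallest subset size whose metric is within some percentage of the best one. The sketch below reimplements that rule in base R on the two rows printed above; `pick_within_tolerance` is a hypothetical helper name, and the 1.5% tolerance is an assumption. As far as I know, caret ships a comparable helper called `pickSizeTolerance` that can be assigned to the `selectSize` element of a copy of `rfFuncs`, but check the caret documentation before relying on that.

```r
# Resampling summary copied from the printed rfe output above.
results <- data.frame(
  Variables = c(15, 63),
  Accuracy  = c(0.9646, 0.9702)
)

# Hypothetical helper: smallest subset size whose metric is within
# `tol` percent of the best value (the metric is assumed to be maximised).
pick_within_tolerance <- function(res, metric = "Accuracy", tol = 1.5) {
  best <- max(res[[metric]])
  pct_loss <- (best - res[[metric]]) / best * 100
  min(res$Variables[pct_loss <= tol])
}

chosen_size <- pick_within_tolerance(results, tol = 1.5)
print(chosen_size)
```

With the numbers from this post the 15-variable subset loses well under 1% accuracy relative to the best, so a 1.5% tolerance would prefer it; a stricter tolerance such as 0.5% would keep all 63.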

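About the data: the data is taken from the open-source "Bank Marketing Response" classification problem.

Update: added a GitHub link with the code (Rmd) and the data CSV file: https://github.com/johnsnow09/RFE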

str(train_both_sample)

'data.frame':   2884 obs. of  63 variables:
 $ age                          : num  31 45 33 47 30 43 23 42 43 37 ...
 $ job.admin.                   : num  0 0 0 0 0 0 0 0 0 1 ...
 $ job.blue.collar              : num  1 0 0 1 0 1 0 0 1 0 ...
 $ job.entrepreneur             : num  0 0 0 0 0 0 1 0 0 0 ...
 $ job.housemaid                : num  0 0 0 0 0 0 0 0 0 0 ...
 $ job.management               : num  0 0 0 0 0 0 0 1 0 0 ...
 $ job.retired                  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ job.self.employed            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ job.services                 : num  0 1 0 0 0 0 0 0 0 0 ...
 $ job.student                  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ job.technician               : num  0 0 1 0 1 0 0 0 0 0 ...
 $ job.unemployed               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ job.unknown                  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ marital.divorced             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ marital.married              : num  1 1 1 1 0 1 1 1 1 1 ...
 $ marital.single               : num  0 0 0 0 1 0 0 0 0 0 ...
 $ marital.unknown              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ education.basic.4y           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ education.basic.6y           : num  0 0 0 1 0 0 0 0 0 0 ...
 $ education.basic.9y           : num  1 0 0 0 0 0 0 0 1 0 ...
 $ education.high.school        : num  0 1 0 0 0 1 0 0 0 1 ...
 $ education.illiterate         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ education.professional.course: num  0 0 1 0 1 0 0 0 0 0 ...
 $ education.university.degree  : num  0 0 0 0 0 0 1 1 0 0 ...
 $ education.unknown            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ default.no                   : num  1 1 1 0 1 1 1 1 1 0 ...
 $ default.unknown              : num  0 0 0 1 0 0 0 0 0 1 ...
 $ default.yes                  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ housing.no                   : num  0 0 1 0 1 0 0 0 1 0 ...
 $ housing.unknown              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ housing.yes                  : num  1 1 0 1 0 1 1 1 0 1 ...
 $ loan.no                      : num  1 1 1 1 1 1 1 0 1 1 ...
 $ loan.unknown                 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ loan.yes                     : num  0 0 0 0 0 0 0 1 0 0 ...
 $ contact.cellular             : num  0 0 1 1 1 1 1 0 0 1 ...
 $ contact.telephone            : num  1 1 0 0 0 0 0 1 1 0 ...
 $ month.Mar                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ month.Apr                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ month.May                    : num  1 0 0 1 0 0 0 1 0 0 ...
 $ month.Jun                    : num  0 0 0 0 0 0 0 0 1 0 ...
 $ month.Jul                    : num  0 1 0 0 1 0 0 0 0 1 ...
 $ month.Aug                    : num  0 0 1 0 0 0 1 0 0 0 ...
 $ month.Sep                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ month.Oct                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ month.Nov                    : num  0 0 0 0 0 1 0 0 0 0 ...
 $ month.Dec                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ day_of_week.fri              : num  0 0 1 0 0 0 0 1 0 0 ...
 $ day_of_week.mon              : num  0 0 0 1 0 1 0 0 0 1 ...
 $ day_of_week.thu              : num  0 0 0 0 0 0 1 0 1 0 ...
 $ day_of_week.tue              : num  1 1 0 0 1 0 0 0 0 0 ...
 $ day_of_week.wed              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ duration                     : num  97 68 335 208 136 107 87 123 246 204 ...
 $ campaign                     : num  2 4 3 4 2 2 1 1 2 3 ...
 $ pdays                        : num  999 999 999 999 999 999 999 999 999 999 ...
 $ previous                     : num  0 0 0 1 0 0 0 0 0 0 ...
 $ poutcome.failure             : num  0 0 0 1 0 0 0 0 0 0 ...
 $ poutcome.nonexistent         : num  1 1 1 0 1 1 1 1 1 1 ...
 $ poutcome.success             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ emp.var.rate                 : num  1.1 1.4 1.4 -1.8 1.4 -0.1 1.4 1.1 1.4 1.4 ...
 $ cons.price.idx               : num  94 93.9 93.4 92.9 93.9 ...
 $ cons.conf.idx                : num  -36.4 -42.7 -36.1 -46.2 -42.7 -42 -36.1 -36.4 -41.8 -42.7 ...
 $ euribor3m                    : num  4.86 4.96 4.97 1.3 4.96 ...

[Question discussion]:

Tags: r classification r-caret rfe

[Solution 1]:

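Since I did not get any answers to this SO post, I tried two approaches.

First approach: take the first 15 entries of optVariables,

RFE_single_size$optVariables[1:15]

and then run vif on those variables.

Second approach: select all 22 important variables recorded for the 15-variable subset size (there are 22 rather than 15 because each of the 10 folds keeps a slightly different set of variables), and average their importance across folds: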

rfe_selected_vars <- RFE_single_size$variables %>% 
  filter(Variables == 15) %>% 
  group_by(var) %>% 
  summarise(Overall = mean(Overall)) %>% 
  arrange(desc(Overall)) %>% 
  pull(var) 

rfe_selected_vars

[1] "duration"                      "age"                           "campaign"                     
 [4] "euribor3m"                     "nr.employed"                   "cons.conf.idx"                
 [7] "cons.price.idx"                "housing.no"                    "housing.yes"                  
[10] "day_of_week.thu"               "job.admin."                    "marital.married"              
[13] "loan.no"                       "marital.single"                "job.technician"               
[16] "emp.var.rate"                  "day_of_week.mon"               "education.university.degree"  
[19] "education.high.school"         "day_of_week.tue"               "day_of_week.wed"              
[22] "education.professional.course"
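To see why filtering on one subset size can return more variables than that size, here is a small base R sketch on made-up data. `toy_variables` is a hypothetical stand-in for `RFE_single_size$variables` (same column names, invented numbers): three folds each keep their own best 2 variables, and the union has 3. The sketch also counts how many folds kept each variable, since a mean taken over one fold is less trustworthy than a mean taken over all of them.

```r
# Toy stand-in for RFE_single_size$variables (made-up numbers):
# three folds, each keeping its own best 2 variables.
toy_variables <- data.frame(
  Overall   = c(10, 8, 9, 5, 11, 6),
  var       = c("duration", "age", "duration", "campaign", "duration", "age"),
  Variables = 2,
  Resample  = rep(c("Fold1", "Fold2", "Fold3"), each = 2),
  stringsAsFactors = FALSE
)

at_size <- toy_variables[toy_variables$Variables == 2, ]

# Mean importance per variable over the folds in which it was kept.
mean_imp <- tapply(at_size$Overall, at_size$var, mean)
# Number of folds that kept each variable.
n_folds <- table(at_size$var)

ranked <- data.frame(
  var     = names(mean_imp),
  Overall = as.numeric(mean_imp),
  n_folds = as.integer(n_folds[names(mean_imp)]),
  stringsAsFactors = FALSE
)
ranked <- ranked[order(-ranked$Overall), ]
print(ranked)
```

In this toy case `campaign` appears in only one fold, so it might be reasonable to rank by mean importance but break ties, or filter, using the fold count.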

Then I ran vif on these variables.

Selected variables for the formula:

y ~ duration + age + campaign + nr.employed + cons.conf.idx + cons.price.idx + housing.yes + day_of_week.thu + job.admin. + marital.married + loan.no +
job.technician + education.high.school + day_of_week.tue + day_of_week.wed + education.professional.course
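For readers unfamiliar with the VIF step, the idea can be sketched in base R: the variance inflation factor of predictor j is 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the other predictors. The data below is simulated and the names `x1`, `x2`, `x3` and `vif_manual` are made up for illustration; this is not the exact call used in the answer, which does not say which vif implementation was used.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1, so highly collinear
x3 <- rnorm(n)                 # unrelated to the other two
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j taken from regressing
# predictor j on all remaining predictors.
vif_manual <- sapply(names(X), function(v) {
  fit <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif_manual, 1))
```

Note that a full set of one-hot dummies for one factor (for example housing.no, housing.unknown and housing.yes together) is linearly dependent, which likely explains why some dummies from the 22-variable list do not appear in the final formula.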

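Conclusion:

The overall list of variables hardly differed between the two cases. The MARS results on the train data were roughly the same as those from the second case, and there was no improvement on the test data.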

[Discussion]: