【Question title】: How to select top n number of variables using RFE in r?
【Posted】: 2021-05-21 09:06:41
【Question description】:

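The data shown below is what I have after some preprocessing (including OneHotEncoding).

Issue: when I run rfe with sizes = c(15), it produces results for 15 and 63 Variables. Because the 63 Variables result has slightly higher accuracy, it is selected by default.

Instead of 63, I want to get the top 15 variables, because the difference in results is small while the computational cost would be lower.

After reading the post below, I realized I could use optVariables[1:15]

retrieve selected variables from caret recursive feature elimination (rfe) results

Doubt: if I use RFE_single_size$optVariables[1:15], does it select the top 15 vars from the set of 63 returned variables, or the vars from the 15 Variables result?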


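library(caret)   # rfe(), rfeControl(), rfFuncs
library(dplyr)   # pull()

# Random-forest-based RFE, resampled with cross-validation
control <- rfeControl(functions = rfFuncs, method = "cv", verbose = FALSE)

system.time(
  RFE_single_size <- rfe(x = train_both_sample,    #  selected_vars[, 1:44]
                         y = pull(Y_train),        # outcome column as a vector
                         sizes = c(15),            # subset size to evaluate besides the full set
                         rfeControl = control
                         )
)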

RFE_single_size

RFE result

Recursive feature selection

Outer resampling method: Cross-Validated (10 fold) 

Resampling performance over subset size:

 Variables Accuracy  Kappa AccuracySD KappaSD Selected
        15   0.9646 0.9293   0.007279 0.01451         
        63   0.9702 0.9404   0.006592 0.01315        *

The top 5 variables (out of 63):
   duration, age, campaign, euribor3m, nr.employed

I want to change the selection from 63 to 15 Variables, so that I can be sure the top 15 I pick come from the 15 Variables result that was returned.
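
One way to make the "small difference" argument concrete is to compute the percent loss in accuracy relative to the best subset and pick the smallest size that stays within a tolerance. The sketch below does this in base R on the two rows of the printout above. The helper name `smallest_within_tolerance` and the 1 percent tolerance are assumptions made for illustration, not part of the original code; caret documents a helper with a similar purpose, `pickSizeTolerance`, which is not used here.

```r
# Resampling results copied from the rfe printout above.
results <- data.frame(
  Variables = c(15, 63),
  Accuracy  = c(0.9646, 0.9702)
)

# Hypothetical helper: return the smallest subset size whose metric is
# within `tol` percent of the best value (higher metric = better).
smallest_within_tolerance <- function(res, metric = "Accuracy", tol = 1) {
  best <- max(res[[metric]])
  pct_loss <- (best - res[[metric]]) / best * 100
  min(res$Variables[pct_loss <= tol])
}

# Percent loss of the 15-variable result relative to the 63-variable one.
pct_loss_15 <- (0.9702 - 0.9646) / 0.9702 * 100

# With a 1 percent tolerance, the 15-variable subset qualifies.
chosen_size <- smallest_within_tolerance(results, tol = 1)
print(pct_loss_15)
print(chosen_size)
```

With a stricter tolerance such as 0.5 percent, the same helper falls back to 63, so the tolerance is a judgment call. caret's documentation also describes an `update()` method for `rfe` objects with a `size` argument for refitting at a chosen subset size; it is worth checking against the installed version before relying on it.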

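About the data: the data is taken from the open-source "Bank Marketing response" classification problem.

Update: added a GitHub link for the code (rmd) and the data csv file: https://github.com/johnsnow09/RFE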

str(train_both_sample)

'data.frame':   2884 obs. of  63 variables:
 $ age                          : num  31 45 33 47 30 43 23 42 43 37 ...
 $ job.admin.                   : num  0 0 0 0 0 0 0 0 0 1 ...
 $ job.blue.collar              : num  1 0 0 1 0 1 0 0 1 0 ...
 $ job.entrepreneur             : num  0 0 0 0 0 0 1 0 0 0 ...
 $ job.housemaid                : num  0 0 0 0 0 0 0 0 0 0 ...
 $ job.management               : num  0 0 0 0 0 0 0 1 0 0 ...
 $ job.retired                  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ job.self.employed            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ job.services                 : num  0 1 0 0 0 0 0 0 0 0 ...
 $ job.student                  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ job.technician               : num  0 0 1 0 1 0 0 0 0 0 ...
 $ job.unemployed               : num  0 0 0 0 0 0 0 0 0 0 ...
 $ job.unknown                  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ marital.divorced             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ marital.married              : num  1 1 1 1 0 1 1 1 1 1 ...
 $ marital.single               : num  0 0 0 0 1 0 0 0 0 0 ...
 $ marital.unknown              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ education.basic.4y           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ education.basic.6y           : num  0 0 0 1 0 0 0 0 0 0 ...
 $ education.basic.9y           : num  1 0 0 0 0 0 0 0 1 0 ...
 $ education.high.school        : num  0 1 0 0 0 1 0 0 0 1 ...
 $ education.illiterate         : num  0 0 0 0 0 0 0 0 0 0 ...
 $ education.professional.course: num  0 0 1 0 1 0 0 0 0 0 ...
 $ education.university.degree  : num  0 0 0 0 0 0 1 1 0 0 ...
 $ education.unknown            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ default.no                   : num  1 1 1 0 1 1 1 1 1 0 ...
 $ default.unknown              : num  0 0 0 1 0 0 0 0 0 1 ...
 $ default.yes                  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ housing.no                   : num  0 0 1 0 1 0 0 0 1 0 ...
 $ housing.unknown              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ housing.yes                  : num  1 1 0 1 0 1 1 1 0 1 ...
 $ loan.no                      : num  1 1 1 1 1 1 1 0 1 1 ...
 $ loan.unknown                 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ loan.yes                     : num  0 0 0 0 0 0 0 1 0 0 ...
 $ contact.cellular             : num  0 0 1 1 1 1 1 0 0 1 ...
 $ contact.telephone            : num  1 1 0 0 0 0 0 1 1 0 ...
 $ month.Mar                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ month.Apr                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ month.May                    : num  1 0 0 1 0 0 0 1 0 0 ...
 $ month.Jun                    : num  0 0 0 0 0 0 0 0 1 0 ...
 $ month.Jul                    : num  0 1 0 0 1 0 0 0 0 1 ...
 $ month.Aug                    : num  0 0 1 0 0 0 1 0 0 0 ...
 $ month.Sep                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ month.Oct                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ month.Nov                    : num  0 0 0 0 0 1 0 0 0 0 ...
 $ month.Dec                    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ day_of_week.fri              : num  0 0 1 0 0 0 0 1 0 0 ...
 $ day_of_week.mon              : num  0 0 0 1 0 1 0 0 0 1 ...
 $ day_of_week.thu              : num  0 0 0 0 0 0 1 0 1 0 ...
 $ day_of_week.tue              : num  1 1 0 0 1 0 0 0 0 0 ...
 $ day_of_week.wed              : num  0 0 0 0 0 0 0 0 0 0 ...
 $ duration                     : num  97 68 335 208 136 107 87 123 246 204 ...
 $ campaign                     : num  2 4 3 4 2 2 1 1 2 3 ...
 $ pdays                        : num  999 999 999 999 999 999 999 999 999 999 ...
 $ previous                     : num  0 0 0 1 0 0 0 0 0 0 ...
 $ poutcome.failure             : num  0 0 0 1 0 0 0 0 0 0 ...
 $ poutcome.nonexistent         : num  1 1 1 0 1 1 1 1 1 1 ...
 $ poutcome.success             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ emp.var.rate                 : num  1.1 1.4 1.4 -1.8 1.4 -0.1 1.4 1.1 1.4 1.4 ...
 $ cons.price.idx               : num  94 93.9 93.4 92.9 93.9 ...
 $ cons.conf.idx                : num  -36.4 -42.7 -36.1 -46.2 -42.7 -42 -36.1 -36.4 -41.8 -42.7 ...
 $ euribor3m                    : num  4.86 4.96 4.97 1.3 4.96 ...

【Comments】:

    Tags: r classification r-caret rfe


    【Solution 1】:

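    Since I did not receive any answer to the SO post, I tried two approaches:

    1. RFE_single_size$optVariables[1:15]

    Then ran vif on these variables.

    2. From the 15 Variables result, took all 22 important variables (there are 22 rather than 15 because each of the 10 folds kept a different set of variables) and averaged their importance across folds.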

    rfe_selected_vars <- RFE_single_size$variables %>% 
      filter(Variables == 15) %>% 
      group_by(var) %>% 
      summarise(Overall = mean(Overall)) %>% 
      arrange(desc(Overall)) %>% 
      pull(var) 
    
    rfe_selected_vars
    
     [1] "duration"                      "age"                           "campaign"                     
     [4] "euribor3m"                     "nr.employed"                   "cons.conf.idx"                
     [7] "cons.price.idx"                "housing.no"                    "housing.yes"                  
    [10] "day_of_week.thu"               "job.admin."                    "marital.married"              
    [13] "loan.no"                       "marital.single"                "job.technician"               
    [16] "emp.var.rate"                  "day_of_week.mon"               "education.university.degree"  
    [19] "education.high.school"         "day_of_week.tue"               "day_of_week.wed"              
    [22] "education.professional.course"
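
The reason the list above has more than 15 names can be reproduced on a small scale. The sketch below uses a made-up data frame shaped like `RFE_single_size$variables` (columns `Resample`, `Variables`, `var`, `Overall`); all values and the name `toy_vars` are hypothetical. It repeats the averaging step in base R and then truncates to the requested size, which is an added step the pipeline above does not perform.

```r
# Toy stand-in for RFE_single_size$variables: one row per variable kept
# in a fold, with its importance score (Overall). All values are made up.
toy_vars <- data.frame(
  Resample  = c("Fold1", "Fold1", "Fold2", "Fold2", "Fold3", "Fold3"),
  Variables = 2,
  var       = c("duration", "age", "duration", "campaign", "duration", "age"),
  Overall   = c(30, 12, 28, 10, 32, 14)
)

# Average importance per variable over the folds in which it appears,
# then sort from most to least important.
avg_imp <- aggregate(Overall ~ var, data = toy_vars, FUN = mean)
avg_imp <- avg_imp[order(avg_imp$Overall, decreasing = TRUE), ]

# Each fold kept 2 variables, but the union across folds has 3 names.
all_vars <- as.character(avg_imp$var)

# Truncate to the requested subset size to get exactly the top n.
top_n <- head(all_vars, 2)
print(all_vars)
print(top_n)
```

Applied to the real object, the same idea would mean taking `head(rfe_selected_vars, 15)` if exactly 15 names are wanted. Note that a variable kept in only a few folds is averaged over those folds only, so its rank can look stronger than its overall support suggests.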
    

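    Then ran vif on these variables.

    Variables selected for the formula:

    y ~ duration + age + campaign + nr.employed + cons.conf.idx + cons.price.idx + housing.yes + day_of_week.thu + job.admin. + marital.married + loan.no +
    job.technician + education.high.school + day_of_week.tue + day_of_week.wed + education.professional.course

    Conclusion:

    The overall lists of vars differ only slightly between the two cases. With the second case, the MARS results on the train data were approximately the same, and there was no improvement on the test data.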

    【Discussion】:
