Correlation cutoff in caret train preProcess: answers

[Question title]: Correlation cutoff in caret train preProcess
[Posted]: 2018-06-14 17:40:12
[Question]:

I am building a C5.0 model using the caret package in R.

control <- trainControl(method = "repeatedcv", 
                    number = 10, 
                    repeats = 3, 
                    classProbs = TRUE, 
                    sampling = 'smote',
                    returnResamp="all",
                    summaryFunction = twoClassSummary)

grid <- expand.grid(.winnow = c(FALSE, TRUE), 
                 .trials = c(1, 5,10,15,20,25,30,40,45,50), 
                 .model= c("tree"),
                 .splits=c(2,5,10,15,20,25,50))

c5_model <- train(label ~ .,
              data = train,
              trControl = control, 
              method = c5info,
              tuneGrid = grid, 
              preProcess = c("center", "scale", "nzv","corr"),
              verbose = FALSE)

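Is it possible to pass a custom cutoff to the preProcess step for the correlation filter, for example 0.75 or any other value I choose?

[Comments on the question]:

Tags: r correlation r-caret preprocessor

[Solution 1]:

You can specify preprocessing options through the preProcOptions argument of trainControl: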

library(caret)
library(mlbench) #for the data
data(Sonar)

ctrl <-trainControl(method = "repeatedcv", 
                    number = 10, 
                    repeats = 3, 
                    classProbs = TRUE, 
                    sampling = 'smote',
                    returnResamp="all",
                    summaryFunction = twoClassSummary,
                    preProcOptions = list(cutoff = 0.75)) # all preProcess options go in this list

An example ranger model:

grid <- expand.grid(.mtry = c(2,5,10),
                    .min.node.size = 2,
                    .splitrule = "gini")

fit_model <- train(Class ~ .,
                  data = Sonar,
                  trControl = ctrl, 
                  metric = "ROC",
                  method = "ranger",
                  tuneGrid = grid,
                  preProcess = c("center", "scale", "nzv","corr"),
                  verbose = FALSE)

fit_model$preProcess
#output
Created from 679 samples and 60 variables

Pre-processing:
  - centered (26)
  - ignored (0)
  - removed (34)
  - scaled (26)

With a different cutoff:

ctrl2 <-trainControl(method = "repeatedcv", 
                    number = 10, 
                    repeats = 3, 
                    classProbs = TRUE, 
                    sampling = 'smote',
                    returnResamp="all",
                    summaryFunction = twoClassSummary,
                    preProcOptions = list(cutoff = 0.6))

fit_model2 <- train(Class ~ .,
                   data = Sonar,
                   trControl = ctrl2, 
                   metric = "ROC",
                   method = "ranger",
                   tuneGrid = grid,
                   preProcess = c("center", "scale", "nzv","corr"),
                   verbose = FALSE)

fit_model2$preProcess
#output
Created from 679 samples and 60 variables

Pre-processing:
  - centered (23)
  - ignored (0)
  - removed (37)
  - scaled (23)

More columns are removed.

With preProcOptions = list(cutoff = 0.95) (model refit in the same way as fit_model3):

fit_model3$preProcess
#output
Created from 679 samples and 60 variables

Pre-processing:
  - centered (55)
  - ignored (0)
  - removed (5)
  - scaled (55)

So the cutoff is being applied: a lower cutoff removes more columns, a higher one removes fewer.
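
To see the effect of the cutoff without fitting any model, the correlation filter can also be run on its own. The following is a minimal sketch, assuming the caret and mlbench packages are installed; the object names (x, cutoffs, n_removed, pp, x_filtered) are illustrative. The counts it prints are not expected to match the train output above, because train applies SMOTE sampling and the nzv filter as well.

```r
library(caret)
library(mlbench)  # for the Sonar data
data(Sonar)

# Keep only the 60 numeric predictors (drop the outcome column)
x <- Sonar[, setdiff(names(Sonar), "Class")]

# Count how many predictors findCorrelation flags at each cutoff
cutoffs <- c(0.6, 0.75, 0.95)
n_removed <- sapply(cutoffs, function(co) {
  length(findCorrelation(cor(x), cutoff = co))
})
names(n_removed) <- cutoffs
print(n_removed)

# The same filter through preProcess, which takes 'cutoff' directly
pp <- preProcess(x, method = "corr", cutoff = 0.75)
x_filtered <- predict(pp, x)
ncol(x_filtered)  # number of predictors kept
```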

Similarly, you can pass any other preProcess option in the same list. See

?caret::preProcess

for all of them.
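
As an illustration, several options can be set together in one preProcOptions list. This is a sketch only: the values are arbitrary, ctrl_opts is a hypothetical name, and each option only matters if the matching method ("corr", "nzv", "pca") is requested in the preProcess argument of train.

```r
library(caret)

ctrl_opts <- trainControl(
  method = "cv",
  number = 5,
  preProcOptions = list(
    cutoff    = 0.75,     # correlation threshold used by "corr"
    freqCut   = 90 / 10,  # frequency ratio threshold used by "nzv"
    uniqueCut = 5,        # percent-unique threshold used by "nzv"
    thresh    = 0.9       # cumulative variance kept by "pca"
  )
)

# Inspect what will be handed to preProcess inside train
str(ctrl_opts$preProcOptions)
```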

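[Comments]:

The initial question is solved, but when the model runs it gives the error "Error in findCorrelation_fast(x = x, cutoff = cutoff, verbose = verbose) : The correlation matrix has some missing values." How can I tell the function to use pairwise-complete correlations?
Are there any NAs in the data? If so, try removing or imputing them. Does the problem persist?
Would passing na.action = na.pass to train cut it?
I'm sure there are no NAs in the data... strange?
It is hard to troubleshoot this kind of behavior without the data. What I would do is run just preProcess with "corr" on the data (outside of train) and see whether it works. If it does not, I would run the R function cor. If that still does not work, I would try to find out why. If I could not, I would post another question here with a subset of the data that reproduces the problem.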
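
The check suggested in the last comment might look like the sketch below. It uses base R only and made-up data. It shows one way a correlation matrix can contain missing values even though the data has no NAs: a column with zero standard deviation. Whether that is what happened in the asker's case is an assumption, since the thread never shows the data. The names x, cm, bad and cm_ok are illustrative.

```r
set.seed(1)

# Toy data with no missing values, but one constant column
x <- data.frame(a = rnorm(20), b = rnorm(20), const = rep(1, 20))
stopifnot(!anyNA(x))

# cor() warns that the standard deviation is zero and returns NA entries
cm <- suppressWarnings(cor(x))
anyNA(cm)

# Find the columns responsible: those with zero standard deviation
bad <- names(which(sapply(x, function(col) sd(col) == 0)))
print(bad)

# Recompute without them; 'use' asks cor() for pairwise-complete
# correlations, which matters only if the real data does contain NAs
cm_ok <- cor(x[, setdiff(names(x), bad)], use = "pairwise.complete.obs")
anyNA(cm_ok)
```

Note that pairwise-complete correlation does not help with a constant column, so checking for zero-variance predictors first is usually worthwhile.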