【Question title】: Correlation cutoff in caret train preProcess
【Posted】: 2018-06-14 17:40:12
【Question】:

I am building a C5.0 model with the caret package in R.

control <- trainControl(method = "repeatedcv", 
                    number = 10, 
                    repeats = 3, 
                    classProbs = TRUE, 
                    sampling = 'smote',
                    returnResamp="all",
                    summaryFunction = twoClassSummary)

grid <- expand.grid(.winnow = c(FALSE, TRUE), 
                 .trials = c(1, 5,10,15,20,25,30,40,45,50), 
                 .model= c("tree"),
                 .splits=c(2,5,10,15,20,25,50))

c5_model <- train(label ~ .,
              data = train,
              trControl = control, 
              method = c5info,
              tuneGrid = grid, 
              preProcess = c("center", "scale", "nzv","corr"),
              verbose = FALSE)

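Is it possible to pass a custom cutoff to the preProcess step for the correlation filter, for example 0.75 or any other value I choose?

【Comments】:

    Tags: r correlation r-caret preprocessor


    【Answer 1】:

    You can specify the pre-processing options in trainControl, through its preProcOptions argument: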

    library(caret)
    library(mlbench) #for the data
    data(Sonar)
    
    ctrl <-trainControl(method = "repeatedcv", 
                        number = 10, 
                        repeats = 3, 
                        classProbs = TRUE, 
                        sampling = 'smote',
                        returnResamp="all",
                        summaryFunction = twoClassSummary,
                        preProcOptions = list(cutoff = 0.75)) # all go in this list
    

    A ranger model as an example:

    grid <- expand.grid(.mtry = c(2,5,10),
                        .min.node.size = 2,
                        .splitrule = "gini")
    
    fit_model <- train(Class ~ .,
                      data = Sonar,
                      trControl = ctrl, 
                      metric = "ROC",
                      method = "ranger",
                      tuneGrid = grid,
                      preProcess = c("center", "scale", "nzv","corr"),
                      verbose = FALSE)
    
    fit_model$preProcess
    #output
    Created from 679 samples and 60 variables
    
    Pre-processing:
      - centered (26)
      - ignored (0)
      - removed (34)
      - scaled (26)
    

    With a different cutoff:

    ctrl2 <-trainControl(method = "repeatedcv", 
                        number = 10, 
                        repeats = 3, 
                        classProbs = TRUE, 
                        sampling = 'smote',
                        returnResamp="all",
                        summaryFunction = twoClassSummary,
                        preProcOptions = list(cutoff = 0.6))
    
    fit_model2 <- train(Class ~ .,
                       data = Sonar,
                       trControl = ctrl2, 
                       metric = "ROC",
                       method = "ranger",
                       tuneGrid = grid,
                       preProcess = c("center", "scale", "nzv","corr"),
                       verbose = FALSE)
    
    fit_model2$preProcess
    #output
    Created from 679 samples and 60 variables
    
    Pre-processing:
      - centered (23)
      - ignored (0)
      - removed (37)
      - scaled (23)
    

    More columns are removed.

    With preProcOptions = list(cutoff = 0.95), and the model refit as fit_model3, fewer columns are removed:

    fit_model3$preProcess
    #output
    Created from 679 samples and 60 variables
    
    Pre-processing:
      - centered (55)
      - ignored (0)
      - removed (5)
      - scaled (55)
    

    So the cutoff is being applied.
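    As a cross-check outside train, the effect of the cutoff can be sketched by calling findCorrelation directly on the correlation matrix of the Sonar predictors. This is only an illustration: the counts need not match the ones printed above, because train computed those on the SMOTE-resampled data (679 samples rather than Sonar's 208 rows). The names cor_mat, cutoffs and n_removed are introduced here for the example.

```r
library(caret)
library(mlbench)  # for the Sonar data
data(Sonar)

# Correlation matrix of the 60 numeric predictors (column 61 is Class).
cor_mat <- cor(Sonar[, 1:60])

# For each cutoff, count the columns that findCorrelation() flags for removal.
cutoffs <- c(0.6, 0.75, 0.95)
n_removed <- sapply(cutoffs, function(ct) {
  length(findCorrelation(cor_mat, cutoff = ct))
})
names(n_removed) <- cutoffs

n_removed
```

    A lower cutoff should flag more columns, which is the same pattern as in the three fits above.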

    Similarly, you can pass any of the other pre-processing options. See

    ?caret::preProcess
    

    for the full list.
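    For instance, several options can go into the same list. The sketch below assumes the option names documented for preProcess and for the preProcOptions default of trainControl (cutoff, freqCut, uniqueCut, thresh); the values are arbitrary and ctrl_opts is a name made up for the example.

```r
library(caret)

ctrl_opts <- trainControl(
  method = "cv",
  number = 5,
  preProcOptions = list(
    cutoff    = 0.75,    # correlation cutoff for "corr"
    freqCut   = 95 / 5,  # frequency-ratio cutoff for "nzv"
    uniqueCut = 10,      # percent-unique cutoff for "nzv"
    thresh    = 0.95     # cumulative variance kept by "pca"
  )
)

# Inspect what will be handed to the pre-processing step.
str(ctrl_opts$preProcOptions)
```

    An option only matters if the matching method is listed in the preProcess argument of train; thresh, for example, has no effect unless "pca" is requested.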

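    【Comments】:

    • The original question is solved, but when the model runs it throws the error "Error in findCorrelation_fast(x = x, cutoff = cutoff, verbose = verbose): The correlation matrix has some missing values." How can I instruct the function to use pairwise complete correlations?
    • Are there NAs in the data? If so, try removing or imputing them. Does the problem persist?
    • Would passing na.action = na.pass to train do the trick?
    • I am sure there are no NAs in the data... strange?
    • It is hard to troubleshoot this kind of behavior without the data. What I would do is run preProcess with only "corr" on the data, outside of train, and see whether it works. If it does not, I would try the base R function cor. If that still fails, I would try to find out why, and if I could not, I would post another question here with a subset of the data that reproduces the problem.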
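
    The troubleshooting steps in the last comment can be sketched as follows. The data frame toy is hypothetical, and the cause it demonstrates (a constant column, whose correlations are undefined) is only one possible explanation, not a confirmed diagnosis of the commenter's data. It shows how a correlation matrix can contain NA even when the data themselves do not.

```r
library(caret)

# Hypothetical toy data: x3 is constant, so its correlations are undefined.
set.seed(1)
toy <- data.frame(x1 = rnorm(20), x2 = rnorm(20), x3 = rep(1, 20))

anyNA(toy)                             # FALSE: no missing values in the data
cor_mat <- suppressWarnings(cor(toy))  # cor() warns that a standard deviation is zero
anyNA(cor_mat)                         # TRUE: the entries involving x3 are NA

# Identify the zero-variance columns responsible for the NA entries.
bad <- names(which(sapply(toy, sd) == 0))
bad

# Run the "corr" step on its own, outside train; try() keeps the script going
# whether or not this call fails on the toy data.
res <- try(preProcess(toy, method = "corr", cutoff = 0.75), silent = TRUE)
inherits(res, "try-error")

# After dropping the constant column the correlation matrix is complete.
clean <- toy[, setdiff(names(toy), bad), drop = FALSE]
anyNA(cor(clean))
```

    If this turns out to be the cause, adding "zv" to the preProcess vector in train is one option to try. A column can also become constant within a single resample even when it varies in the full data.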