【Question Title】: Using adaboost within R's caret package
【Posted】: 2013-10-19 20:31:42
【Question】:

I have been using the ada R package for a while, and more recently, caret. According to the documentation, caret's train() function should have an option that uses ada. But caret is balking at me when I use the same syntax that sits in my ada() call.

Here's a demonstration, using the wine sample dataset.

library(doSNOW)
registerDoSNOW(makeCluster(2, type = "SOCK"))
library(caret)
library(ada)

wine = read.csv("http://www.nd.edu/~mclark19/learn/data/goodwine.csv")


set.seed(1234) #so that the indices will be the same when re-run
trainIndices = createDataPartition(wine$good, p = 0.8, list = F)
wanted = !colnames(wine) %in% c("free.sulfur.dioxide", "density", "quality",
                            "color", "white")

wine_train = wine[trainIndices, wanted]
wine_test = wine[-trainIndices, wanted]
cv_opts = trainControl(method="cv", number=10)


### Now, the example that works using ada()

results_ada <- ada(good ~ ., data = wine_train,
                   control = rpart.control(maxdepth = 30, cp = 0.01,
                                           minsplit = 20, xval = 10),
                   iter = 500)

##this works, and gives me a confusion matrix.

results_ada
     ada(good ~ ., data = wine_train, control = rpart.control(maxdepth = 30, 
     cp = 0.01, minsplit = 20, xval = 10), iter = 500)
     Loss: exponential Method: discrete   Iteration: 500 
      Final Confusion Matrix for Data:
      Final Prediction
      etc. etc. etc. etc.

##Now, the calls that don't work. 

results_ada = train(good ~ ., data = wine_train, method = "ada",
                    control = rpart.control(maxdepth = 30, cp = 0.01,
                                            minsplit = 20, xval = 10),
                    iter = 500)
   Error in train.default(x, y, weights = w, ...) : 
   final tuning parameters could not be determined
   In addition: Warning messages:
   1: In nominalTrainWorkflow(dat = trainData, info = trainInfo, method = method,  :
    There were missing values in resampled performance measures.
   2: In train.default(x, y, weights = w, ...) :
    missing values found in aggregated results

 ###this doesn't work, either

results_ada = train(good ~ ., data = wine_train, method = "ada",
                    trControl = cv_opts, maxdepth = 10, nu = 0.1, iter = 50)

  Error in train.default(x, y, weights = w, ...) : 
  final tuning parameters could not be determined
  In addition: Warning messages:
  1: In nominalTrainWorkflow(dat = trainData, info = trainInfo, method = method,  :
    There were missing values in resampled performance measures.
  2: In train.default(x, y, weights = w, ...) :
   missing values found in aggregated results

I'm guessing that train() needs additional inputs, but the warnings thrown don't give me any hints about what is missing. Also, I could be missing a dependency, but there's no hint as to what that should be...

【Comments】:

    Tags: r machine-learning data-mining classification adaboost


    【Solution 1】:

    Look up ?train and search for ada, and you will see:

    Method Value: ada from package ada with tuning parameters: iter, maxdepth, nu (classification only)

    So you must be missing the nu parameter and the maxdepth parameter.
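
    In caret these are tuning parameters, so they are normally supplied through a tuning grid rather than as plain arguments to train(). A minimal sketch (assuming wine_train and cv_opts from the question; the grid values here are only illustrative):

    ada_grid <- expand.grid(iter = c(50, 100), maxdepth = c(1, 2, 3), nu = 0.1)
    results_ada <- train(good ~ ., data = wine_train, method = "ada",
                         trControl = cv_opts, tuneGrid = ada_grid)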

    【Discussion】:

    • Look at my last call to train(); it includes all the parameters you mention: results_ada = train(good~., data=wine_train, method="ada", trControl=cv_opts, maxdepth=10, nu=0.1, iter=50)
    • Also, I tried taking out trControl=cv_opts, but it made no difference. Still getting the error.
    【Solution 2】:

    What is the data type of wine$good? If it is a factor, try stating that explicitly, like this:

    wine$good <- as.factor(wine$good)
    stopifnot(is.factor(wine$good))
    

    Reason: R packages often need some help distinguishing classification from regression scenarios, and there may be generic code inside caret that mistakenly identifies the exercise as a regression problem (ignoring the fact that ada only does classification).
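
    A quick check of what caret will see (a small sketch, assuming wine is loaded as in the question):

    class(wine$good)   # should be "factor" for classification
    levels(wine$good)  # e.g. "Bad" / "Good"

    If class() reports "character" or "numeric" instead, train() may not set the problem up as classification.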

    【Discussion】:

    • I tried your suggestion (explicitly making wine$good a factor up front), but I am still getting the error... Does the reproducible example above work on your system?
    • Finally got around to trying it, sorry. I get the same error as you and can't figure it out. method="rf" works fine, but I guess that's no consolation, i.e., you really want method="ada".
    • Aha, train(up ~ ., data=sym[,c(6, 14)], "ada"), with no parameter suggestions at all, works!
    • It seems: Tuning parameter 'nu' was held constant at a value of 0.1. Accuracy was used to select the optimal model using the largest value. The final values used for the model were iter = 50, maxdepth = 1 and nu = 0.1.
    • Can you post your reproducible example? Also, what if I do want to pass parameter values?
    【Solution 3】:

    So this seems to work:

    wineTrainInd <- wine_train[!colnames(wine_train) %in% "good"]
    wineTrainDep <- as.factor(wine_train$good)
    
    results_ada = train(x = wineTrainInd, y = wineTrainDep, method="ada")
    
    results_ada
    Boosted Classification Trees 
    
    5199 samples
       9 predictors
       2 classes: 'Bad', 'Good' 
    
    No pre-processing
    Resampling: Bootstrapped (25 reps) 
    
    Summary of sample sizes: 5199, 5199, 5199, 5199, 5199, 5199, ... 
    
    Resampling results across tuning parameters:
    
      iter  maxdepth  Accuracy  Kappa  Accuracy SD  Kappa SD
      50    1         0.732     0.397  0.00893      0.0294  
      50    2         0.74      0.422  0.00853      0.0187  
      50    3         0.747     0.437  0.00759      0.0171  
      100   1         0.736     0.411  0.0065       0.0172  
      100   2         0.742     0.428  0.0075       0.0173  
      100   3         0.748     0.442  0.00756      0.0158  
      150   1         0.737     0.417  0.00771      0.0184  
      150   2         0.745     0.435  0.00851      0.0198  
      150   3         0.752     0.449  0.00736      0.016   
    
    Tuning parameter 'nu' was held constant at a value of 0.1
    Accuracy was used to select the optimal model using  the largest value.
    The final values used for the model were iter = 150, maxdepth = 3 and nu
     = 0.1.
    

    And the reason is found in another question:

    caret::train: specify model-generation-parameters

    I think you were passing the tuning parameters as arguments while train was trying to find the optimal tuning parameters on its own. If you do want to define your own parameters, you can define a parameter grid for the grid search.
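
    If you do want fixed values, a one-row grid pins every tuning parameter, so there is nothing left for train to search over. A hedged sketch, reusing wineTrainInd and wineTrainDep from above (the values themselves are just illustrative):

    fixed_grid <- expand.grid(iter = 500, maxdepth = 30, nu = 0.1)
    results_fixed <- train(x = wineTrainInd, y = wineTrainDep, method = "ada",
                           tuneGrid = fixed_grid)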

    【Discussion】:

    【Solution 4】:

    Please include the parameters in tuneGrid:

    Grid <- expand.grid(maxdepth = 25, nu = 2, iter = 100)
    results_ada = train(good ~ ., data = wine_train, method = "ada",
                        trControl = cv_opts, tuneGrid = Grid)
      

    This will work.
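
    Once it trains, the fitted object follows the usual caret workflow. A sketch of scoring the held-out set (assuming wine_test from the question, with good stored as a factor as Solution 2 suggests):

    preds <- predict(results_ada, newdata = wine_test)
    confusionMatrix(preds, wine_test$good)  # test-set confusion matrix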

    【Discussion】: