{caret}xgTree: There were missing values in resampled performance measures

【Question title】: {caret}xgTree: There were missing values in resampled performance measures
【Posted】: 2018-10-03 16:49:48
【Question】:

I am trying to run a 5-fold XGBoost model on this dataset. When I run the following code:

  train_control<- trainControl(method="cv", 
                           search = "random", 
                           number=5,
                           verboseIter=TRUE)

  # Train Models 
  xgb.mod<- train(Vote_perc~.,
              data=forkfold, 
              trControl=train_control, 
              method="xgbTree", 
              family=binomial())

I get the following warning:

Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,  :
  There were missing values in resampled performance measures.

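In addition, the predict function runs, but every prediction is the same number. I suspect this is an intercept-only model, but I am not sure. Also, when I remove the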

search="random"

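argument, it runs fine. I want to run a random search so that I can isolate which hyperparameters might work best, but every time I try I get the warning. What am I missing? Thanks!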

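【Comments】:

I am pretty sure neither caret train nor xgboost has an argument family=binomial(). Perhaps you meant objective: "binary:logistic"? Also, Vote_perc does not appear to be a class. Can you elaborate on what you are trying to do?
@missuse The Vote_perc column is the percentage of the vote, which is what I am trying to predict. My syntax may be wrong. This was originally a beta regression, and I have done my best to convert it from a purely statistical model to an ML model, but I am new to ML.
I double-checked: objective: "binary:logistic" is xgboost notation, and caret does use family=binomial() as I have it above. FWIW, changing it to family=gaussian() did not solve the problem.
Perhaps you saw the argument family=binomial() in caret train when running glm? That works because caret passes arguments on to the underlying function. Your target variable is not suited to classification as it stands. If it works for you, consider recoding it as zero versus non-zero. If not, consider performing regression, although given how ubiquitous the zeros are I doubt it would perform well.
I changed it to a binary classification response variable and the warning persists. predict() also still gives the same prediction for every observation.

Tags: r r-caret xgboost hyperparameters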

【Solution 1】:

Here is one approach you could take with your data:

Load the data:

forkfold  <- read.csv("forkfold.csv", row.names = 1)

The problem here is that the outcome variable is 0 in 97% of the cases, and very close to zero in the remaining 3%.

length(forkfold$Vote_perc)
#output
7069

sum(forkfold$Vote_perc != 0)
#output 
212
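
The degree of imbalance follows directly from the two counts above. A minimal sketch in base R, using only the printed numbers (the variable names are illustrative and do not come from the original post):

```r
# Counts taken from the output above
n_total   <- 7069   # length(forkfold$Vote_perc)
n_nonzero <- 212    # sum(forkfold$Vote_perc != 0)

# Share of non-zero (minority) and zero (majority) outcomes
prop_nonzero <- n_nonzero / n_total
prop_zero    <- 1 - prop_nonzero

# Percentages, rounded to one decimal place
round(100 * c(nonzero = prop_nonzero, zero = prop_zero), 1)
```

A model that always predicts zero would therefore already be right about 97% of the time, which is the "No Information Rate" reported by confusionMatrix.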

You describe this as a classification problem, so I will treat it as one by converting it to a binary problem:

forkfold$Vote_perc <- ifelse(forkfold$Vote_perc != 0,
                             "one",
                             "zero")

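Since the set is highly imbalanced, using Accuracy as the selection metric is out of the question. Here I will try to maximize Sensitivity + Specificity by defining a custom evaluation function, as described here: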

fourStats <- function (data, lev = levels(data$obs), model = NULL) {
  out <- c(twoClassSummary(data, lev = levels(data$obs), model = NULL))
  coords <- matrix(c(1, 1, out["Spec"], out["Sens"]), 
                   ncol = 2, 
                   byrow = TRUE)
  colnames(coords) <- c("Spec", "Sens")
  rownames(coords) <- c("Best", "Current")
  c(out, Dist = dist(coords)[1])
}

I will specify this function in trainControl:

train_control <- trainControl(method = "cv", 
                              search = "random", 
                              number = 5,
                              verboseIter = TRUE,
                              classProbs = TRUE,
                              savePredictions = "final",
                              summaryFunction = fourStats)

set.seed(1)
xgb.mod <- train(Vote_perc~.,
                 data = forkfold, 
                 trControl = train_control, 
                 method = "xgbTree", 
                 tuneLength = 50,
                 metric = "Dist",
                 maximize = FALSE,
                 scale_pos_weight = sum(forkfold$Vote_perc == "zero")/sum(forkfold$Vote_perc == "one"))

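I use the Dist metric defined earlier in the fourStats summary function. This metric should be minimized, hence maximize = FALSE. I use random search over the tuning space and test 50 random sets of hyperparameter values (tuneLength = 50).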
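
To make the Dist metric concrete, here is a small base R sketch that repeats the construction used inside fourStats and checks it against the distance written out by hand. The sensitivity and specificity values are hypothetical, chosen only for illustration:

```r
# Hypothetical sensitivity/specificity values (not taken from the model)
sens <- 0.90
spec <- 0.80

# Same construction as in fourStats: row 1 is the ideal point (1, 1),
# row 2 is the current model
coords <- matrix(c(1, 1, spec, sens), ncol = 2, byrow = TRUE)
colnames(coords) <- c("Spec", "Sens")
rownames(coords) <- c("Best", "Current")

dist_metric <- dist(coords)[1]                    # Euclidean distance via stats::dist
by_hand     <- sqrt((1 - spec)^2 + (1 - sens)^2)  # the same quantity written out

all.equal(dist_metric, by_hand)
```

Dist is 0 for a perfect classifier and grows as either sensitivity or specificity drops, which is why it is minimized rather than maximized.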

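I also set the scale_pos_weight parameter of the xgboost function. From the help in ?xgboost:

scale_pos_weight, [default=1] Control the balance of positive and negative weights, useful for unbalanced classes. A typical value to consider: sum(negative cases) / sum(positive cases). See Parameters Tuning for more discussion. Also see Higgs Kaggle competition demo for examples: R, py1, py2, py3

I set it as recommended, to sum(negative cases) / sum(positive cases).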
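
For this data set the recommended ratio can be worked out from the counts shown earlier. A minimal sketch in base R (variable names are illustrative):

```r
# Counts from the data summary earlier in this answer
n_total    <- 7069
n_positive <- 212                     # "one": non-zero Vote_perc
n_negative <- n_total - n_positive    # "zero"

# Ratio suggested by the xgboost help text
scale_pos_weight <- n_negative / n_positive
scale_pos_weight
```

Roughly speaking, each positive case is then weighted about 32 times as heavily as a negative case during training.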

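Once the model has trained, it selects the set of hyperparameters that minimizes Dist.

To evaluate the confusion matrix on the hold-out predictions: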

caret::confusionMatrix(xgb.mod$pred$pred, xgb.mod$pred$obs)

Confusion Matrix and Statistics

          Reference
Prediction  one zero
      one   195  430
      zero   17 6427

               Accuracy : 0.9368          
                 95% CI : (0.9308, 0.9423)
    No Information Rate : 0.97            
    P-Value [Acc > NIR] : 1               

                  Kappa : 0.4409          
 Mcnemar's Test P-Value : <2e-16          

            Sensitivity : 0.91981         
            Specificity : 0.93729         
         Pos Pred Value : 0.31200         
         Neg Pred Value : 0.99736         
             Prevalence : 0.02999         
         Detection Rate : 0.02759         
   Detection Prevalence : 0.08841         
      Balanced Accuracy : 0.92855         

       'Positive' Class : one

I would say that is not too bad.
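
As a sanity check, the headline statistics can be recomputed by hand from the four cell counts of the confusion matrix above. A base R sketch (the variable names are illustrative; "one" is the positive class because it is the first factor level):

```r
# Cell counts from the confusion matrix above
tp <- 195   # predicted one,  reference one
fp <- 430   # predicted one,  reference zero
fn <- 17    # predicted zero, reference one
tn <- 6427  # predicted zero, reference zero

sensitivity       <- tp / (tp + fn)   # share of true "one" cases detected
specificity       <- tn / (tn + fp)   # share of true "zero" cases detected
pos_pred_value    <- tp / (tp + fp)   # share of "one" predictions that are correct
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(Sensitivity = sensitivity,
        Specificity = specificity,
        PPV = pos_pred_value,
        BalancedAccuracy = balanced_accuracy), 5)
```

The low positive predictive value shows the price of the high sensitivity: most "one" predictions are false alarms.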

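You can do better if you tune the cutoff threshold for the predictions. How to do this during the tuning process is described here. You can also use the out-of-fold predictions to tune the cutoff threshold. Here I will show how to do that with the pROC library: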

library(pROC)

plot(roc(xgb.mod$pred$obs, xgb.mod$pred$one),
     print.thres = TRUE)

The threshold printed on the plot is the one that maximizes Sens + Spec.
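
The idea behind that threshold can be sketched in base R without pROC. The example below uses simulated scores, not the forkfold predictions, and scans every observed score as a candidate cutoff. pROC chooses its candidate thresholds differently, so its reported value may differ slightly:

```r
set.seed(42)

# Simulated data: hypothetical scores, imbalanced like the real problem
obs  <- factor(c(rep("one", 50), rep("zero", 950)), levels = c("one", "zero"))
prob <- c(rbeta(50, 4, 2), rbeta(950, 1, 6))   # predicted probability of "one"

# Candidate cutoffs: every distinct predicted probability
cutoffs <- sort(unique(prob))

# Sensitivity + specificity at each cutoff (predict "one" when prob > cutoff)
sens_plus_spec <- vapply(cutoffs, function(th) {
  pred_one <- prob > th
  sens <- sum(pred_one & obs == "one") / sum(obs == "one")
  spec <- sum(!pred_one & obs == "zero") / sum(obs == "zero")
  sens + spec
}, numeric(1))

# Cutoff with the largest sensitivity + specificity
best_cutoff <- cutoffs[which.max(sens_plus_spec)]
best_cutoff
```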

To evaluate the out-of-fold performance using this threshold:

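# recent caret versions require both arguments to be factors with the same levels
caret::confusionMatrix(factor(ifelse(xgb.mod$pred$one > 0.369, "one", "zero"),
                              levels = levels(xgb.mod$pred$obs)),
                       xgb.mod$pred$obs)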
#output
Confusion Matrix and Statistics

          Reference
Prediction  one zero
      one   200  596
      zero   12 6261

               Accuracy : 0.914           
                 95% CI : (0.9072, 0.9204)
    No Information Rate : 0.97            
    P-Value [Acc > NIR] : 1               

                  Kappa : 0.3668          
 Mcnemar's Test P-Value : <2e-16          

            Sensitivity : 0.94340         
            Specificity : 0.91308         
         Pos Pred Value : 0.25126         
         Neg Pred Value : 0.99809         
             Prevalence : 0.02999         
         Detection Rate : 0.02829         
   Detection Prevalence : 0.11260         
      Balanced Accuracy : 0.92824         

       'Positive' Class : one

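So out of 212 non-zero entities, you detected 200.

To do better, you could try preprocessing the data, or use a better hyperparameter search routine such as the mlrMBO package for mlr, or perhaps change the learner (though I doubt you can beat xgboost here).

Also note that if high sensitivity is not the top priority, using "Kappa" as the selection metric might give a more satisfactory model.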
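
For reference, Cohen's Kappa can be computed by hand from a confusion matrix. The base R sketch below uses the counts from the first hold-out confusion matrix in this answer (default 0.5 cutoff); the variable names are illustrative:

```r
# First hold-out confusion matrix (rows = prediction, columns = reference)
cm <- matrix(c(195,  430,
                17, 6427),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("one", "zero"),
                             Reference  = c("one", "zero")))

n <- sum(cm)

# Observed agreement: share of cases on the diagonal
observed_agreement <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
expected_agreement <- sum(rowSums(cm) * colSums(cm)) / n^2

cohen_kappa <- (observed_agreement - expected_agreement) / (1 - expected_agreement)
round(cohen_kappa, 4)
```

Because Kappa corrects for chance agreement, it is not inflated by the 97% majority class the way Accuracy is.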

Finally, let's check the performance of a model with the default scale_pos_weight = 1, using the parameters already selected:

set.seed(1)
xgb.mod2 <- train(Vote_perc~.,
                  data = forkfold, 
                  trControl = train_control, 
                  method = "xgbTree", 
                  tuneGrid = data.frame(nrounds = 498,
                                        max_depth = 3,
                                        eta = 0.008833468,
                                        gamma = 4.131242,
                                        colsample_bytree = 0.4233169,
                                        min_child_weight = 3,
                                        subsample = 0.6212512),
                  metric = "Dist",
                  maximize = FALSE,
                  scale_pos_weight = 1)

caret::confusionMatrix(xgb.mod2$pred$pred, xgb.mod2$pred$obs)
#output
Confusion Matrix and Statistics

          Reference
Prediction  one zero
      one    94   21
      zero  118 6836

               Accuracy : 0.9803          
                 95% CI : (0.9768, 0.9834)
    No Information Rate : 0.97            
    P-Value [Acc > NIR] : 3.870e-08       

                  Kappa : 0.5658          
 Mcnemar's Test P-Value : 3.868e-16       

            Sensitivity : 0.44340         
            Specificity : 0.99694         
         Pos Pred Value : 0.81739         
         Neg Pred Value : 0.98303         
             Prevalence : 0.02999         
         Detection Rate : 0.01330         
   Detection Prevalence : 0.01627         
      Balanced Accuracy : 0.72017         

       'Positive' Class : one

Much worse at the default threshold of 0.5.

And the optimal threshold:

plot(roc(xgb.mod2$pred$obs, xgb.mod2$pred$one),
     print.thres = TRUE)

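The optimal threshold is 0.037, compared with the 0.369 obtained when scale_pos_weight was set as recommended. With the optimal threshold, however, both approaches yield identical predictions.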

【Comments】: