【Question title】: Get a confusion matrix from cv.glmnet
【Posted】: 2021-12-22 09:42:54
【Question description】:

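Problem statement

I am comparing several models, and my dataset is small enough that I would much rather use cross-validation than split off a validation set. One of my models is built with glm ("GLM"), the other with cv.glmnet ("GLMNET"). In pseudocode, I would like to be able to do the following: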

initialize empty 2x2 matrices GLM_CONFUSION and GLMNET_CONFUSION

# Cross validation loop
For each data point VAL in my dataset X:
  Let TRAIN be the rest of X (not including VAL)

  Train GLM on TRAIN, use it to predict VAL
  Depending on if it were a true positive, false positive, etc...
    add 1 to the correct entry in GLM_CONFUSION

  Train GLMNET on TRAIN, use it to predict VAL
  Depending on if it were a true positive, false positive, etc...
    add 1 to the correct entry in GLMNET_CONFUSION
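
As a rough illustration, the GLM half of the loop above could look like the following in base R. This is only a sketch: the toy data frame `toy`, its columns `x1`, `x2`, `y`, the logistic family, and the 0.5 cutoff are all assumptions made up for the example, not details from the question.

```r
# Leave-one-out cross-validation for a logistic glm,
# accumulating a 2x2 confusion matrix (hypothetical toy data).
set.seed(42)
n <- 40
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.5 * toy$x1 - toy$x2))

glm_confusion <- matrix(
  0, nrow = 2, ncol = 2,
  dimnames = list(predicted = c("0", "1"), actual = c("0", "1"))
)

for (i in seq_len(nrow(toy))) {
  train <- toy[-i, ]                 # everything except the held-out point
  val   <- toy[i, , drop = FALSE]    # the single held-out point

  fit  <- glm(y ~ x1 + x2, data = train, family = binomial())
  prob <- predict(fit, newdata = val, type = "response")

  # Assumed decision rule: predict class 1 when the probability exceeds 0.5
  pred   <- as.character(as.integer(prob > 0.5))
  actual <- as.character(val$y)
  glm_confusion[pred, actual] <- glm_confusion[pred, actual] + 1
}

glm_confusion
```

Each data point is predicted exactly once, so the four cells add up to the number of rows in the dataset.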

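That isn't hard to do. The problem is that cv.glmnet already uses cross-validation to choose the best value of the penalty, lambda. It would be convenient if I could have cv.glmnet automatically build the confusion matrix of the best model, i.e. my code would look like this: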

initialize empty 2x2 matrices GLM_CONFUSION and GLMNET_CONFUSION

Train GLMNET on X using cv.glmnet
Set GLMNET_CONFUSION to be the confusion matrix of lambda.1se (or lambda.min)

# Cross validation loop
For each data point VAL in my dataset X:
  Let TRAIN be the rest of X (not including VAL)

  Train GLM on TRAIN, use it to predict VAL
  Depending on if it were a true positive, false positive, etc...
    add 1 to the correct entry in GLM_CONFUSION

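Not only would this be convenient, it is to some extent necessary. The alternatives are:

1. In every iteration of the cross-validation loop, use cv.glmnet on TRAIN to find a new lambda.1se (i.e. nested cross-validation).
2. Use cv.glmnet on X to find lambda.1se, then "fix" that value and treat it as a normal model to train during the cross-validation loop (two parallel cross-validations).

The second is philosophically incorrect, because it means GLMNET would have information about what it is trying to predict in the cross-validation loop. The first would take a long time: in theory I could do it, but it might take half an hour, and I feel there should be a better way.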
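
For concreteness, the first alternative (nested cross-validation) might be sketched as below. The simulated `x` and `y`, the binomial family, `nfolds = 5`, and the 0.5 cutoff are assumptions for illustration only; the point is that lambda.1se is re-selected inside every outer iteration, using only TRAIN.

```r
library(glmnet)

# Hypothetical toy data: 50 observations, 5 predictors, binary outcome
set.seed(1)
n <- 50
p <- 5
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))

glmnet_confusion <- matrix(
  0, nrow = 2, ncol = 2,
  dimnames = list(predicted = c("0", "1"), actual = c("0", "1"))
)

# Outer loop: leave one point out
for (i in seq_len(n)) {
  # Inner cross-validation picks lambda using the training part only
  inner <- cv.glmnet(x[-i, ], y[-i], family = "binomial", nfolds = 5)

  prob <- predict(inner, newx = x[i, , drop = FALSE],
                  s = "lambda.1se", type = "response")

  # Assumed decision rule: class 1 when the probability exceeds 0.5
  pred   <- as.character(as.integer(prob > 0.5))
  actual <- as.character(y[i])
  glmnet_confusion[pred, actual] <- glmnet_confusion[pred, actual] + 1
}

glmnet_confusion
```

The cost is one full cv.glmnet run per outer iteration, which is where the long runtime mentioned above comes from on larger problems.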

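What I have looked at so far

I have read the documentation for cv.glmnet. It looks like you can't do what I am asking, but I am very new to R and to data science in general, so I may well have missed something.

I have also looked at some posts on this site that seem relevant at first glance but actually ask for something different, for example this post: tidy predictions and confusion matrix with glmnet

The post above looks similar to what I want, but it is not quite the same: it looks like they use predict.cv.glmnet to make new predictions and then build the confusion matrix of those, whereas I want the confusion matrix of the predictions made during the cross-validation step.

I am hoping someone can do one of the following:

1. Explain whether and how the confusion matrix can be created as described
2. Show that there is a third alternative besides the two I proposed
   - "Implement cv.glmnet by hand" is not a viable alternative :P
3. State conclusively that what I want is impossible and that I need to use one of the two alternatives I mentioned.

Any of these would be a perfect answer to this question (though I am hoping for option 1!)

Apologies if I have missed something simple!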

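【Comments】:

Here is an answer to a related question that you might find helpful. In general it is best to use a meta ML package to handle the tuning and evaluation of models. caret is probably the best-known such package in R, although it is dated. Newer alternatives include tidymodels and mlr3. I personally use mlr3 atm.
Here is a link to the mlr3 gallery: mlr3gallery.mlr-org.com. Search for posts tagged "nested resampling". I use mlr3 because I think it is the most flexible of the options available for R atm. It takes a little getting used to. If you don't plan to do this kind of thing often and don't need to tune ML pipelines, then caret is perhaps the best choice.
Thank you very much for pointing me in this direction! This is exactly what I needed :) Over the next few days I will go through these resources carefully to try to get proficient with these packages.

Tags: r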

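【Solution 1】:

Thanks to @missuse's suggestion, I arrived at a solution that works for me! It corresponds to option 2 in my post, and it uses the caret package.

Essentially, we need to attach a custom summary function to caret's model trainer. I mostly fumbled around for a few hours before I got it working, so there may well be a better way to do this, and I encourage others to post other answers if they know of one! My code is at the bottom (lightly modified so that it is not specific to the task I am working on).

Hopefully this helps anyone with a similar problem. Another resource I found useful for solving this is the following post: https://stats.stackexchange.com/questions/299653/caret-glmnet-vs-cv-glmnet, because it shows very clearly how to translate a call to cv.glmnet into a call to caret's train with method glmnet.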

library(caret)

# Confusion Matrix of model outputs
CM <- function(model) {
  # Need to find index of best tune found by
  # cross validation
  idx <- 1
  for (i in 1:nrow(model$results)) {
    check <- model$results[i,]
    foundBest <- TRUE
    for (col in colnames(model$bestTune)) {
      if (check[,col] != model$bestTune[,col]) {
        foundBest <- FALSE
        break
      }
    }
    if (foundBest) {
      idx <- i
      break
    }
  }
  
  # They are averaged w.r.t. the number of folds (ctrl$number)
  # hence the multiplication
  c(
    model$results[idx,]$true_pos,
    model$results[idx,]$false_pos,
    model$results[idx,]$false_neg,
    model$results[idx,]$true_neg
  ) * model$control$number
}

# Summary function from the training to give confusion metric
SummaryFunc <- function (data, lev = NULL, model = NULL) {

    # This puts our output in the right format
    out <- postResample(data$pred, data$obs)

    # Get the confusion matrix (rows = predicted class, columns = observed class).
    # Note: values other than exactly 0 or 1 become NA here, so continuous
    # predictions need to be thresholded first.
    cm <- confusionMatrix(
      factor(data$pred, levels=c(0, 1)),
      factor(data$obs, levels=c(0, 1))
    )$table

    # Add those details to the output, assuming class 1 is the positive class
    oldnames <- names(out)
    out <- c(out, cm[2, 2], cm[2, 1], cm[1, 2], cm[1, 1])
    names(out) <- c(oldnames, "true_pos", "false_pos", "false_neg", "true_neg")

    out
}


# 10-fold cross validation, as in cv.glmnet implementation
ctrl <- trainControl(
  method="cv",
  number=10,
  summaryFunction=SummaryFunc
)


# Example of standard glm
our.glm <- train(
  your_formula,
  data=your_data,
  method="glm",
  family=gaussian(link="identity"),
  trControl=ctrl,
  metric="RMSE"
)

# Example of what used to be cv.glmnet
our.glmnet <- train(
  your_feature_matrix,
  your_label_matrix,
  method="glmnet",
  family=gaussian(link="identity"),
  trControl=ctrl,
  metric="RMSE",
  tuneGrid = expand.grid(
    alpha = 1,
    lambda = seq(0.001, 0.1, by=0.001)
  )
)

CM(our.glm)
CM(our.glmnet)
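
A possible alternative to the caret route, which is not part of the original answer: recent versions of glmnet can keep the held-out ("prevalidated") fits from cv.glmnet via `keep = TRUE`, and `confusion.glmnet` can tabulate them. The sketch below assumes a binomial model, simulated `x` and `y`, and a glmnet version that provides `confusion.glmnet`; the scale of `fit.preval` has differed between versions, so it is worth checking against the installed documentation.

```r
library(glmnet)

# Hypothetical toy data: 100 observations, 10 predictors, binary outcome
set.seed(1)
n <- 100
p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))

# keep = TRUE stores the out-of-fold fits for every lambda in fit.preval
cvfit <- cv.glmnet(x, y, family = "binomial", nfolds = 10, keep = TRUE)

# One confusion table per lambda, built only from held-out predictions
cnf <- confusion.glmnet(cvfit$fit.preval, newy = y, family = "binomial")

# Position of lambda.1se on the lambda path (use lambda.min analogously)
idx <- match(cvfit$lambda.1se, cvfit$lambda)
cnf[[idx]]
```

Because every observation is predicted by the fold model that did not see it, this table reflects the same cross-validation that selected lambda, without a second outer loop. The selection of lambda itself still used all of X, so the result is not a fully nested estimate.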

【Comments】: