R: Confusion matrix in RF model returns error: `data` and `reference` should be factors with the same levels

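【Question title】: R: Confusion matrix in RF model returns error: `data` and `reference` should be factors with the same levels
【Posted】: 2023-03-21 09:03:01
【Question description】:

I am new to R and want to solve a binary classification task.

The dataset has a factor variable LABELS with 2 classes: the first is 0, the second is 1. The next picture shows the actual head of the dataset (the TimeDate column is just the index). The class distribution is computed as follows: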

print("the number of values with % in factor variable - LABELS:")
percentage <- prop.table(table(dataset$LABELS)) * 100
cbind(freq=table(dataset$LABELS), percentage=percentage)

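Class distribution result:

I also know that the Slot2 column is calculated with the formula:

Slot2 = Var3 - Slot3 + Slot4

The features Var1, Var2, Var3 and Var4 were selected after analyzing the correlation matrix.

Before modeling, I split the dataset into training and test parts. I tried to build a random forest model for the binary classification task with the following code: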

library(randomForest)

# Random forest classifier on the four selected features
rf2 <- randomForest(LABELS ~ Var1 + Var2 + Var3 + Var4,
                    data = train, ntree = 100,
                    mtry = 4, importance = TRUE)
print(rf2)

The result is:

  Call:
     randomForest(formula = LABELS ~ Var1 + Var2  + Var3 + Var4,
     data = train, ntree = 100,      mtry = 4, importance = TRUE) 

 Type of random forest: classification
 Number of trees: 100
 No. of variables tried at each split: 4

 OOB estimate of  error rate: 0.16%

 Confusion matrix:
           0      1 class.error
    0 164957    341 0.002062941
    1    280 233739 0.001196484
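
The question does not show how the train/test split was made. As a minimal sketch of one common approach, assuming the caret package is installed: `createDataPartition` draws a stratified sample so that both classes keep roughly the same proportions in each part. The toy `dataset`, the 70/30 ratio and the seed below are illustrative assumptions, not the asker's actual data or settings.

```r
library(caret)

set.seed(123)

# Hypothetical toy data standing in for the real dataset (imbalanced 40/60)
dataset <- data.frame(
  Var1   = rnorm(100),
  LABELS = factor(rep(c(0, 1), times = c(40, 60)))
)

# Stratified split on LABELS: about 70% of each class goes to train
idx   <- createDataPartition(dataset$LABELS, p = 0.7, list = FALSE)
train <- dataset[idx, ]
test  <- dataset[-idx, ]

# Class proportions should be similar in both parts
print(prop.table(table(train$LABELS)))
print(prop.table(table(test$LABELS)))
```

Because the split is stratified, `train$LABELS` and `test$LABELS` keep the same two factor levels, which matters later when predictions are compared against the reference labels.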

When I try to make predictions:

# Prediction & Confusion Matrix - train data
p1 <- predict(rf2, train, type="prob")
print("Prediction & Confusion Matrix - train data")
confusionMatrix(p1, train$LABELS)

# Prediction & Confusion Matrix - test data
p2 <- predict(rf2, test, type="prob")
print("Prediction & Confusion Matrix - test data")
confusionMatrix(p2, test$LABELS)

I get the following error in R:

[1] "Prediction & Confusion Matrix - train data"
Error: `data` and `reference` should be factors with the same levels.
Traceback:

1. confusionMatrix(p1, train$LABELS)
2. confusionMatrix.default(p1, train$LABELS)
3. stop("`data` and `reference` should be factors with the same levels.", 
 .     call. = FALSE)

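I have also already tried to fix it using ideas from the following question:

but that did not help.

Could you help me resolve this error?

I would appreciate any ideas and comments. Thanks in advance.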

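【Question comments】:

What does p1 look like? Without seeing your data, my guess is that one problem is that you are predicting the probability of each class rather than the class itself. Try changing to type = "response", which gives the single most likely class for each observation. I'm not very familiar with the confusion matrix function, but I'd guess it expects classes, not probabilities.
@camille, thank you for the suggestion. It fixed the error, but the next problem seems to be that the predictions contain only one class instead of the existing 2.
That might be an issue with your data or with the model. Please post a sample of your data for people to work with.
@camille, I added the actual dataset and the formula for the calculated Slot2 column. I also added the class distribution for the (binary) LABELS column. The classes are imbalanced. The dataset has more than 350k rows. When I tried to balance it using the trainControl function with method = "repeatedcv", I did not get a result within a reasonable time, probably because of the size of the dataset. Thanks!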

Tags: r random-forest r-caret confusion-matrix

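【Solution 1】:

The error in R:

Error: `data` and `reference` should be factors with the same levels.

was fixed by changing the type argument of the predict function. With type="prob", predict returns a matrix of class probabilities, whereas confusionMatrix expects a factor of predicted classes; type="response" returns that factor. Corrected code: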

# Prediction & Confusion Matrix - train data
p1 <- predict(rf2, train, type="response")
print("Prediction & Confusion Matrix - train data")
confusionMatrix(p1, train$LABELS)
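
To make the difference between the two `type` values concrete, here is a minimal sketch on synthetic data, assuming the randomForest and caret packages are installed. The `toy` data frame, its size and the model settings are hypothetical stand-ins for the asker's dataset, not a reproduction of it.

```r
library(randomForest)
library(caret)

set.seed(42)

# Hypothetical toy data: two numeric features and a binary factor label
n   <- 200
toy <- data.frame(Var1 = rnorm(n), Var2 = rnorm(n))
toy$LABELS <- factor(ifelse(toy$Var1 + toy$Var2 + rnorm(n, sd = 0.5) > 0, 1, 0))

fit <- randomForest(LABELS ~ Var1 + Var2, data = toy, ntree = 50)

# type = "prob": numeric matrix with one column of probabilities per class
p_prob <- predict(fit, toy, type = "prob")

# type = "response": factor of predicted classes, same levels as LABELS
p_class <- predict(fit, toy, type = "response")

print(is.matrix(p_prob))
print(is.factor(p_class))
print(identical(levels(p_class), levels(toy$LABELS)))

# confusionMatrix accepts the factor; passing p_prob would raise the error above
cm <- confusionMatrix(p_class, toy$LABELS)
print(cm$table)
```

The same check (`is.factor()` and `levels()` on both arguments) is a quick way to diagnose this error message whenever it appears.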

@Camille, thank you very much!
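
On the follow-up from the comments (imbalanced classes, and predictions that contain only one class): if class labels are derived by hand from the probability matrix, for example with a custom cutoff, the result should be turned into a factor with the same levels as the reference, otherwise the same error can come back. The sketch below uses base R only; the probability matrix `p`, the labels and the 0.5 threshold are made-up illustrative values.

```r
# Hypothetical class-probability matrix, shaped like predict(..., type = "prob") output
p <- cbind("0" = c(0.9, 0.2, 0.4, 0.7, 0.1),
           "1" = c(0.1, 0.8, 0.6, 0.3, 0.9))

# Hypothetical reference labels
reference <- factor(c(0, 1, 1, 0, 1), levels = c(0, 1))

# Custom cutoff on the probability of class "1" (an assumption; tune as needed)
threshold <- 0.5

# Force the predicted factor to carry both levels, even if one never occurs
predicted <- factor(ifelse(p[, "1"] > threshold, "1", "0"),
                    levels = levels(reference))

# Base-R confusion table; rows are predictions, columns are the reference
tab <- table(Predicted = predicted, Reference = reference)
print(tab)
```

Setting `levels = levels(reference)` explicitly keeps the table 2x2 even when every prediction falls into a single class, so a factor built this way can also be passed to `confusionMatrix`.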
