Error: `data` and `reference` should be factors with the same levels (random forest) - Answers

【Question title】: Error: `data` and `reference` should be factors with the same levels (random forest)
【Posted】: 2021-09-20 00:48:37
【Question】:

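This is the code I wrote for an assignment. I can't get a confusion matrix for my predictions. Please help me troubleshoot the code or suggest any necessary changes.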

library(caret)
library(randomForest)

set.seed(1234)
test_index1 <- createDataPartition(water_potability3$Potability, p = 0.1, list = FALSE)

water_potability_train <- water_potability3[test_index1, -c(4, 6:9)]

water_potability3_test <- water_potability3[!1:nrow(water_potability3) %in% test_index1, -c(4, 6:9)]

trf <- tuneRF(x = water_potability_train[, 1:4], y = water_potability_train$Potability)
(mintree <- trf[which.min(trf[, 2]), 1])
rf_model <- randomForest(x = water_potability_train[, -5], y = water_potability_train$Potability, mtry = mintree, importance = TRUE)

(rf_model,main="") (rf_model,main="")

preds_rf <- predict(rf_model, water_potability3_test[, -5])

table(preds_rf, water_potability3_test$Potability)

confusionMatrix(preds_rf, water_potability3_test$Potability)

Every time I build the confusion matrix, I get the error "Error: `data` and `reference` should be factors with the same levels."

【Comments】:

Tags: r

【Solution 1】:

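Since you haven't shared a dataset that would let me reproduce the error, I'll take a guess and offer the solution I would use myself. If this doesn't work for you, please provide some data and explain what the Potability column contains :-)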

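When you randomly split data into training and test partitions, you may not get observations from every class in both partitions. For example, if you have 10 classes, the smaller test partition might contain only 8 of them. Then, when your model predicts one of the other two classes that were available in the training partition, the prediction factor and the reference factor end up with different levels.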
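To check whether this is what is happening, it can help to compare the levels of the two factors directly. The following is a minimal base R sketch with made-up labels (`train_y`, `test_y` and `preds` are hypothetical stand-ins, not the asker's data). It shows how the mismatch arises and one way to re-encode both factors on a shared set of levels.

```r
# Hypothetical labels: the training data has three classes, the test set only two
train_y <- factor(c("a", "b", "c", "a", "b"))
test_y  <- factor(c("a", "b", "a"))

# Hypothetical predictions from a model that learned all three classes
preds <- factor(c("a", "c", "a"), levels = levels(train_y))

# Compare the level sets; a mismatch is what the error message complains about
levels(preds)
levels(test_y)
same_levels <- identical(levels(preds), levels(test_y))

# Re-encode both factors using the union of their levels
all_levels    <- union(levels(preds), levels(test_y))
preds_aligned <- factor(preds, levels = all_levels)
test_aligned  <- factor(test_y, levels = all_levels)

# The cross-table is now square, with empty cells for the missing class
table(preds_aligned, test_aligned)
```

If `levels()` returns `NULL` for either vector, that vector is not a factor at all, which is worth ruling out before looking at the split.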

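So I use partition() from groupdata2 with the cat_col argument to make sure every class is represented in both partitions (if possible). Then I use confusion_matrix() from cvms, because it allows the two factors to have different levels.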

library(groupdata2)
library(cvms)
library(randomForest)
set.seed(1234)

# Create list with two partitions
# where the ratio of classes in Potability are similar
parts <- partition(water_potability3[, -c(4,6:9)], 
                   p = 0.1, cat_col = "Potability")

# Extract the two partitions
water_potability3_test <- parts[[1]]
water_potability_train <- parts[[2]]

# The modeling (haven't changed anything here)
trf <- tuneRF(x = water_potability_train[, 1:4],
              y = water_potability_train$Potability) 

(mintree <- trf[which.min(trf[, 2]), 1]) 

rf_model <- randomForest(
    x = water_potability_train[, -5],
    y = water_potability_train$Potability,
    mtry = mintree,
    importance = TRUE
)

preds_rf <- predict(rf_model, water_potability3_test[, -5])

# Create confusion matrix
conf_mat <- cvms::confusion_matrix(
    targets = water_potability3_test$Potability,
    predictions = preds_rf
)

# The basic confusion matrix table
conf_mat$Table

# Or as a plot
plot_confusion_matrix(conf_mat)
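For intuition about what a class-balanced split does, here is a rough base R sketch that samples the test rows separately within each class. It is only an illustration under assumed toy data (`toy` is a hypothetical stand-in for water_potability3) and is not how groupdata2 implements partition().

```r
set.seed(1234)

# Hypothetical toy data standing in for water_potability3
toy <- data.frame(
  x = rnorm(40),
  Potability = factor(rep(c("0", "1"), times = c(30, 10)))
)

# Within each class, draw about 10% of the row indices (at least one row)
test_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$Potability),
  function(idx) idx[sample.int(length(idx), max(1, round(0.1 * length(idx))))]
), use.names = FALSE)

toy_test  <- toy[test_idx, ]
toy_train <- toy[-test_idx, ]

# Every class appears in both partitions
table(toy_test$Potability)
table(toy_train$Potability)
```

Sampling per class keeps rare classes from dropping out of the smaller partition, which a plain random split cannot guarantee.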

You can also look at cvms::evaluate(), which provides additional evaluation metrics.

Learn more

Learn more about groupdata2's train/test partitioning here: https://cran.rstudio.com//web/packages/groupdata2/vignettes/cross-validation_with_groupdata2.html

More about the cvms confusion matrix functionality here: https://cran.r-project.org/web/packages/cvms/vignettes/Creating_a_confusion_matrix.html

【Comments】: