Error in confusion matrix: the data and reference factors must have the same number of levels (answers)

【Question title】: Error in Confusion Matrix : the data and reference factors must have the same number of levels
【Posted】: 2015-07-12 04:17:54
【Question】:

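I have trained a linear regression model using the R caret package. I am now trying to generate a confusion matrix and keep getting the following error: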

Error in confusionMatrix.default(pred, testing$Final) : the data and reference factors must have the same number of levels

library(caret)

EnglishMarks <- read.csv("E:/Subject Wise Data/EnglishMarks.csv", header = TRUE)
inTrain <- createDataPartition(y = EnglishMarks$Final, p = 0.7, list = FALSE)
training <- EnglishMarks[inTrain, ]
testing <- EnglishMarks[-inTrain, ]
predictionsTree <- predict(treeFit, testdata)
confusionMatrix(predictionsTree, testdata$catgeory)
modFit <- train(Final ~ UT1 + UT2 + HalfYearly + UT3 + UT4, method = "lm", data = training)
pred <- format(round(predict(modFit, testing)))
confusionMatrix(pred, testing$Final)  # this call raises the error

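The error occurs when the confusion matrix is generated. The levels are the same on both objects, and I cannot figure out what the problem is. Their structure and levels are shown below. They should be identical. Any help would be greatly appreciated, as this is driving me crazy!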

> str(pred)
chr [1:148] "85" "84" "87" "65" "88" "84" "82" "84" "65" "78" "78" "88" "85" "86" "77" ...
> str(testing$Final)
int [1:148] 88 85 86 70 85 85 79 85 62 77 ...

> levels(pred)
NULL
> levels(testing$Final)
NULL

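【Comments】:

The clue is in your str output. See how they differ? pred is of class character, while testing$Final is of class integer. When you call format here, pred<-format(round(predict(modFit,testing))), it converts the result to character, just as it does when given a list. Why are you formatting at all? You should probably compute the RMSE or MAE of the model instead; take a look at heuristically.wordpress.com/2013/07/12/…
@infominer Now I have used pred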
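
To make the comment above concrete, here is a minimal base R sketch. The vectors `raw_pred` and `observed` are made-up illustration values, not the asker's data. It shows that `format()` turns rounded numbers into character strings, and how RMSE and MAE could be computed directly from numeric predictions instead.

```r
# Hypothetical predictions and observed marks (illustration only)
raw_pred <- c(84.6, 83.9, 87.2, 65.4)
observed <- c(88L, 85L, 86L, 70L)

# format() returns character strings; round() alone stays numeric
class(format(round(raw_pred)))  # "character"
class(round(raw_pred))          # "numeric"

# Regression error metrics computed on the numeric predictions
rmse <- sqrt(mean((raw_pred - observed)^2))
mae  <- mean(abs(raw_pred - observed))
```

Because these metrics work on the numeric output of `predict()`, no conversion to factors is needed for a regression model evaluated this way.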

Tags: r machine-learning artificial-intelligence classification linear-regression

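【Solution 1】:

This error is raised while the confusion matrix is being created. Both the predicted values and the actual values must be of type factor. If they are of any other type, convert them to factors before generating the confusion matrix. Once the conversion is done, build the confusion matrix.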

predicted <- factor(predict(treeFit, testdata))
real <- factor(testdata$catgeory)
my_data1 <- data.frame(data = predicted, type = "prediction")
my_data2 <- data.frame(data = real, type = "real")
# rbind() merges the two factor columns, so the combined column carries the union of both level sets
my_data3 <- rbind(my_data1, my_data2)
# Check if the levels are identical
identical(levels(my_data3[my_data3$type == "prediction", 1]),
          levels(my_data3[my_data3$type == "real", 1]))
confusionMatrix(my_data3[my_data3$type == "prediction", 1],
                my_data3[my_data3$type == "real", 1], dnn = c("Prediction", "Reference"))

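【Comments】:

When linking to your own site or content (or content that you are affiliated with), you must disclose your affiliation in the answer so that it is not considered spam. Under Stack Exchange policy, having the same text in your username as the URL, or mentioning it in your profile, is not considered sufficient disclosure.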

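【Solution 2】:

For a similar error, I forced the GLM predictions to have the same class as the dependent variable.

For example, a GLM predicts values of class "numeric". Because the target variable was of class "factor", I ran into an error.

Erroneous code: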

# Predicting using logistic model
glm.probs = predict(model_glm, newdata = test, type = "response")
test$pred_glm = ifelse(glm.probs > 0.5, "1", "0")

# Checking the accuracy of the logistic model
confusionMatrix(test$default, test$pred_glm)

Result:

Error: `data` and `reference` should be factors with the same levels.

Corrected code:

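# Predicting using logistic model
glm.probs = predict(model_glm, newdata = test, type = "response")
test$pred_glm = ifelse(glm.probs > 0.5, "1", "0")
# Convert the character labels to a factor so that both arguments are factors
test$pred_glm = as.factor(test$pred_glm)

# Checking the accuracy of the logistic model
# Note: confusionMatrix() takes the predictions first and the reference second;
# this call passes them the other way round, so the Prediction/Reference labels in the output are swapped
confusionMatrix(test$default, test$pred_glm)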

Result:

Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0   182  1317
         1   122 22335
                                          
               Accuracy : 0.9399          
                 95% CI : (0.9368, 0.9429)
    No Information Rate : 0.9873          
    P-Value [Acc > NIR] : 1
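
One caveat worth illustrating: `as.factor()` only creates levels for the values that actually occur. The sketch below uses made-up `probs` and `observed` vectors (assumptions, not data from this answer) to show a case where every probability exceeds the threshold, and how passing explicit `levels` to `factor()` keeps the prediction on the same level set as the reference.

```r
# Hypothetical probabilities and observed classes (illustration only)
probs    <- c(0.91, 0.85, 0.77, 0.60)
observed <- factor(c("1", "1", "0", "1"), levels = c("0", "1"))

# Every probability is above 0.5, so only "1" appears among the labels
labels <- ifelse(probs > 0.5, "1", "0")

levels(as.factor(labels))                          # "1" only
pred <- factor(labels, levels = levels(observed))  # reuse the reference levels
levels(pred)                                       # "0" "1"

identical(levels(pred), levels(observed))          # TRUE
table(Prediction = pred, Reference = observed)
```

Reusing `levels(observed)` assumes the labels produced by `ifelse()` are spelled exactly like the reference levels; any label that does not match becomes `NA`.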

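【Comments】:

【Solution 3】:

I ran into this problem because of NAs in the target variable of my dataset. If you use the tidyverse, you can use the drop_na function to remove the rows that contain NAs, like this: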

iris %>% drop_na(Species) # Removes rows where Species column has NA
iris %>% drop_na() # Removes rows where any column has NA

In base R, it might look like this:

iris[! is.na(iris$Species), ] # Removes rows where Species column has NA
na.omit(iris) # Removes rows where any column has NA

【Comments】:

【Solution 4】:

I had the same problem. I guess it happens because the data argument was not converted to a factor as I expected. Try:

confusionMatrix(pred,as.factor(testing$Final))

Hope this helps.

【Comments】:

It worked for me. Thanks for sharing :))

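【Solution 5】:

You are using regression and trying to generate a confusion matrix. I believe confusion matrices are meant for classification tasks. For regression, people generally use metrics such as R^2 and RMSE.

【Comments】:

Regression can also be used for classification tasks.
As long as it has 2 classes.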
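
As a rough illustration of the two-class case mentioned in these comments, the sketch below fits a linear model to a 0/1 outcome from the built-in `mtcars` data and thresholds the numeric predictions to obtain class labels. The 0.5 cutoff and the choice of `wt` as predictor are assumptions made for this example, and the model is evaluated on its own training data only to keep the sketch short.

```r
# mtcars ships with R; am is 0 (automatic) or 1 (manual)
fit <- lm(am ~ wt, data = mtcars)

# Threshold the numeric predictions to obtain class labels
# (0.5 is an arbitrary cutoff chosen for this sketch)
scores <- predict(fit, newdata = mtcars)
pred   <- factor(ifelse(scores > 0.5, 1, 0), levels = c(0, 1))
truth  <- factor(mtcars$am, levels = c(0, 1))

# Base R cross-tabulation of predicted vs. observed classes
cm <- table(Prediction = pred, Reference = truth)
accuracy <- sum(diag(cm)) / sum(cm)
```

Once the predictions have been turned into class labels like this, a confusion matrix is meaningful; for the raw numeric predictions it is not.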

【Solution 6】:

confusionMatrix(pred,testing$Final)

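Whenever you build a confusion matrix, make sure that both the true values and the predicted values are of the factor data type.

Here, both pred and testing$Final must be of type factor. Rather than checking the levels, check the type of both variables and convert them to factors if they are not.

Here, testing$Final is of type int. Convert it to a factor and then build the confusion matrix.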
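
A small base R sketch of the type check described above, using made-up values rather than the asker's data. It also shows why converting each vector on its own may not be enough: the two factors can end up with different level sets, which is what the error message quoted in Solution 2 complains about.

```r
# Hypothetical rounded predictions and observed marks (illustration only)
pred  <- c("85", "84", "87", "65")
final <- c(88L, 85L, 86L, 70L)

is.factor(pred)   # FALSE: character
is.factor(final)  # FALSE: integer

pred_f  <- factor(pred)
final_f <- factor(final)

# Both are now factors, but the level sets are not the same
levels(pred_f)    # "65" "84" "85" "87"
levels(final_f)   # "70" "85" "86" "88"
identical(levels(pred_f), levels(final_f))  # FALSE
```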

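【Comments】:

【Solution 7】:

Run table(pred) and table(testing$Final). You will see that at least one number in the testing set is never predicted (i.e. it never appears in pred). That is what "different number of levels" means. There is an example of a custom function that works around this problem here.

However, I found that this trick works fine: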

table(factor(pred, levels=min(test):max(test)), 
      factor(test, levels=min(test):max(test)))

It should give you exactly the same confusion matrix as the function.
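
Here is a tiny self-contained run of that trick with made-up vectors (assumptions for illustration), in which the value 3 occurs in `test` but is never predicted. Note that `min(test):max(test)` only covers the range of the observed values, so a prediction outside that range would turn into `NA`.

```r
# Hypothetical values; 3 occurs in test but is never predicted
test <- c(1, 2, 3, 2, 1)
pred <- c(1, 2, 2, 2, 1)

lv <- min(test):max(test)
cm <- table(Prediction = factor(pred, levels = lv),
            Reference  = factor(test, levels = lv))

dim(cm)                  # 3 3: the unpredicted value keeps its own row
sum(diag(cm)) / sum(cm)  # 0.8: share of exact matches
```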

【Comments】:

【Solution 8】:

The following seemed to work for me. The idea is similar to @nayriz's:

confusionMatrix(
  factor(pred, levels = 1:148),
  factor(testing$Final, levels = 1:148)
)

The key is to make sure that the factor levels match.
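
Instead of hard-coding a range such as 1:148, the shared level set could be derived from the data. The helper `align_levels` below is hypothetical (it is not part of caret) and the input values are made up; it is a sketch of the idea, assuming both vectors hold values of the same type.

```r
# Hypothetical helper, not part of caret: put two vectors on a shared level set
align_levels <- function(pred, ref) {
  lv <- sort(union(unique(pred), unique(ref)))
  list(pred = factor(pred, levels = lv),
       ref  = factor(ref,  levels = lv))
}

aligned <- align_levels(c(85, 84, 87, 65), c(88, 85, 86, 70))
identical(levels(aligned$pred), levels(aligned$ref))  # TRUE
nlevels(aligned$pred)                                  # 7
```

The two aligned factors could then be passed to `confusionMatrix()` in place of the hard-coded version.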

【Comments】: