R: variable has different number of levels in the node and in the data

【Question title】: R: variable has different number of levels in the node and in the data
【Posted】: 2018-10-07 09:11:36
【Question】:

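I want to use bnlearn for a classification task with the naive Bayes algorithm.

I am using this dataset for testing. Three of its variables are continuous (V2, V4, V10); the others are discrete. As far as I know, bnlearn cannot handle continuous variables here, so they have to be converted to factors or discretized. For now I want to convert all features to factors, but I have run into a problem. Here is my sample code: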

library(bnlearn)

dataSet <- read.csv("creditcard_german.csv", header=FALSE)
# ... split into trainSet and testSet ...

trainSet[] <- lapply(trainSet, as.factor)
testSet[] <- lapply(testSet, as.factor)

# V25 is the class variable
bn = naive.bayes(trainSet, training = "V25")
fitted = bn.fit(bn, trainSet, method = "bayes")
pred = predict(fitted , testSet)

...

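With this code I get the following error message when calling predict():

'V1' has different number of levels in the node and in the data.

When I remove V1 from the training set, I get the same error for V2. However, if I first factorize the whole data set with dataSet[] <- lapply(dataSet, as.factor) and only then split it into training and test sets, the error goes away.

So what would be an elegant solution? In real-world applications the training and test sets may come from different sources. Any ideas?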
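One way to read the "discretize" option mentioned above is to bin each continuous variable with cut points taken from the training data only, and to reuse the same cut points for the test data. The sketch below uses base R `cut` on made-up toy vectors; `train_v2`, `test_v2` and the choice of three quantile bins are assumptions for illustration, not part of the original post. Because both calls share the same breaks, both factors declare the same levels.

```r
# Toy values standing in for a continuous column such as V2 (assumed data).
train_v2 <- c(6, 12, 24, 36, 48, 60)
test_v2  <- c(9, 30, 72)

# Cut points are computed from the training data only.
breaks <- quantile(train_v2, probs = c(0, 1/3, 2/3, 1))
# Open the outer bins so that unseen test values outside the training
# range still fall into a bin instead of becoming NA.
breaks[1] <- -Inf
breaks[length(breaks)] <- Inf

train_f <- cut(train_v2, breaks = breaks)
test_f  <- cut(test_v2,  breaks = breaks)

# Same breaks, hence the same set of levels in both factors.
identical(levels(train_f), levels(test_f))
```

Compared with calling as.factor on a continuous column, which turns every distinct number into its own level, binning keeps the number of levels small and independent of which values happen to occur in each set.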

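【Comments】:

Tags: r naivebayes bnlearn

【Solution 1】:

The problem appears to be caused by my training and test data sets having different factor levels. I solved it by combining the two data frames (training and test) with rbind, applying as.factor to the combined data so that every variable gets its full set of levels, and then slicing the factorized data frame back into separate training and test sets.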

library(bnlearn)

train <- read.csv("train.csv", header = FALSE)
test <- read.csv("test.csv", header = FALSE)
len_train <- nrow(train)
len_test <- nrow(test)

# Combine both sets so that as.factor sees every value of each variable
complete <- rbind(train, test)
complete[] <- lapply(complete, as.factor)

# Split the factorized data back into the original training and test rows
train <- complete[1:len_train, ]
test <- complete[(len_train + 1):(len_train + len_test), ]

# V25 is the class variable
bn <- naive.bayes(train, training = "V25")
fitted <- bn.fit(bn, train, method = "bayes")
pred <- predict(fitted, test)
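
If the two sets cannot be combined up front (for example, because the test data arrives later), another option might be to reuse the training levels when converting the test columns. The following is only a sketch in base R: `align_levels` is a hypothetical helper name, not a bnlearn function, and the toy data frames stand in for the real training and test sets. Note that any test value that never occurred in training becomes NA and would need separate handling.

```r
# Hypothetical helper (not part of bnlearn): rebuild every column of `test`
# as a factor that declares exactly the levels of the matching `train` column.
align_levels <- function(train, test) {
  stopifnot(identical(names(train), names(test)))
  # The training columns must already be factors.
  stopifnot(all(vapply(train, is.factor, logical(1))))
  test[] <- Map(function(tr, te) factor(as.character(te), levels = levels(tr)),
                train, test)
  test
}

# Toy data standing in for the real sets (assumed, for illustration only).
train <- data.frame(V1 = c("a", "b", "c", "a"), V25 = c("1", "2", "1", "2"))
test  <- data.frame(V1 = c("a", "a"),           V25 = c("2", "2"))

train[] <- lapply(train, as.factor)
test <- align_levels(train, test)

# Both sets now declare the same levels, although `test` only contains "a".
levels(test$V1)
```

With the levels aligned this way, the test set should be usable with predict(fitted, test) in the same way as above, without ever binding the two data frames together.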

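I hope this helps.

【Discussion】:

But why should the test data need a complete representation of all the levels in the training set? Shouldn't the test data be allowed to contain only a subset of a factor's levels from the training data?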
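
To illustrate the distinction behind this question: the levels a factor declares and the values it actually contains are separate things. The small base-R sketch below (toy vectors, made up for illustration) shows that as.factor infers the levels from whatever values are present, which is why independently converted sets can disagree, whereas factor(..., levels = ) lets a set hold only a subset of the values while still declaring the full level set. Judging by the wording of the error message, it is the declared levels that are being compared.

```r
train_v1 <- as.factor(c("a", "b", "c"))
test_v1  <- as.factor(c("a", "b"))   # "c" never occurs in the test data

nlevels(train_v1)   # levels inferred from three distinct values
nlevels(test_v1)    # levels inferred from two distinct values

# Declaring the training levels explicitly keeps the level sets identical,
# even though the test values are still only a subset of them.
test_v1_fixed <- factor(c("a", "b"), levels = levels(train_v1))
nlevels(test_v1_fixed)
table(test_v1_fixed)   # the unused level "c" is listed with a count of zero
```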