randomForest Error: NA not permitted in predictors (but no NAs in data): answers

【Question title】: randomForest Error: NA not permitted in predictors (but no NAs in data)
【Posted】: 2014-07-20 12:30:39
【Question description】:

I am trying to run the "genie3" algorithm in R, which uses the "randomForest" method (ref: http://homepages.inf.ed.ac.uk/vhuynht/software.html).

I get the following error:

> weight.matrix<-get.weight.matrix(tmpLog2FC, input.idx=1:4551)
Starting RF computations with 1000 trees/target gene,
and 67 candidate input genes/tree node
Computing gene 1/11805
Error in randomForest.default(x, y, mtry = mtry, ntree = nb.trees, importance = TRUE,  : 
NA not permitted in predictors

So I checked my data for NAs, but there are none:

> NAs<-sapply(tmpLog2FC, function(x) sum(is.na(x)))
> length(which(NAs!=0))
[1] 0

I then tried editing the "get.weight.matrix()" function itself to omit NAs (just in case) by changing this line:

rf <- randomForest(x, y, mtry=mtry, ntree=nb.trees, importance=TRUE, ...)

to:

rf <- randomForest(x, y, mtry=mtry, ntree=nb.trees, importance=TRUE, na.action=na.omit)

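I then sourced the code and double-checked that it contained the change by calling the function name on its own (which prints the actual script):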

    }
    target.gene.name <- gene.names[target.gene.idx]
    # remove target gene from input genes
    these.input.gene.names <- setdiff(input.gene.names, target.gene.name)
    x <- expr.matrix[,these.input.gene.names]
    y <- expr.matrix[,target.gene.name]
    rf <- randomForest(x, y, mtry=mtry, ntree=nb.trees, importance=TRUE, na.action=na.omit)

But when I tried to rerun it, I got the same error:

Error in randomForest.default(x, y, mtry = mtry, ntree = nb.trees, importance = TRUE,  : 
NA not permitted in predictors

Has anyone run into something similar? Any ideas about what I can do?

Thanks in advance.

*EDIT: As suggested, I reran with debug:

> weight.matrix<-get.weight.matrix(tmpLog2FC, input.idx=1:4551)
Starting RF computations with 1000 trees/target gene,
and 67 candidate input genes/tree node
Computing gene 1/11805
Error in randomForest.default(x, y, mtry = mtry, ntree = nb.trees, importance = TRUE,  : 
NA not permitted in predictors
Called from: randomForest(x, y, mtry = mtry, ntree = nb.trees, importance = TRUE, 
na.action = na.omit)
Browse[1]> 
>

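The debug output shows that the line I suspected is the one raising the error, and that it is running in its edited form with "na.action=na.omit". Now I am even more confused. How can a dataset with no NAs produce this error when the code is told to omit NAs?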

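【Question comments】:

One more thing to try: use debug() to step through the function and check for NAs at various points along the way, right up until randomForest is called. You may find that some NAs are creeping in that way.
Thanks for the suggestion. I have edited my post with the debug run.
You misunderstood me. Somewhere, somehow, NAs are being introduced. The first step in successful debugging is to start believing the error message. We just haven't looked hard enough yet for where it happens. I suggest you step through get.weight.matrix one line at a time, testing expr.matrix for NAs repeatedly as you go. You also need to check x right before the randomForest call.
Oh! What do I spy! An Inf! See... you absolutely would have seen the consequences of this through the debugging process I described. The function transposes the matrix and then scales the columns. But an Inf will result in a mean of Inf and an sd of NaN. Divide one by the other and you get NA. Which means that further along in the function, expr.matrix does indeed contain NAs, just as R told you.
mean and sd both have an na.rm argument that will remove NAs, but it does nothing for Inf. Even if you removed the Infs from the mean and sd calculations, the Inf would still be part of the arithmetic on the original vector and would produce another Inf, whose effect would probably still be bad. You would be better off removing the cases with Inf first (is.finite), or better yet, first investigating why they are there and whether this procedure makes sense for data in which you have divided by zero.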
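
The mechanism described in the last two comments can be reproduced on a toy matrix. This is only a sketch: the matrix `m`, the column names `geneA`/`geneB`, and the `standardize` helper are hypothetical, and the helper is an assumption about what "scales the columns" means in get.weight.matrix, based on the comment above rather than on the actual source.

```r
# Hypothetical toy data: one Inf, no NA (not the asker's data)
m <- matrix(c(1, 2, 3, Inf,
              5, 6, 7, 8), ncol = 2,
            dimnames = list(NULL, c("geneA", "geneB")))

anyNA(m)            # FALSE: is.na() does not flag Inf, so an NA check passes
all(is.finite(m))   # FALSE: is.finite() does catch it

mean(m[, "geneA"])  # Inf
sd(m[, "geneA"])    # NaN

# Assumed form of the column scaling (center by mean, divide by sd)
standardize <- function(x) (x - mean(x)) / sd(x)
s <- apply(m, 2, standardize)
anyNA(s)            # TRUE: geneA is now NaN, which is.na() reports as missing

# Drop rows containing non-finite values before scaling
keep  <- apply(m, 1, function(r) all(is.finite(r)))
clean <- m[keep, , drop = FALSE]
anyNA(apply(clean, 2, standardize))   # FALSE
```

Whether to drop rows or columns (or to fix the upstream log2 fold-change calculation instead) depends on how the data are laid out and why the Inf is there, as the comment above points out.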

Tags: r machine-learning bioinformatics random-forest na

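【Answer 1】:

You can use the following command to list the rows in which any predictor has no value:

data[!complete.cases(data),]

Check those rows carefully. In my case, rows with no values at all, ",,,,,,,,," (in my file the predictor columns are comma-separated), showed up as NA when RF ran.

You can remove those rows.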
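
As a minimal sketch of that approach, assuming a hypothetical data frame `dat` with one entirely empty row (the column names are made up for illustration):

```r
# Hypothetical data; the second row has no values at all
dat <- data.frame(x1 = c(1.2, NA, 3.4),
                  x2 = c(0.5, NA, 0.7),
                  y  = c(10,  NA, 12))

# Show the rows that have at least one missing value
dat[!complete.cases(dat), ]

# Keep only the complete rows
dat_complete <- dat[complete.cases(dat), ]
nrow(dat_complete)
```

Note that complete.cases() only looks for NA/NaN; it does not flag Inf, so it would not have caught the problem identified in the question's comments.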

Thanks

【Comments】: