R randomForest for classification

【Question title】: R randomForest for classification
【Posted】: 2012-12-18 02:00:03
【Question description】:

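I am trying to use randomForest for classification, but I repeatedly get an error message with no apparent solution (randomForest has worked well for me for regression in the past). I have pasted my code below. "success" is a factor and all of the predictor variables are numeric. Any suggestions on how to get this classification to run properly?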

> rf_model<-randomForest(success~.,data=data.train,xtest=data.test[,2:9],ytest=data.test[,1],importance=TRUE,proximity=TRUE)

Error in randomForest.default(m, y, ...) : 
  NA/NaN/Inf in foreign function call (arg 1)

Also, here is a sample of the dataset:

head(data)

success duration  goal reward_count updates_count comments_count backers_count     min_reward_level max_reward_level
True 20.00000  1500           10            14              2            68                1             1000
True 30.00000  3000           10             4              3            48                5             1000
True 24.40323 14000           23             6             10           540                5             1250
True 31.95833 30000            9            17              7           173                1            10000
True 28.13211  4000           10            23             97          2936               10              550
True 30.00000  6000           16            16            130          2043               25              500

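【Comments】:

Without a fully reproducible example, no. At the very least, I would (1) check that there are no NA values in your data and (2) run traceback() to see whether you can get more detail on where the error occurs.
Try changing the "success" values to category names (like species names) rather than "True". Can you show us the output of str(data)?
I see you have already accepted an answer; I ran into this problem and found that, for classification, it was because my response variable was of class chr. Either running data$var <- as.factor(data$var) or fitting with randomForest(as.factor(data$var) ~ ., ...) solved it for me.
Use lapply(your_data, class) and check whether any columns are of class "character".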
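The two checks suggested in the comments can be sketched as follows. The data frame `toy` and its values are made up for illustration and are not the asker's data; only base R is used.

```r
# Hypothetical toy data; stringsAsFactors = FALSE keeps `success` as character
toy <- data.frame(
  success  = c("True", "False", "True", "True"),
  duration = c(20, 30, 24.4, 31.9),
  goal     = c(1500, 3000, 14000, 30000),
  stringsAsFactors = FALSE
)

# Class of every column
col_classes <- sapply(toy, class)
print(col_classes)

# Names of the columns that are still character vectors
char_cols <- names(col_classes)[col_classes == "character"]
print(char_cols)

# Convert the response to a factor before fitting a classifier
toy$success <- as.factor(toy$success)
print(class(toy$success))
```

If `char_cols` is non-empty, those columns are the first thing to look at before calling randomForest.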

Tags: r classification data-analysis random-forest

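【Solution 1】:

Have you tried regression on the same data? If not, check your data for "Inf" values and, once NA and NaN have been removed, remove any Inf values as well. The following question has useful information on finding and removing Inf: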

R is there a way to find Inf/-Inf values?

Example:

Class V1    V2  V3  V4  V5  V6  V7  V8  V9
1   11  Inf 4   232 23  2   2   34  0.205567767
1   11  123 4   232 23  1   2   34  0.162357601
1   13  123 4   232 23  1   2   34  -0.002739357
1   13  123 4   232 23  1   2   34  0.186989878
2   67  14  4   232 67  1   2   34  0.109398677
2   67  14  4   232 67  2   2   34  0.18491187
2   67  14  4   232 34  2   2   34  0.098728256
2   44  769.03  4   21  34  2   2   34  0.204405869
2   44  34  4   11  34  1   2   34  0.218426408

# When classification was performed, the following error appeared.
rf_model<-randomForest(as.factor(Class)~.,data=data,importance=TRUE,proximity=TRUE)
Error in randomForest.default(m, y, ...) : 
NA/NaN/Inf in foreign function call (arg 1)

# When regression was performed, the same error appeared.
rf_model<-randomForest(Class~.,data=data,importance=TRUE,proximity=TRUE)
Error in randomForest.default(m, y, ...) : 
NA/NaN/Inf in foreign function call (arg 1)

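So check your data very carefully. Note also that the regression call produces a warning: In randomForest.default(m, y, ...) : The response has five or fewer unique values.  Are you sure you want to do regression?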
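One way to carry out that check is to count the non-finite entries in each numeric column. This is a minimal base-R sketch; the data frame `d` is invented for illustration, with one Inf and one NA planted in `V2`.

```r
# Hypothetical data with one Inf and one NA planted in V2
d <- data.frame(
  Class = c(1, 1, 2, 2),
  V1    = c(11, 13, 67, 44),
  V2    = c(Inf, 123, 14, NA)
)

# Number of non-finite entries (NA, NaN, Inf, -Inf) in each numeric column
is_num  <- sapply(d, is.numeric)
bad_cnt <- sapply(d[is_num], function(col) sum(!is.finite(col)))
print(bad_cnt)

# Indices of the rows that contain at least one non-finite value
not_finite <- sapply(d[is_num], function(col) !is.finite(col))
bad_rows   <- which(rowSums(not_finite) > 0)
print(bad_rows)
```

Because `is.finite()` returns FALSE for NA, NaN, Inf and -Inf alike, a single pass covers all the values named in the error message.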

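【Comments】:

【Solution 2】:

This is because one of your variables has more than 32 levels, where levels are the distinct values of a categorical variable (32 was the limit for a categorical predictor in older randomForest releases; newer releases allow more). Remove that variable and try again.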
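To see whether any factor column is near such a limit, the number of levels per column can be listed. This is a sketch with invented data; `max_levels` is an assumed threshold taken from the answer above, and the actual limit depends on your randomForest version.

```r
# Hypothetical data: `id` is a factor with 40 distinct levels
d <- data.frame(
  id    = factor(sprintf("p%02d", 1:40)),
  group = factor(rep(c("a", "b"), 20)),
  x     = seq_len(40)
)

max_levels <- 32  # assumed limit; adjust to your randomForest version

# Number of levels for each factor column (non-factor columns are skipped)
is_fac   <- sapply(d, is.factor)
n_levels <- sapply(d[is_fac], nlevels)
print(n_levels)

# Factor columns that exceed the assumed limit
too_many <- names(n_levels)[n_levels > max_levels]
print(too_many)
```

A factor with one level per row, such as an ID column, usually carries no predictive information and is a natural candidate for removal.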

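【Comments】:

【Solution 3】:

Apart from the obvious case of NA values being present, this error is almost always caused by character-type features in the dataset. To see why, consider what a random forest actually does: it partitions the dataset feature by feature. If one of the features is a character vector, how would you partition on it? You need categories to partition the data, for example how many rows are "male" versus "female".

For numeric features such as age or price, you can create categories by bucketing: greater than some age, less than some price, and so on. You cannot do that with a plain character feature, so such columns need to be factors in the dataset.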
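The difference between a character vector and a factor, and the idea of bucketing a numeric feature, can be illustrated in base R. The values and the threshold of 18 are invented; `cut()` is only a stand-in for the idea, since a real tree picks its split points during training.

```r
# A character vector has no levels, so there are no categories to split on
sex_chr <- c("male", "female", "female", "male")
print(levels(sex_chr))

# The same values as a factor carry an explicit set of categories
sex_fac <- factor(sex_chr)
print(levels(sex_fac))
print(table(sex_fac))

# Bucketing a numeric feature at a threshold (18 is an arbitrary choice)
age <- c(15, 42, 37, 8)
age_bucket <- cut(age, breaks = c(-Inf, 18, Inf),
                  labels = c("under_18", "18_plus"), right = FALSE)
print(age_bucket)
```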

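【Comments】:

【Solution 4】:

In general, there are two main reasons for this error message:

1. The data frame contains character vector columns instead of factors. Simply convert your character columns to factors.

2. The data contains bad values (NA, NaN, Inf); applying random forest to it also produces this error. head() will not necessarily show the offending values. For example: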

library(randomForest)

x = rep(x = sample(c(0,1)), times = 24)

# 44 finite values plus 4 Inf, so that y has the same length as x (48)
y = c(sample.int(n=50,size = 44),Inf,Inf,Inf,Inf)

df = data.frame(col1 = x , col2 = y )

head(df)
  col1 col2
1    1   26
2    0   33
3    1   23
4    0   21
5    1   45
6    0   27

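Applying randomForest to df now produces the same error:

model = randomForest(data = df , col2 ~ col1 , ntree = 10)

Error in randomForest.default(m, y, ...) : NA/NaN/Inf in foreign function call (arg 2)

Solution: identify the bad values in df. is.finite() checks, element by element, whether the input vector holds finite values; it returns FALSE for NA, NaN, Inf and -Inf. For example:

is.finite(c(5,6,1000000,NaN,Inf))
[1]  TRUE  TRUE  TRUE FALSE FALSE

Now let's find the columns of the data frame that contain bad values and count them.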

sum(!is.finite(as.vector(df[,names(df) %in% c("col2")])))
[1] 4
sum(!is.finite(as.vector(df[,names(df) %in% c("col1")])))
[1] 0

Let's drop those records and keep only the good ones:

df1 =df[is.finite(as.vector(df[,names(df) %in% c("col2")])) &
is.finite(as.vector(df[,names(df) %in% c("col1")])) , ]

Then run randomForest again:

model1 = randomForest(data = df1, col2 ~ col1, ntree = 10)
Call:
 randomForest(formula = col2 ~ col1, data = df1, ntree = 10)
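The column-by-column filter above can be generalised so that column names do not have to be listed by hand. This is a sketch only: `keep_finite_rows` is a hypothetical helper name, not part of randomForest or base R, and `df_demo` is invented data.

```r
# Hypothetical helper: keep only the rows whose numeric columns are all finite
keep_finite_rows <- function(d) {
  is_num <- sapply(d, is.numeric)
  if (!any(is_num)) return(d)
  # Combine the per-column finiteness checks with a row-wise AND
  ok <- Reduce(`&`, lapply(d[is_num], is.finite))
  d[ok, , drop = FALSE]
}

df_demo  <- data.frame(col1 = c(1, 0, 1, 0), col2 = c(26, Inf, NaN, 21))
df_clean <- keep_finite_rows(df_demo)
print(df_clean)
```

Dropping rows is the simplest remedy; whether it is appropriate depends on why the non-finite values are there in the first place.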

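【Comments】:

【Solution 5】:

Converting every non-numeric column to a factor avoids this error. I ran into it too; the cause was one particular column that had not been converted to a factor. I applied as.factor to that column explicitly, and my code finally worked.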
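A bulk conversion along these lines might look like the sketch below. It converts only the character columns and leaves numeric ones alone; the data frame `d` and its column names are invented for illustration.

```r
# Hypothetical data with two character columns and one numeric column
d <- data.frame(
  success = c("True", "False", "True"),
  channel = c("web", "app", "web"),
  goal    = c(1500, 3000, 14000),
  stringsAsFactors = FALSE
)

# Convert only the character columns; numeric columns are left untouched
is_chr <- sapply(d, is.character)
d[is_chr] <- lapply(d[is_chr], factor)

str(d)
```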

【Comments】: