Looping through values of a variable using randomForest in R

【Question title】: Looping through values of a variable using randomForest in R
【Posted】: 2018-05-06 02:32:22
【Question description】:

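I have been trying to run a randomForest model for different values. I am used to the "foreach" command in Stata, but R seems to work differently.

I have searched for quite a while without success for something that is (I think) very simple. Here is what I am trying to do:

I am running the following randomForest model: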

modelRandom = randomForest(y~a+b+c+d+e, data=dataframe, mtry=4, ntree=30)

After that, I want to predict the probabilities for each observation, like this:

Prob<-predict(modelRandom, dataframe, type = 'prob')

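Now the problem: I want to loop through the values of one variable (b) in the randomForest model and predict the probabilities for each value.

The variable b takes twelve different values (1:12). I want R to set b to 1 for every observation and predict the probabilities, then set b to 2 for every observation and predict again, then 3, 4, 5, and so on.

All of these probabilities should then go into one table, with the corresponding variable c next to them, like this: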

C prob1 prob2 prob3 prob4 prob5 prob6 prob7 prob8 prob9 prob10 prob11 prob12

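I want C in there, otherwise I cannot tell which observation the probabilities belong to.

I came up with this, but I think I am still far from what I want: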

for(b in dataframe){
prob[b]<-predict(modelRandom, dataframe, type = 'prob')
}

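Here is some more information about the dataset. I have masked part of it, because it contains customer information that I obviously cannot share.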

structure(list(X = c("NVT", "NVT", "NVT", "NVT", "NVT", 
"NVT"), a = structure(c(1L, 2L, 1L, 1L, 2L, 2L), .Label = c("0", 
"1"), class = "factor"), d= structure(c(2L, 2L, 1L, 1L, 1L, 2L), .Label = c("Dhr.", 
"Mevr."), class = "factor"), c = c("3331GE", "2285EH", 
"9401GE", "5591DZ", "2611CE", "1359KB"), b = structure(c(12L, 
12L, 12L, 12L, 12L, 12L), .Label = c("1", "2", "3", "4", "5", 
"6", "7", "8", "9", "10", "11", "12"), class = "factor"), e = structure(c(5L, 
6L, 5L, 5L, 5L, 5L), .Label = c("1", "2", "3", "4", "5", "6", 
"7", "8"), class = "factor"), .Names = c("X", "a", "d", "c", "b", "e"), row.names = c(NA, 
6L), class = "data.frame")

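Thanks!

【Question comments】:

Please provide sample data. Use dput(head(dataframe)) and copy the console output into your question.
Added it for you.
I assume you want the probability of X taking the value "1" in your table?
If you mean the X that contains "NVT", then no. X should not be in the table. The final table should contain the (12) probabilities for each observation.

Tags: r loops for-loop foreach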

【Solution 1】:

Here is an example with a larger data set, because the data you provided cannot be used to build a model:

First, simulate some data:

r_data <- data.frame(y = as.factor(sample(0:1, 100, replace =T)), 
                     matrix(rnorm(1000), 100),
                     b = sample(1:12, 100, replace = T))

Extract the row names:

names_rows <- rownames(r_data)

Here y is a binary outcome,
X1 to X10 are 10 numeric features,
and b takes values from 1 to 12.

Build the model:

library(randomForest)
modelRandom <- randomForest(y~., data = r_data, mtry = 4, ntree = 30)

Generate new data for prediction by replicating the numeric features 12 times and attaching every value of b (1:12) to each copy:

n_row <- nrow(r_data)

newdata <- data.frame(r_data[rep(1:n_row, 12), 2:11], b = rep(1:12, each =  n_row))
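In the question's data, b is stored as a factor with levels "1" to "12", whereas the simulated b here is an integer. As far as I know, predict() on a randomForest model expects each predictor in the new data to have the same type as in the training data, so with a factor b the stacked copy probably needs to be built as a factor with the same levels. A minimal sketch of that step, using a hypothetical data frame `df` as a stand-in for the asker's data (the names `df`, `b_levels` and `newdata_f` are illustrative only):

```r
# Hypothetical stand-in for the asker's data: b is a factor with levels "1".."12"
set.seed(1)
df <- data.frame(
  a = factor(sample(0:1, 6, replace = TRUE), levels = 0:1),
  b = factor(rep(12, 6), levels = 1:12),
  c = c("3331GE", "2285EH", "9401GE", "5591DZ", "2611CE", "1359KB"),
  stringsAsFactors = FALSE
)

n_row <- nrow(df)
b_levels <- levels(df$b)

# Stack the data once per level of b ...
newdata_f <- df[rep(seq_len(n_row), times = length(b_levels)), ]
# ... then overwrite b with each level in turn, keeping the original level set
newdata_f$b <- factor(rep(b_levels, each = n_row), levels = b_levels)
rownames(newdata_f) <- NULL
```

Because column c is carried along in the stacked copy, each predicted probability can later be matched back to its observation.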

Get predictions on the new data and cbind the b and row-name columns from above:

preds <- data.frame(predict(modelRandom, newdata, type = 'prob'),
                    b = rep(1:12, each = n_row),
                    names_rows = as.numeric(rep(names_rows, times = 12)))

Clean up into the desired output:

library(tidyverse)

preds %>%
  select(X1, b, names_rows) %>% #select only prob for outcome 1 and the b column
  group_by(b)  %>%
  mutate(z = 1 :  n_row) %>% #generate unique row identifier 
  spread(b, X1) %>% #convert to wide format
  select(-z) #remove unique row identifier 
    #output:

# A tibble: 100 x 13
   names_rows        `1`        `2`        `3`        `4`       `5`        `6`
 *      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>     <dbl>      <dbl>
 1          1 0.30000000 0.30000000 0.30000000 0.30000000 0.3000000 0.30000000
 2          2 0.70000000 0.70000000 0.73333333 0.73333333 0.7000000 0.70000000
 3          3 0.23333333 0.23333333 0.23333333 0.23333333 0.2000000 0.20000000
 4          4 0.33333333 0.30000000 0.26666667 0.26666667 0.3000000 0.26666667
 5          5 0.30000000 0.30000000 0.33333333 0.30000000 0.3000000 0.26666667
 6          6 0.23333333 0.20000000 0.16666667 0.16666667 0.2000000 0.16666667
 7          7 0.06666667 0.06666667 0.06666667 0.06666667 0.1000000 0.06666667
 8          8 0.26666667 0.23333333 0.20000000 0.20000000 0.1666667 0.16666667
 9          9 0.20000000 0.20000000 0.16666667 0.10000000 0.1000000 0.10000000
10         10 0.83333333 0.83333333 0.90000000 0.83333333 0.8333333 0.86666667
# ... with 90 more rows, and 6 more variables: `7` <dbl>, `8` <dbl>, `9` <dbl>,
#   `10` <dbl>, `11` <dbl>, `12` <dbl>
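In current versions of tidyr, spread() is superseded by pivot_wider(). Since names_rows already identifies each observation within every value of b, the helper column z should not be needed with that function. A minimal sketch on a small hypothetical long-format table `preds_demo` with the same columns as preds (the values are random placeholders, not model output):

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format predictions with the same columns as preds
set.seed(1)
n_obs <- 5
preds_demo <- data.frame(
  X1 = round(runif(n_obs * 12), 2),            # placeholder probabilities
  b = rep(1:12, each = n_obs),                 # value of b used for the prediction
  names_rows = rep(seq_len(n_obs), times = 12) # observation identifier
)

# One row per observation, one column per value of b (prob1 .. prob12)
wide <- preds_demo %>%
  select(names_rows, b, X1) %>%
  pivot_wider(names_from = b, values_from = X1, names_prefix = "prob")
```

The names_prefix argument gives column names in the prob1 ... prob12 form asked for in the question.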

To save the result of the spread() pipeline in an object:

preds %>%
  select(X1, b, names_rows) %>%
  group_by(b)  %>%
  mutate(z = 1 :  n_row) %>%
  spread(b, X1) %>% 
  select(-z) -> saved_object
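The same table can also be built with an explicit loop, which is closer to the foreach idea in the question and needs no tidyverse functions. A minimal sketch, assuming class "1" is the outcome of interest; the names `sim`, `fit` and `prob_wide` are hypothetical, and the data are simulated in the same way as above:

```r
library(randomForest)

set.seed(42)
# Simulated data: binary outcome y, numeric features X1..X10, integer b in 1:12
sim <- data.frame(y = as.factor(sample(0:1, 100, replace = TRUE)),
                  matrix(rnorm(1000), 100),
                  b = sample(1:12, 100, replace = TRUE))
fit <- randomForest(y ~ ., data = sim, mtry = 4, ntree = 30)

# Start with a row identifier, then add one probability column per value of b
prob_wide <- data.frame(row_id = rownames(sim), stringsAsFactors = FALSE)
for (v in 1:12) {
  tmp <- sim
  tmp$b <- v                               # set b to v for every observation
  p <- predict(fit, tmp, type = "prob")    # matrix with one column per class
  prob_wide[[paste0("prob", v)]] <- unname(p[, "1"])
}
head(prob_wide)
```

prob_wide is an ordinary data frame, so it can be exported with, for example, write.csv(prob_wide, "probs.csv", row.names = FALSE), where the file name is only a placeholder. In the asker's data the identifier column could be c instead of the row names.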

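【Comments】:

Everything went well until the last part. The cleanup did not go so smoothly. The first problem is that I get: Error in preds %>% select(X1, ti) %>% group_by(ti) %>% mutate(z = 1:n_row) %>% : could not find function "%>%". I removed %>% and ran it, and it worked, but the output goes straight to my console. Is there a way to put it in a data table that I can export? Finally, there is no unique identifier linking each observation to the initial "dataframe" data. I would like a row identifier that is also in the dataframe data.
@Olli Sagi The %>% function comes from library(tidyverse). Install and load it. It has a powerful set of functions for handling all kinds of data. The data is arranged in the same order as in the initial data.frame. I will provide additional code in an edit.
OK, I think I have it now. The only problem I have is that the "b" you use in your example is a separate data set. In my environment it is a column inside r_data/newdata. So when I try to run the last part, it says: Error in group_by(b): object "b" not found.
@Olli Sagi Please check the preds <- data.frame.... line; b is defined as a column there.
I still have a problem; see this link: stackoverflow.com/questions/47512830/…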