Multiple imputation in R using "missForest" on categorical variables

[Question title]: Multiple imputation in R using "missForest" on categorical variables
[Posted]: 2019-06-17 08:44:36
[Question]:

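I have a survey dataset with NAs in several columns. I therefore decided to impute the missing values by performing multiple imputation with the "missForest" package. That ran without problems, but after inspecting my data I noticed that many of the imputed values are numeric, with decimal values in columns that were previously factors.

I assume that missForest requires the columns to be numeric (it requires a data.matrix of x) in order to perform the imputation.

The NRMSE is pretty good, and the means of the columns with imputed values are similar to those of the columns with NAs.

I plan to use the dataset with the imputed values for a multilevel linear regression, and I would have converted the factor columns to numeric anyway.

Will these numeric values with decimal places cause a problem?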

finalmatrix <- data.matrix(final)
set.seed(666)
impforest <- missForest(finalmatrix, variablewise = TRUE,
                        parallelize = "forests")

[Comments]:

Tags: r random-forest categorical-data survey imputation

[Solution 1]:

I don't know your data or your full code, but missForest is definitely able to handle mixed-type data (and it does not convert the types automatically).
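A likely source of the decimals is the `data.matrix()` call in the question rather than missForest itself. The following minimal sketch uses only base R and a made-up data frame (the name `survey` and its columns are hypothetical) to show that `data.matrix()` replaces factors with their internal integer codes, so any imputation run on the result treats them as numbers:

```r
# Hypothetical survey-like data frame; names and values are illustrative only.
survey <- data.frame(
  age    = c(23, 35, NA, 51),
  gender = factor(c("f", "m", NA, "f")),
  region = factor(c("north", NA, "south", "north"))
)

# In the data frame, gender and region are still factors.
str(survey)

# data.matrix() replaces each factor by its internal integer codes,
# so the result is a purely numeric matrix with no factor information left.
m <- data.matrix(survey)
is.numeric(m)
m[, "gender"]
```

Once the factor information is gone, a regression forest can predict any value between the codes, which would explain decimals in formerly categorical columns.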

Here is an example from the missForest manual:

## Nonparametric missing value imputation on mixed-type data:
## Take a look: iris definitely has a variable that is a factor
library(missForest)
data(iris)
summary(iris)

## The data contains four continuous and one categorical variable.
## Artificially produce missing values using the 'prodNA' function:
set.seed(81)
iris.mis <- prodNA(iris, noNA = 0.2)
summary(iris.mis)

## Impute missing values providing the complete matrix for
## illustration. Use 'verbose' to see what happens between iterations:
iris.imp <- missForest(iris.mis, xtrue = iris, verbose = TRUE)


## Here are the final results
iris.imp

## As can be seen here, it still has the factor column
str(iris.imp$ximp)
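Carried over to the question's call, the same idea might look like the sketch below. It assumes the missForest package is installed and uses `iris` with artificial NAs as a stand-in for the survey data, since the real data are not shown. The `parallelize` argument is left at its default here, because the parallel modes need a registered foreach backend.

```r
library(missForest)

# Stand-in for the survey data (assumption: the real data are not shown).
set.seed(666)
final <- prodNA(iris, noNA = 0.1)

# Pass the data frame itself, not data.matrix(final), so factors stay factors.
impforest <- missForest(final, variablewise = TRUE)

# The imputed column is still a factor with the original levels.
is.factor(impforest$ximp$Species)
levels(impforest$ximp$Species)

# With variablewise = TRUE, one out-of-bag error is reported per column.
impforest$OOBerror
```

If the factors are really needed as numbers for the later regression, they can be converted after the imputation, so the imputed entries remain valid categories.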

[Comments]: