Efficient cleaning of missing values in a data frame in R: answers

【Question title】: Efficient cleaning of missing value in dataframe in R
【Posted】: 2018-12-04 17:15:41
【Question】:

require (data.table)
dat1 <- fread('https://archive.ics.uci.edu/ml/machine-learning-databases/primary-tumor/primary-tumor.data',stringsAsFactors=T)

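I want to replace ? and missing values with the most frequent value of each column, and make the columns factor (for RandomForest). I tried to omit the ? rows from dat1$V4: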

> dat2=subset(dat1, dat1$V4!='?')
Error in `[.data.table`(x, r, vars, with = FALSE) : 
  i evaluates to a logical vector length 339 but there are 184 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.

Then, if that succeeds, I would use the following to make all the data frame columns factor:

dat1 <- data.frame(lapply(dat1, as.factor))

Here is the head of dat1:

> head (dat1)
   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
1:  1  1  1  ?  3  2  2  1  2   2   2   2   2   2   2   2   2   2
2:  1  1  1  ?  3  2  2  2  2   2   1   2   2   2   1   2   1   2
3:  1  1  2  2  3  1  2  2  2   2   2   2   2   2   2   2   1   2
4:  1  1  2  ?  3  1  2  1  1   2   2   2   2   2   2   2   1   2
5:  1  1  2  ?  3  1  2  1  1   2   2   2   2   2   2   2   1   2
6:  1  1  2  ?  3  1  2  2  2   2   2   1   2   2   1   1   1   2

And here is str(dat1):

> str (dat1)
Classes ‘data.table’ and 'data.frame':  339 obs. of  18 variables:
 $ V1 : int  1 1 1 1 1 1 1 1 1 1 ...
 $ V2 : int  1 1 1 1 1 1 2 2 2 2 ...
 $ V3 : Factor w/ 3 levels "1","2","?": 1 1 2 2 2 2 1 1 1 1 ...
 $ V4 : Factor w/ 4 levels "1","2","3","?": 4 4 2 4 4 4 1 1 1 1 ...
 $ V5 : Factor w/ 4 levels "1","2","3","?": 3 3 3 3 3 3 1 1 1 2 ...
 $ V6 : int  2 2 1 1 1 1 1 1 2 1 ...
 $ V7 : int  2 2 2 2 2 2 2 2 2 2 ...
 $ V8 : int  1 2 2 1 1 2 2 2 2 2 ...
 $ V9 : int  2 2 2 1 1 2 2 2 2 2 ...
 $ V10: int  2 2 2 2 2 2 2 2 2 2 ...
 $ V11: int  2 1 2 2 2 2 2 2 2 2 ...
 $ V12: int  2 2 2 2 2 1 2 2 2 2 ...
 $ V13: Factor w/ 3 levels "1","2","?": 2 2 2 2 2 2 1 2 2 3 ...
 $ V14: int  2 2 2 2 2 2 1 2 1 1 ...
 $ V15: int  2 1 2 2 2 1 1 2 2 1 ...
 $ V16: Factor w/ 3 levels "1","2","?": 2 2 2 2 2 1 2 2 2 2 ...
 $ V17: int  2 1 1 1 1 1 2 2 2 2 ...
 $ V18: int  2 2 2 2 2 2 2 2 2 2 ...
 - attr(*, ".internal.selfref")=<externalptr>

【Comments】:

Don't use the data frame name inside subset: subset(dat1, V4 != '?').
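
The comment's suggestion can be sketched on a small made-up table. The table `toy` and its values are assumptions for illustration only, not the asker's data. The sketch refers to the column by its bare name inside `subset()`, shows the equivalent `data.table` row filter, and then drops the unused `?` level, which otherwise stays attached to the factor.

```r
library(data.table)

# Hypothetical table: V4 is a factor that uses "?" as a missing-value marker
toy <- data.table(V1 = 1:5, V4 = factor(c("?", "2", "?", "1", "2")))

# Refer to the column by its bare name inside subset()
kept <- subset(toy, V4 != "?")

# Equivalent row filter in data.table syntax
kept2 <- toy[V4 != "?"]

# Filtering rows does not remove the "?" level; drop it explicitly
kept$V4 <- droplevels(kept$V4)
levels(kept$V4)
```

Note that this removes rows rather than imputing them, so it is only a fit when losing those observations is acceptable.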

Tags: r dataframe missing-data data-cleaning

【Solution 1】:

Although it is a bit "hacky", this should get you there. I don't see any NA in your data.frame.

library(dplyr)
library(stringr)

dat1 <- read.table('https://archive.ics.uci.edu/ml/machine-learning-databases/primary-tumor/primary-tumor.data', stringsAsFactors = TRUE, sep = ",")

dat1 <- sapply(dat1, as.character)
temp <- list()

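for (i in seq_len(ncol(dat1))) {
  # Most frequent value in column i, not counting the "?" placeholder itself
  top <- names(sort(table(dat1[dat1[, i] != "?", i]), decreasing = TRUE))[1]
  # Replace "?" with that value and store the column as a factor
  temp[[i]] <- factor(str_replace(dat1[, i], "[?]", top))
}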
dat2 <- bind_cols(temp)
colnames(dat2) <- colnames(dat1)

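【Comments】:

Edited. It should be temp <- list().

【Solution 2】:

The following function replaces all NA and '?' values with the most frequent value of the column. Then just lapply it over the data.frame.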

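mostFreq <- function(x, na = '?'){
  # Flag both real NAs and the placeholder string(s) given in `na`
  i <- is.na(x) | x %in% na
  # Count the remaining values and fill the flagged entries with the most frequent one
  tbl <- table(x[!i])
  x[i] <- names(tbl)[which.max(tbl)]
  # Remove the now-unused '?' level from factors
  if(is.factor(x)) x <- droplevels(x)
  x
}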

# Before    
as.list(dat1[1:20, 1:3])
#$V1
# [1] "1" "?" "2" "?" "2" NA  "?" "?" "2" "?" "?" "?" NA  NA 
#[15] NA  NA  "?" "2" "2" "2"
#
#$V2
# [1] "1" "3" "2" "3" "1" "2" "1" "2" "3" "1" "2" "1" "?" NA 
#[15] "?" "3" "1" NA  "?" "1"
#
#$V3
# [1] "?" "1" "?" "3" "1" NA  NA  "3" "1" "1" "1" "2" NA  NA 
#[15] NA  NA  "?" "?" NA  "2"

# After
lapply(dat1[1:20, 1:3], mostFreq)
#$V1
# [1] "1" "2" "2" "2" "2" "2" "2" "2" "2" "2" "2" "2" "2" "2"
#[15] "2" "2" "2" "2" "2" "2"
#
#$V2
# [1] "1" "3" "2" "3" "1" "2" "1" "2" "3" "1" "2" "1" "1" "1"
#[15] "1" "3" "1" "1" "1" "1"
#
#$V3
# [1] "1" "1" "1" "3" "1" "1" "1" "3" "1" "1" "1" "2" "1" "1"
#[15] "1" "1" "1" "1" "1" "2"

And to change the whole data frame:

dat1[] <- lapply(dat1, mostFreq)

And to coerce the columns to class factor:

dat1[] <- lapply(dat1, factor)
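
As a quick sanity check of the two steps above, here is a minimal sketch on a tiny made-up data frame. The data frame `toy` is an assumption for illustration, and `mostFreq` is repeated only so the block runs on its own. It confirms that no NA or '?' is left and that every column ends up a factor.

```r
# mostFreq as defined in this answer, repeated so the sketch is self-contained
mostFreq <- function(x, na = '?'){
  i <- is.na(x) | x %in% na
  tbl <- table(x[!i])
  x[i] <- names(tbl)[which.max(tbl)]
  if(is.factor(x)) x <- droplevels(x)
  x
}

# Hypothetical data: one character column with '?' and NA, one integer column with NA
toy <- data.frame(a = c("1", "?", "2", "2", NA),
                  b = c(1L, 2L, 2L, NA, 2L),
                  stringsAsFactors = FALSE)

toy[] <- lapply(toy, mostFreq)   # impute column by column, keeping the data.frame shape
toy[] <- lapply(toy, factor)     # then coerce every column to factor

# Checks: no NA, no '?', all columns are factors
anyNA(toy)
any(sapply(toy, function(col) "?" %in% as.character(col)))
sapply(toy, is.factor)
```

One thing the sketch makes visible: because `names(tbl)` is a character vector, the assignment turns the integer column into character before it becomes a factor. When two values are tied for most frequent, `which.max` picks the first one in the table's sorted order.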

Edit.

The function above can be simplified by setting na.strings = '?' when reading the data.

dat1 <- fread(<URI>, na.strings = '?', <other args>)

Then use the function below in place of the original mostFreq.

mostFreq2 <- function(x){
  tbl <- table(x, useNA = "no")
  x[is.na(x)] <- names(tbl)[which.max(tbl)]
  x
}
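
The simplified route can be sketched end to end without a download. This sketch assumes base `read.csv` with inline text in place of `fread` on the real URL (both accept an `na.strings` argument), and the CSV content is made up for illustration. `mostFreq2` is repeated so the block runs on its own.

```r
# mostFreq2 as defined in this answer, repeated so the sketch is self-contained
mostFreq2 <- function(x){
  tbl <- table(x, useNA = "no")
  x[is.na(x)] <- names(tbl)[which.max(tbl)]
  x
}

# Hypothetical inline data; '?' marks a missing value
csv_text <- "V1,V2\n1,a\n?,b\n2,b\n2,?\n"

# na.strings = '?' turns every '?' into NA at read time
toy <- read.csv(text = csv_text, na.strings = "?", stringsAsFactors = FALSE)

toy[] <- lapply(toy, mostFreq2)  # fill NAs with each column's most frequent value
toy[] <- lapply(toy, factor)     # coerce to factor afterwards
str(toy)
```

Reading '?' as NA up front means the cleaning function only has to deal with one kind of missing value, and no stray '?' factor level is ever created.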

Test data.

Since you have not posted a sample dataset, I will create one similar to what the question describes.

set.seed(1234)    # Make the results reproducible
n <- 300
x <- replicate(6, sample(c(NA, '?', 1:2), n, TRUE))
y <- replicate(6, sample(c(NA, '?', 1:3), n, TRUE))
dat1 <- cbind.data.frame(x, y, stringsAsFactors = FALSE)
dat1 <- dat1[, sample(ncol(dat1))]
names(dat1) <- paste0('V', 1:12)
str(dat1)

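【Comments】:

If you use the na.strings argument of fread, you can simplify this. Then there is no need for i or droplevels.
@Moody_Mudskipper You are right, I assumed the OP wanted to clean the data and completely forgot about na.strings. Thanks. As for stringsAsFactors, it is in the data creation code.
Thanks @RuiBarradas. After defining mostFreq, I used dat2
@Avi What do you mean by MANY?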