[Posted]: 2018-12-04 17:15:41
[Question]:
library(data.table)
dat1 <- fread('https://archive.ics.uci.edu/ml/machine-learning-databases/primary-tumor/primary-tumor.data', stringsAsFactors = TRUE)
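I want to replace ? and missing values in each column with that column's most frequent value, and make every column a factor (for RandomForest).
I tried to drop the ? entries from dat1$V4: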
> dat2=subset(dat1, dat1$V4!='?')
Error in `[.data.table`(x, r, vars, with = FALSE) :
i evaluates to a logical vector length 339 but there are 184 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.
Once that works, I plan to use this to make all data frame columns factors:
dat1 <- data.frame(lapply(dat1, as.factor))
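The replacement described above (swap ? and NA for each column's most frequent value, then convert to factor) could be sketched roughly as follows. This is only a minimal sketch on a small made-up data frame, not the real dataset: toy, toy_clean and the helper impute_mode are hypothetical names introduced for illustration, and ties for the most frequent value are simply resolved by taking the first one.

```r
# Toy stand-in for dat1 (hypothetical values); "?" marks a missing entry
toy <- data.frame(
  V1 = c(1L, 1L, 2L, NA, 1L),
  V3 = c("1", "?", "2", "2", "2"),
  V4 = c("?", "?", "2", "1", "2"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: replace "?" and NA with the column's most frequent
# value, then return the column as a factor
impute_mode <- function(x) {
  x <- as.character(x)        # work on the labels, not the factor codes
  x[x %in% "?"] <- NA         # treat "?" as missing
  counts <- table(x)          # table() ignores NA by default
  mode_value <- names(counts)[which.max(counts)]  # ties: first one wins
  x[is.na(x)] <- mode_value   # assumes at least one non-missing value
  factor(x)
}

# Apply the helper to every column and rebuild the data frame
toy_clean <- data.frame(lapply(toy, impute_mode))
str(toy_clean)
```

On the real data, the same lapply() call would be applied to dat1 instead of toy. Passing na.strings = "?" to fread() would also turn the ? entries into NA at import time, so only the NA case would remain to be handled.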
Here is the head of dat1:
> head (dat1)
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
1: 1 1 1 ? 3 2 2 1 2 2 2 2 2 2 2 2 2 2
2: 1 1 1 ? 3 2 2 2 2 2 1 2 2 2 1 2 1 2
3: 1 1 2 2 3 1 2 2 2 2 2 2 2 2 2 2 1 2
4: 1 1 2 ? 3 1 2 1 1 2 2 2 2 2 2 2 1 2
5: 1 1 2 ? 3 1 2 1 1 2 2 2 2 2 2 2 1 2
6: 1 1 2 ? 3 1 2 2 2 2 2 1 2 2 1 1 1 2
Here is str(dat1):
> str (dat1)
Classes ‘data.table’ and 'data.frame': 339 obs. of 18 variables:
$ V1 : int 1 1 1 1 1 1 1 1 1 1 ...
$ V2 : int 1 1 1 1 1 1 2 2 2 2 ...
$ V3 : Factor w/ 3 levels "1","2","?": 1 1 2 2 2 2 1 1 1 1 ...
$ V4 : Factor w/ 4 levels "1","2","3","?": 4 4 2 4 4 4 1 1 1 1 ...
$ V5 : Factor w/ 4 levels "1","2","3","?": 3 3 3 3 3 3 1 1 1 2 ...
$ V6 : int 2 2 1 1 1 1 1 1 2 1 ...
$ V7 : int 2 2 2 2 2 2 2 2 2 2 ...
$ V8 : int 1 2 2 1 1 2 2 2 2 2 ...
$ V9 : int 2 2 2 1 1 2 2 2 2 2 ...
$ V10: int 2 2 2 2 2 2 2 2 2 2 ...
$ V11: int 2 1 2 2 2 2 2 2 2 2 ...
$ V12: int 2 2 2 2 2 1 2 2 2 2 ...
$ V13: Factor w/ 3 levels "1","2","?": 2 2 2 2 2 2 1 2 2 3 ...
$ V14: int 2 2 2 2 2 2 1 2 1 1 ...
$ V15: int 2 1 2 2 2 1 1 2 2 1 ...
$ V16: Factor w/ 3 levels "1","2","?": 2 2 2 2 2 1 2 2 2 2 ...
$ V17: int 2 1 1 1 1 1 2 2 2 2 ...
$ V18: int 2 2 2 2 2 2 2 2 2 2 ...
- attr(*, ".internal.selfref")=<externalptr>
[Comments]:
- Don't use the data frame name inside the condition: subset(dat1, V4 != '?')
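To illustrate the suggestion, here is a small sketch on a made-up data.table (toy, kept and kept_dt are hypothetical names, and the values are not from the real dataset). It filters by the bare column name, shows the equivalent filter in data.table's own syntax, and notes that the ? level stays attached to the factor until it is dropped.

```r
library(data.table)

# Toy stand-in for dat1 (hypothetical values)
toy <- data.table(
  V3 = factor(c("1", "1", "2", "2")),
  V4 = factor(c("?", "2", "?", "1"))
)

# Refer to the column by its bare name; it is looked up inside toy
kept <- subset(toy, V4 != "?")

# Equivalent filter written in data.table's own syntax
kept_dt <- toy[V4 != "?"]

# Filtering removes the rows but keeps "?" as an unused factor level;
# droplevels() removes it
levels(kept$V4)
levels(droplevels(kept$V4))
```

Note that this drops whole rows, which is a different outcome from replacing ? with the most frequent value, so it only fits if losing those rows is acceptable.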
Tags: r dataframe missing-data data-cleaning