【Posted】: 2014-10-04 03:53:44
【Question】:
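I am not sure how to write a function that replaces NA values in a set of categorical vectors.
Consider the following problem: I have a categorical vector that contains NA values, and I want to replace each NA with a value drawn according to the proportions of the observed (non-NA) data.
For example,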
a<-factor(c("yes","no","no","yes","yes","yes","no","yes","yes","yes","yes","yes",NA, NA))
I wrote the following code:
a[is.na(a)]<-sample(c("yes","no"),sum(is.na(a)),replace=TRUE,
prob=c(sum(na.omit(a=="yes"))/sum(!is.na(a)),sum(na.omit(a=="no"))/sum(!is.na(a))))
## replace NA with yes or no according to the proportion of yes/no in the non-NA data
The code above works fine, but now I have a data frame with many categorical variables. For example:
a<-c("yes","no","no","yes","yes","yes","no","yes","yes","yes","yes","yes",NA, NA)
b<-c("red","blue","white","red","blue","red","blue","red","blue","red","blue",NA,NA,NA)
c<-c(1,3,2,1,2,3,1,2,3,1,2,3,NA,NA)
a<-as.factor(a) ## ensure the vectors are treated as categorical variable
b<-as.factor(b)
c<-as.factor(c)
df<-data.frame(a=a,b=b,c=c)
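I am struggling to write a function that replaces the NA values in every categorical variable of such a data frame. Note that each variable may have more than two categories.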
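One possible way to generalize the single-vector code is sketched below. It is only an illustration under stated assumptions: the helper name fill_na_by_proportion is hypothetical, every column is assumed to be a factor with at least one observed value, and the draws are random, so results differ between runs unless a seed is set.

```r
# Hypothetical helper: fill the NAs of one factor by sampling from its levels,
# weighted by how often each level occurs among the non-NA values.
fill_na_by_proportion <- function(x) {
  missing <- is.na(x)
  # Leave non-factors, complete columns, and all-NA columns untouched
  if (!is.factor(x) || !any(missing) || all(missing)) return(x)
  counts <- table(x)  # table() ignores NA by default; works for any number of levels
  x[missing] <- sample(names(counts), sum(missing), replace = TRUE,
                       prob = as.vector(counts) / sum(counts))
  x
}

# Same data as in the question
df <- data.frame(
  a = factor(c("yes","no","no","yes","yes","yes","no","yes","yes","yes","yes","yes",NA,NA)),
  b = factor(c("red","blue","white","red","blue","red","blue","red","blue","red","blue",NA,NA,NA)),
  c = factor(c(1,3,2,1,2,3,1,2,3,1,2,3,NA,NA))
)

set.seed(1)  # assumption: only here to make the random draws reproducible
df[] <- lapply(df, fill_na_by_proportion)  # df[] keeps the data frame structure
```

Using table() instead of spelling out each level is what lets the same helper handle columns with two, three, or more categories. Levels that never occur get probability zero and are therefore never drawn.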
【Comments】:
- For columns b and c, what do you want prob to be?
- So for b it is prob=c(sum(na.omit(b=="red"))/sum(!is.na(a)),sum(na.omit(b=="blue"))/sum(!is.na(a)),sum(na.omit(b=="white"))/sum(!is.na(a))) and for c it is prob=c(sum(na.omit(c==1))/sum(!is.na(a)),sum(na.omit(c==2))/sum(!is.na(a)),sum(na.omit(c==3))/sum(!is.na(a)))
- Why is is.na(a) used to compute the probabilities for column b?
- So b would be prob=c(sum(na.omit(b=="red"))/sum(!is.na(b)),sum(na.omit(b=="blue"))/sum(!is.na(b)),sum(na.omit(b=="white"))/sum(!is.na(b))) and c would be prob=c(sum(na.omit(c==1))/sum(!is.na(c)),sum(na.omit(c==2))/sum(!is.na(c)),sum(na.omit(c==3))/sum(!is.na(c)))
- This is equivalent to David Arenburg's prob function
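As a small check of the per-column proportions discussed in the comments, the sketch below compares the hand-written prob for b (with the denominator taken from b itself) against prop.table(table(b)). The variable names manual and auto are illustrative only and do not come from the thread.

```r
b <- factor(c("red","blue","white","red","blue","red","blue","red","blue","red","blue",NA,NA,NA))

# Hand-written proportions, one term per level, denominator from b itself
manual <- c(red   = sum(na.omit(b == "red"))   / sum(!is.na(b)),
            blue  = sum(na.omit(b == "blue"))  / sum(!is.na(b)),
            white = sum(na.omit(b == "white")) / sum(!is.na(b)))

# The same proportions in one call; table() drops NA by default
auto <- prop.table(table(b))

# Compare after putting both in the same (level) order
same <- isTRUE(all.equal(unname(manual[names(auto)]), as.vector(auto)))
print(same)
```

The one-call form does not need the level names typed out, so it carries over unchanged to column c or to any other factor.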