【Question title】: Replace NA with mode from categorical dataset R
【Posted】: 2014-09-20 04:00:36
【Question】:

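I have a dataset of 10 categorical variables that contain NA observations. I want to replace the NA values in each column with the mode. I made a histogram of each variable to see the density of each category and find the mode, so I already know which value should replace the NAs in each column.

I saw a related post, but I already know which values to substitute. Here is the link: Replace mean or mode for missing values in R

Here is a reproducible dataset: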

> #Create data with missing values
> set.seed(1)
> dat <- data.frame(x=sample(letters[1:3],20,TRUE), y=rnorm(20), 
                                                  stringsAsFactors=FALSE)
> dat[c(5,10,15),1] <- NA

Here is an example:

> #The head of the first five observations
> head(SmallStoredf, n=5)

    Age Gender HouseholdIncome MaritalStatus PresenceofChildren HomeOwnerStatus HomeMarketValue
1  <NA>   Male            <NA>          <NA>               <NA>            <NA>            <NA>
2 45-54 Female            <NA>          <NA>               <NA>            <NA>            <NA>
5 45-54 Female        75k-100k       Married                Yes             Own       150k-200k
6 25-34   Male        75k-100k       Married                 No             Own       300k-350k
7 35-44 Female       125k-150k       Married                Yes             Own       250k-300k
  Occupation             Education LengthofResidence
1       <NA>                  <NA>              <NA>
2       <NA>                  <NA>              <NA>
5       <NA> Completed High School           9 Years
6       <NA> Completed High School       11-15 years
7       <NA> Completed High School           2 Years  

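In this example, I want to replace the NAs in HomeOwnerStatus with Own, in HomeMarketValue with 350K-500K, and in Occupation with Professional.

Edit: I tried plugging the values in, but got warnings for three of the columns.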

> replacementVals <- c(Age = "45-54", Gender = "Male", HouseholdIncome = "50K-75K", 
+                      MaritalStatus = "Single", PresenceofChildren = "No",
+                      HomeOwnerStatus = "Own", HomeMarketValue = "350K-500K",
+                      Occupation = "Professional", Education = "Completed High School",
+                      LengthofResidence = "11-15yrs")
> indx1 <- replacementVals[col(df2)][is.na(df2[,names(replacementVals)])]
> df2[is.na(df2[,names(replacementVals)])]  <- indx1
#Warning messages:
#1: In `[<-.factor`(`*tmp*`, thisvar, value = c("50K-75K", "50K-75K",  :
  invalid factor level, NA generated
#2: In `[<-.factor`(`*tmp*`, thisvar, value = c("350K-500K", "350K-500K",  :
  invalid factor level, NA generated
#3: In `[<-.factor`(`*tmp*`, thisvar, value = c("11-15yrs", "11-15yrs",  :
  invalid factor level, NA generated

Here is the output:

> head(SmallStoredf)

    Age Gender HouseholdIncome MaritalStatus PresenceofChildren HomeOwnerStatus HomeMarketValue
1 45-54   Male            <NA>        Single                 No             Own            <NA>
2 45-54 Female            <NA>        Single                 No             Own            <NA>
5 45-54 Female        75k-100k       Married                Yes             Own       150k-200k
6 25-34   Male        75k-100k       Married                 No             Own       300k-350k
7 35-44 Female       125k-150k       Married                Yes             Own       250k-300k
8 55-64   Male        75k-100k       Married                 No             Own       150k-200k
    Occupation             Education LengthofResidence
1 Professional Completed High School              <NA>
2 Professional Completed High School              <NA>
5 Professional Completed High School           9 Years
6 Professional Completed High School       11-15 years
7 Professional Completed High School           2 Years
8 Professional Completed High School       16-19 years

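The NA values were replaced in only some of the columns.

【Comments】:

  • When two categories in a variable share the same maximum count, how do you want the replacement to be chosen?
  • @Scott Davis I guess you need to change the factor class to character class. It is best to read the file with the option stringsAsFactors=FALSE. I was able to reproduce your error when the columns are factors. So, if you have already read the file in, change them to character columns: SmallStoredf[] <- lapply(SmallStoredf, as.character)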
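
The three warnings in the question are consistent with this comment: a factor only accepts values that are already among its levels. The replacement strings in the question also differ in case or spelling from the values printed (for example "11-15yrs" versus "11-15 years"), which may be why exactly those three columns fail. A minimal sketch on a made-up column (the values are hypothetical, not the OP's data) shows the failure and two possible ways around it:

```r
# Hypothetical toy column; the OP's real data were not posted.
income <- factor(c("75k-100k", NA, "125k-150k"))

# Assigning a value that is not an existing level leaves the NA in place.
# R emits "invalid factor level, NA generated"; it is suppressed here only
# to keep the demo output quiet.
bad <- income
suppressWarnings(bad[is.na(bad)] <- "50K-75K")
anyNA(bad)   # still TRUE

# Option 1: convert to character, then replace.
chr <- as.character(income)
chr[is.na(chr)] <- "50K-75K"

# Option 2: keep the factor, but register the new level first.
fct <- income
levels(fct) <- c(levels(fct), "50K-75K")
fct[is.na(fct)] <- "50K-75K"
```

Option 1 matches the suggestion in the comment; option 2 is an alternative if the columns need to stay factors.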

Tags: r missing-data categorical-data


【Solution 1】:

I modified your reproducible example slightly; here is the setup

> #Create data with missing values
> set.seed(1)
> dat <- data.frame(x=sample(letters[1:3],20,TRUE), y=rnorm(20), 
                                              stringsAsFactors=FALSE)
> dat[c(5,10,15),1] <- NA
> dat[6,2]<-NA

#output
#     x                        y
#1     a  1.511781168450847978590
#2     b  0.389843236411431093291
#3     b -0.621240580541803755210
#4     c -2.214699887177499881830
#5  <NA>  1.124930918143108193874
#6     c                       NA
#7     c -0.016190263098946087311
#8     b  0.943836210685299215051
#9     b  0.821221195098088552200
#10 <NA>  0.593901321217508826322
#11    a  0.918977371608218240873
#12    a  0.782136300731067102276
#13    c  0.074564983365190601328
#14    b -1.989351695863372793127
#15 <NA>  0.619825747894710232799
#16    b -0.056128739529000784558
#17    c -0.155795506705329295238
#18    c -1.470752383899274429169
#19    b -0.478150055108620353206
#20    c  0.417941560199702411005

Now define your replacement values, labeled by the column whose NAs you want replaced

replacementVals<-c(x="Xreplace", y="Yreplace")

The next call is meant to replace them all in one go

dat[is.na(dat[,names(replacementVals)])]<-replacementVals

#          x                   y
#1         a    1.51178116845085
#2         b   0.389843236411431
#3         b  -0.621240580541804
#4         c    -2.2146998871775
#5  Xreplace    1.12493091814311
#6         c            Yreplace
#7         c -0.0161902630989461
#8         b   0.943836210685299
#9         b   0.821221195098089
#10 Yreplace   0.593901321217509
#11        a   0.918977371608218
#12        a   0.782136300731067
#13        c  0.0745649833651906
#14        b   -1.98935169586337
#15 Xreplace    0.61982574789471
#16        b -0.0561287395290008
#17        c  -0.155795506705329
#18        c   -1.47075238389927
#19        b   -0.47815005510862
#20        c   0.417941560199702

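But as akrun pointed out and subsequently resolved, this does not map well onto your second data frame example. What follows is taken directly from the comments they made (so either way, they should get the accepted-answer check on this question).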
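
One reason the one-line call above misplaces values (note row 10 of x in the output) is that, with a logical-matrix index, the NA cells are filled in column-major order and the replacement vector is simply recycled over them; nothing is matched by column name. The sketch below uses a made-up frame (`toy` and `vals` are names assumed for illustration) to reproduce the effect and to show the `col()`-based lookup that avoids it:

```r
# Made-up data: three NAs in x, one NA in y.
toy <- data.frame(x = c("a", NA, NA, NA),
                  y = c(NA, "b", "c", "d"),
                  stringsAsFactors = FALSE)
vals <- c(x = "Xreplace", y = "Yreplace")

# Naive: vals is recycled down the NA cells in column-major order,
# so the second NA in x receives the y value.
naive <- toy
naive[is.na(naive)] <- vals
naive$x   # "a" "Xreplace" "Yreplace" "Xreplace"

# Column-aware: look up each NA cell's replacement by its column index.
fixed <- toy
na_cells <- is.na(fixed)
fixed[na_cells] <- vals[col(fixed)][na_cells]
```

This lookup assumes `vals` has one entry per column, in the same order as the columns of the data frame.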

Here is the setup; I will skip the printing, except for the final result.

HomeOwnerStatus = c(NA,NA,NA ,"Rent", "Rent" ) 
HomeMarketValue = c(NA,NA,NA, "350k", "350k") 
Occupation = c(NA,NA,NA, NA, NA) 
SmallStoredf<-data.frame(HomeOwnerStatus,HomeMarketValue,Occupation, stringsAsFactors=FALSE)

replacementVals<-c("HomeOwnerStatus" = "Own", "HomeMarketValue"="350k", "Occupation"="Professional")

Then in two steps (they could be combined into one very long line)

#get the values that we will be replacing
indx1<-replacementVals[col(SmallStoredf)][is.na(SmallStoredf[, names(replacementVals)])]

#do the replacement
SmallStoredf[is.na(SmallStoredf[,names(replacementVals)])] <-indx1

#  HomeOwnerStatus HomeMarketValue   Occupation
#1             Own            350k Professional
#2             Own            350k Professional
#3             Own            350k Professional
#4            Rent            350k Professional
#5            Rent            350k Professional
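
A more explicit alternative, sketched here under the assumption that every name in the replacement vector is a character column of the data frame, is to loop over the names. It is longer than the matrix-index version, but it does not depend on column order or on the replacements covering every column. `df` and `repl` are illustrative names, not from the question.

```r
df <- data.frame(HomeOwnerStatus = c(NA, NA, NA, "Rent", "Rent"),
                 HomeMarketValue = c(NA, NA, NA, "350k", "350k"),
                 Occupation      = rep(NA_character_, 5),
                 stringsAsFactors = FALSE)

# Deliberately out of column order and missing HomeMarketValue.
repl <- c(Occupation = "Professional", HomeOwnerStatus = "Own")

for (nm in names(repl)) {
  stopifnot(nm %in% names(df))             # guard against misspelled names
  df[[nm]][is.na(df[[nm]])] <- repl[[nm]]  # fill only this column's NAs
}
```

Columns without an entry in `repl` (here HomeMarketValue) are left untouched.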

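【Comments】:

  • This is basically the same as akrun's; I was typing it up while they posted.
  • It is a bit different, and I like it. So you can leave it to the OP to decide. But perhaps the length of replacementVals should be checked.
  • This works: indx1 <- replacementVals[col(SmallStoredf)][is.na(SmallStoredf[,names(replacementVals)])]; SmallStoredf[is.na(SmallStoredf[,names(replacementVals)])] <- indx1. The code could be adjusted or shortened.
  • Yes, good call. It looks like I did not handle the dimensions; I need to revise the answer.
  • @ScottDavis It would help a lot if you could post the code that creates the data frame shown in your new edit, because so far we are only guessing at what exactly it is (judging by the error messages it looks like you used factors, whereas above I used strings).
【Solution 2】:

Try: (using your second example, since it is a bit unclear which applies when you show two datasets)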

indx <- which(is.na(SmallStoredf), arr.ind=TRUE)
SmallStoredf[indx] <- c("Own", "350K-500K", "Professional")[indx[,2]]
SmallStoredf
#  HomeOwnerStatus HomeMarketValue   Occupation
#1             Own       350K-500K Professional
#2             Own       350K-500K Professional
#3             Own       350K-500K Professional
#4            Rent       350k-500k Professional
#5            Rent        500k-1mm Professional
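
To see what `indx` holds: `which(..., arr.ind = TRUE)` returns a two-column matrix of (row, col) positions, so `indx[, 2]` is the column number of each NA cell and can index a vector of replacements listed in column order. The sketch below uses made-up data (`toy` and `repl` are assumed names). As a deliberate variation on the answer, the assignment goes through the logical mask, which visits the same cells in the same column-major order.

```r
toy <- data.frame(a = c(NA, "p", NA),
                  b = c("q", NA, "r"),
                  stringsAsFactors = FALSE)
repl <- c("A_mode", "B_mode")   # one entry per column, in column order

# (row, col) position of every NA cell, listed column by column.
indx <- which(is.na(toy), arr.ind = TRUE)
indx[, 2]                       # column number of each NA cell: 1 1 2

# One replacement per NA cell, picked by that cell's column number.
toy[is.na(toy)] <- repl[indx[, 2]]
```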

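【Comments】:

    【Solution 3】:

    Promoting my comment to an answer.

    If you want to replace missing data with the most frequent category, bear in mind that several categories in a variable may share the same maximum count. So in the code below, the replacement is drawn at random from the most frequent categories.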

    # some example data with missing
    set.seed(1)
    dat <- data.frame(x=sample(letters[1:3],20,TRUE), 
                      y=sample(letters[1:3],20,TRUE),
                      w=rnorm(20),
                      z=sample(letters[1:3],20,TRUE),                  
                      stringsAsFactors=FALSE)
    
    dat[c(5,10,15),1] <- NA
    dat[c(3,7),2] <- NA
    
    # function to get replacement for missing
    # sample is used to randomly select categories, allowing for the case 
    # when the maximum frequency is shared by more than one category 
    
    f <- function(x) {
                    tab <- table(x)
                    l <- sum(is.na(x))
                    sample(names(tab)[tab==max(tab)], l, TRUE)
                    }
    
    # as we are using sample, set.seed before replacing
    set.seed(1)
    
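    for(i in seq_along(dat)){
                # test the column itself with [[i]]; dat[i] is a one-column
                # data frame, so is.numeric(dat[i]) is always FALSE
                if(!is.numeric(dat[[i]]))
                      dat[[i]][is.na(dat[[i]])] <- f(dat[[i]])
                }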
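
If reproducibility without a seed matters, one alternative (an assumption on my part, not part of the answer above) is to break ties deterministically, for example by taking the first of the tied categories in `table()` order. `mode_first` and `impute_mode` are hypothetical helper names.

```r
# Most frequent non-missing value; ties go to the first category in
# table() order (alphabetical for character vectors).
mode_first <- function(x) {
  tab <- table(x)                       # table() drops NA by default
  if (length(tab) == 0) return(NA_character_)  # column is entirely NA
  names(tab)[which.max(tab)]
}

# Replace NAs in every non-numeric column with that column's mode.
impute_mode <- function(df) {
  for (i in seq_along(df)) {
    col_i <- df[[i]]
    if (!is.numeric(col_i) && anyNA(col_i)) {
      col_i[is.na(col_i)] <- mode_first(col_i)
      df[[i]] <- col_i
    }
  }
  df
}

# Made-up data: "a" and "b" are tied in g; v is numeric and left alone.
toy <- data.frame(g = c("b", "a", NA, "b", "a"),
                  v = c(1, NA, 3, 4, 5),
                  stringsAsFactors = FALSE)
out <- impute_mode(toy)
```

This sketch assumes character columns; factor columns would work as long as the mode is already one of the levels, which it is by construction.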
    

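    A gentle warning: you should think carefully before imputing missing data this way. For example, income is often more likely to be missing in the highest and lowest categories. With this approach, you could wrongly impute the average salary. You should consider why each variable is missing, and whether it is reasonable to assume the data are MCAR or MAR. If so, I would consider a more robust imputation method (the mice package).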

    【Comments】:
