R: factor levels, recode rest to 'other'

【Question title】: R: factor levels, recode rest to 'other'
【Posted】: 2013-03-20 20:05:12
【Question】:

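I don't use factors very often and generally find them easy to understand, but I am often hazy on the details of specific operations. Right now I am collapsing categories with few observations into "other" and am looking for a quick way to do it. I have a variable with perhaps 20 levels, and I want to collapse a bunch of them into one.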

# Example data: 500 rows; the 100 sampled naics codes are recycled to fill them
data <- data.frame(employees = sample.int(1000, 500),
                   naics = sample(c('621111','621112','621210','621310','621320','621330','621340','621391','621399','621410','621420','621491','621492','621493','621498','621511','621512','621610','621910','621991','621999'),
                                  100, replace = TRUE))

Here are the levels I am interested in, and their labels, in separate vectors.

#levels and labels
top8 <-c('621111','621210','621399','621610','621330',
         '621310','621511','621420','621320')
top8_desc <- c('Offices of physicians',
               'Offices of dentists',
               'Offices of all other miscellaneous health practitioners',
               'Home health care services',
               'Offices of Mental Health Practitioners',
               'Offices of chiropractors',
               'Medical Laboratories',
               'Outpatient Mental Health and Substance Abuse Centers',
               'Offices of optometrists')

I could use a factor() call and enumerate every level, assigning "other" to each category that has few observations.

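Assuming that top8 and top8_desc above are the actual top 8, what is the best way to declare data$naics as a factor so that the values in top8 are coded correctly and everything else is recoded as other?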

【Question comments】:

Tags: r r-factor

【Solution 1】:

I think the easiest way is to relabel every naics value that is not in the top 8 with a special value.

data$naics[!(data$naics %in% top8)] = -99

Then you can use the exclude argument when converting it to a factor:

factor(data$naics, exclude=-99)
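To see what these two steps do, here is a minimal sketch on toy data. It assumes naics is a character vector (the default for data.frame() in R 4.0 and later); the vector contents and the name keep are made up for illustration. Note that the excluded values end up as NA rather than as an "other" level.

```r
# Toy data (hypothetical values, not taken from the question)
naics <- c("621111", "621210", "621999", "621111", "621498")
keep  <- c("621111", "621210")

# Step 1: overwrite everything outside `keep` with a sentinel value.
# naics is character here, so -99 is stored as the string "-99".
naics[!(naics %in% keep)] <- -99

# Step 2: build the factor and exclude the sentinel from the levels.
f <- factor(naics, exclude = -99)

levels(f)  # only the kept codes remain as levels
is.na(f)   # the sentinel values have become NA
```

If naics were already a factor, the assignment in step 1 would warn about an invalid level and produce NA directly, so converting with as.character() first may be needed.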

【Comments】:

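Hmm, this actually discards data rather than changing the classification, though that may be what coding as a factor does in the first place. I suppose it doesn't matter much.
You can always create an additional column in the data frame that holds the transformed codes.
I tried a variant of your answer: levels(data$naics)[which(!levels(data$naics)%in%top8)] <- "other"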
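The levels() assignment in the last comment can be tried on a small factor. This is a sketch on made-up data (the values and the name keep are assumptions); it relies on base R merging levels that are given the same name.

```r
# Toy factor (hypothetical values)
naics <- factor(c("621111", "621210", "621999", "621111", "621498"))
keep  <- c("621111", "621210")

# Rename every level outside `keep` to "other"; levels that end up
# with the same name are merged into a single level.
levels(naics)[!(levels(naics) %in% keep)] <- "other"

levels(naics)
table(naics)
```

Unlike the exclude approach, no observation becomes NA: the rare codes are all counted under "other".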

【Solution 2】:

You can use forcats::fct_other():

library(forcats)
data$naics <- fct_other(data$naics, keep = top8, other_level = 'other')

Or use fct_other() inside dplyr::mutate():

library(dplyr)
data <- mutate(data, naics = fct_other(naics, keep = top8, other_level = 'other')) 

data %>% head(10)
   employees  naics
1        420  other
2        264  other
3        189  other
4        157 621610
5        376 621610
6        236  other
7        658 621320
8        959 621320
9        216  other
10       156  other

Note that if the other_level argument is not set, the collapsed level is named "Other" by default (with a capital "O").
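A small sketch of the difference, using a made-up factor (the data are an assumption; fct_other() and its keep and other_level arguments are the ones used in this answer):

```r
library(forcats)

f <- factor(c("a", "b", "c", "a"))  # toy data

default_name <- fct_other(f, keep = c("a", "b"))
custom_name  <- fct_other(f, keep = c("a", "b"), other_level = "other")

levels(default_name)  # "a" "b" "Other"
levels(custom_name)   # "a" "b" "other"
```

The collapsed level is placed last in both cases.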

Conversely, if you only want to convert a few levels to "other", you can use the drop argument instead:

data %>%  
  mutate(keep_fct = fct_other(naics, keep = top8, other_level = 'other'),
         drop_fct = fct_other(naics, drop = top8, other_level = 'other')) %>% 
  head(10)

   employees  naics keep_fct drop_fct
1        474 621491    other   621491
2        805 621111   621111    other
3        434 621910    other   621910
4        845 621111   621111    other
5        243 621340    other   621340
6        466 621493    other   621493
7        369 621111   621111    other
8         57 621493    other   621493
9        144 621491    other   621491
10       786 621910    other   621910

dplyr also has recode_factor(), where you can set the .default argument to other, but with as many levels to recode as in this example it can get tedious:

data %>% 
   mutate(naics = recode_factor(naics, `621111` = '621111', `621210` = '621210', `621399` = '621399', `621610` = '621610', `621330` = '621330', `621310` = '621310', `621511` = '621511', `621420` = '621420', `621320` = '621320', .default = 'other'))

【Comments】:

【Solution 3】:

Late to the party.

Here is a wrapper around plyr::mapvalues that accepts a remaining argument (your other):

library(plyr)

Mapvalues <- function(x, from, to, warn_missing= TRUE, remaining = NULL){
  if(!is.null(remaining)){
    therest <- setdiff(x, from)
    from <- c(from, therest)
    to <- c(to, rep_len(remaining, length(therest)))
  }
  mapvalues(x, from, to, warn_missing)
}
# replace the remaining values with "other"
Mapvalues(data$naics, top8, top8_desc,remaining = 'other')
# leave the remaining values alone
Mapvalues(data$naics, top8, top8_desc)
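If loading plyr is not an option, roughly the same mapping can be sketched in base R with match(). The helper name recode_rest and the toy data below are hypothetical; this is an illustration of the idea, not the plyr implementation.

```r
# Hypothetical helper, base R only: map `from` codes to `to` labels and
# send every other non-missing value to `remaining`.
recode_rest <- function(x, from, to, remaining = "other") {
  x <- as.character(x)
  out <- to[match(x, from)]                 # NA where x is not in `from`
  out[is.na(out) & !is.na(x)] <- remaining  # genuine NAs stay NA
  factor(out, levels = unique(c(to, remaining)))
}

codes  <- c("621111", "621999", "621210", NA)
labels <- recode_rest(codes,
                      from = c("621111", "621210"),
                      to   = c("Offices of physicians", "Offices of dentists"))
as.character(labels)
```

With the vectors from the question, the call would be recode_rest(data$naics, top8, top8_desc), which applies the descriptive labels and the "other" bucket in one step.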

【Comments】:

【Solution 4】:

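I wrote a function to do this, which may be useful to others. First, I check in relative terms whether a level accounts for less than a fraction mp of the rows (mp = 0.01 means 1%). Then I cap the number of levels at ml.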

ds is a dataset of type data.frame, and I apply this to every column listed in cat_var_names, that is, every column that is a factor.

cat_var_names <- names(clean_base[sapply(clean_base, is.factor)])

recodeLevels <- function (ds = clean_base, var_list = cat_var_names, mp = 0.01, ml = 25) {
  # remove less frequent levels in factor
  # 
  n <- nrow(ds)
  # keep levels that account for more than a fraction mp of the rows
  for (i in var_list){
    keep <- levels(ds[[i]])[table(ds[[i]]) > mp * n]
    levels(ds[[i]])[which(!levels(ds[[i]])%in%keep)] <- "other"
  }

  # keep top ml levels
  for (i in var_list){
    keep <- names(sort(table(ds[i]),decreasing=TRUE)[1:ml])
    levels(ds[[i]])[which(!levels(ds[[i]])%in%keep)] <- "other"
  }
  return(ds)
}
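The same two rules can be sketched for a single factor. The function name collapse_rare and the toy data are assumptions, and the sketch is a simplified variation: it ranks only the levels that pass the frequency threshold, whereas the function above re-counts after the first pass.

```r
# Hypothetical single-factor variant of the idea above.
collapse_rare <- function(f, mp = 0.01, ml = 25) {
  counts <- table(f)
  # Rule 1: keep levels with more than a fraction mp of the observations
  keep <- names(counts)[counts > mp * length(f)]
  # Rule 2: of those, keep at most the ml most frequent levels
  keep <- head(names(sort(counts[keep], decreasing = TRUE)), ml)
  levels(f)[!(levels(f) %in% keep)] <- "other"
  f
}

# Toy data: "a" x50, "b" x45, "c" x4, "d" x1
f <- factor(c(rep("a", 50), rep("b", 45), rep("c", 4), "d"))
g <- collapse_rare(f, mp = 0.02, ml = 2)
table(g)
```

Here "d" fails the 2% threshold and "c" falls outside the top two, so both end up under "other".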

【Comments】:

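This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. You can always comment on your own posts, and once you have sufficient reputation you will be able to comment on any post.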