Edit (recode, collapse, order) factor levels with regex matching: answers

[Question title]: Edit (recode, collapse, order) factor levels with regex matching
[Posted]: 2016-10-09 12:21:27
[Question]:

I find manipulating factor variables in R overly complicated. Things I frequently want to do when cleaning factors include:

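Reordering levels - not just setting the reference category, but putting all levels into a logical (non-alphabetical) order for summary tables. x <- factor(x, levels = new.order)
Recoding/renaming factor levels - to simplify names and/or collapse multiple categories into one group. For one-to-one recoding use levels(x) <- new.levels(x) or plyr::revalue; see here or here for examples. car::recode can perform multiple many-to-one matches (several old levels to one new level) in a single statement, but does not support regex matching.
Dropping levels - not just dropping unused levels, but setting some levels to missing (e.g. those holding error codes). x <- factor(as.character(x), exclude = drop.levels)
Adding levels - to display categories with zero counts.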
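For reference, the four operations listed above can be sketched in base R roughly as follows. The data and the names new.order and drop.levels are made up for illustration; "err" stands in for an error code.

```r
# Toy factor; levels are alphabetical by default
x <- factor(c("low", "high", "medium", "err", "low"))

# 1. Reorder levels into a logical order
new.order <- c("low", "medium", "high", "err")
x1 <- factor(x, levels = new.order)

# 2. Recode / collapse: a named list maps each new level to its old level(s)
x2 <- x1
levels(x2) <- list(Low = "low", "Medium/High" = c("medium", "high"), Error = "err")

# 3. Drop levels: values in the excluded levels become NA
drop.levels <- "err"
x3 <- factor(as.character(x1), levels = new.order, exclude = drop.levels)

# 4. Add a level so that it shows up with a zero count in tables
x4 <- factor(x1, levels = c(levels(x1), "very high"))
table(x4)
```

Each step needs a different idiom, which is the inconsistency the question is about.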

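Ideally there would be a single function that does all of the above at once, allows fuzzy (regex) matching when recoding and dropping levels, can be used inside other functions (e.g. lapply), and has a simple (consistent) syntax.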

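I have posted my best attempt at this as an answer below, but please let me know if I have missed a function that already exists or if the code could be improved.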

Edit

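I am aware of the forcats package, which is subtitled "Tools for Working with Categorical Variables (Factors)". The package has many options for reordering levels ('fct_infreq', 'fct_reorder', 'fct_relevel', ...), recoding/grouping levels ('fct_recode', 'fct_lump', 'fct_collapse'), dropping levels ('fct_recode'), and adding levels ('fct_expand'). But there are no plans to support regex matching (https://github.com/tidyverse/forcats/issues/214).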
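As a rough illustration of those forcats functions, and of one possible workaround for the missing regex support, the sketch below expands a pattern into the matching levels with grep(value = TRUE) before handing them to fct_collapse. It assumes forcats is installed; the example data are invented.

```r
library(forcats)

x <- factor(c("dogfish", "rabbit", "catfish", "mouse", "dirt"))

# Reorder: move the named levels to the front
x_ord <- fct_relevel(x, "mouse", "rabbit")

# Collapse with exact level names
x_col <- fct_collapse(x, Sea = c("catfish", "dogfish"), Land = c("mouse", "rabbit"))

# Drop: recoding a level to NULL turns its values into NA
x_drop <- fct_recode(x, NULL = "dirt")

# Add an (empty) level
x_add <- fct_expand(x, "Air")

# Workaround sketch: resolve the regex to concrete levels first
x_rx <- fct_collapse(
  x,
  Sea  = grep("fish",    levels(x), value = TRUE),
  Land = grep("rab|mou", levels(x), value = TRUE)
)
```

This works, but the pattern has to be resolved by hand for every group, so it is not the single consistent call asked for above.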

[Comments]:

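What do you mean by "in one step"?
@effel I suppose I was thinking of a one-line command that does everything and can be incorporated into an lapply call or similar, although I acknowledge this can be done in R by wrapping everything in a custom function. I was also wondering whether I had missed a command from dplyr or another package that does what car::recode does, but with a friendlier syntax.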
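A minimal sketch of the "custom function plus lapply" idea from the comment above. recode_regex is a hypothetical helper written for this illustration; it is not the xfactor implementation from the answer and handles recoding/collapsing only.

```r
# Hypothetical helper: `patterns` is a named character vector where the
# names are the new levels and the values are regexes for the old levels.
recode_regex <- function(x, patterns) {
  x <- as.factor(x)
  old <- levels(x)
  new <- old
  # Loop in reverse so that earlier patterns win when a level matches several
  for (i in rev(seq_along(patterns))) {
    new[grepl(patterns[[i]], old)] <- names(patterns)[i]
  }
  # Assigning duplicated names to levels() merges the corresponding levels
  levels(x) <- new
  x
}

df <- data.frame(
  a = c("dogfish", "rabbit", "catfish"),
  b = c("mouse", "dirt", "dogfish"),
  stringsAsFactors = TRUE
)

# Apply the same recoding to every column
df[] <- lapply(df, recode_regex, patterns = c(Sea = "fish", Land = "rab|mou"))
```

Levels that match no pattern (here "dirt") are left unchanged, and the resulting level order follows the original level order rather than the order of the patterns.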

Tags: r regex r-factor

[Solution 1]:

Edit: A few years on, I have added the xfactor function on GitHub to do the above. It is still a work in progress, so please let me know about any bugs etc.

devtools::install_github("jwilliman/xfactor")


library(xfactor)

# Create example factor
x <- xfactor(c("dogfish", "rabbit","catfish", "mouse", "dirt"))
levels(x)
#> [1] "catfish" "dirt"    "dogfish" "mouse"   "rabbit"

# Factor levels can be reordered by passing an unnamed vector to the levels
# statement. Levels not included in the replace statement get moved to the end
# or dropped if exclude = TRUE.
xfactor(x, levels = c("mouse", "rabbit"))
#> [1] dogfish rabbit  catfish mouse   dirt   
#> Levels: mouse rabbit catfish dirt dogfish

xfactor(x, levels = c("mouse", "rabbit"), exclude = TRUE)
#> [1] <NA>   rabbit <NA>   mouse  <NA>  
#> Levels: mouse rabbit

# Factor levels can be recoded, collapsed, and ordered by passing a named
# vector to the levels statement, where the vector names are the new factor
# levels and the vector values are regex expressions for the old levels.
# Duplicated new levels will be collapsed.

xfactor(x, levels = c("Sea" = "fish", "Land" = "rab|mou"))
#> [1] Sea  Land Sea  Land dirt
#> Levels: Sea Land dirt

# Factor levels can be dropped by passing a regex expression (or vector) to
# the exclude statement

xfactor(x, exclude = "fish")
#> [1] <NA>   rabbit <NA>   mouse  dirt  
#> Levels: dirt mouse rabbit

# The function will work within other functions

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- data.frame(n = 1:5, x)
df %>%
  mutate(y = xfactor(x, levels = c("Sea" = "fish", "Land" = "rab|mou", "Air"), exclude = "di"))
#>   n       x    y
#> 1 1 dogfish  Sea
#> 2 2  rabbit Land
#> 3 3 catfish  Sea
#> 4 4   mouse Land
#> 5 5    dirt <NA>

Created on 2020-04-16 by the reprex package (v0.3.0)

[Comments]: