Data cleaning: aggregate similar fields into one [duplicate]

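【Question title】: Data cleaning: aggregate similar fields into one [duplicate]
【Posted】: 2021-02-24 16:37:15
【Question】:

How can I group the elements of a variable by common patterns? For example, I have a database with a field called company role, and I want to collapse the different spellings of the same role into a single value.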

employee <- c("a", "b", "c", "d", "e")
Rol      <- c(" accounting assistant", "accou assist", "account.assistant", 
              "healt aux", "auxiliary in healt")
DF <- data.frame(employee, Rol)

I would like to turn it into something like this:

Employee	ROL
A	accounting assistant
B	accounting assistant
C	accounting assistant
D	Healt auxiliary
E	Healt auxiliary

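At the moment I identify the patterns by hand, but the task gets harder as the data grows. I would appreciate any help. Thanks!

【Comments】:

Is there a key/value pair for Healt auxiliary?
Try cbind(DF, cl=cutree(hclust(as.dist(adist(tolower(DF$Rol)))), h=16)).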
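
The one-liner in the comment above can be unpacked step by step. This is only a sketch on the sample data from the question: the cut height h = 16 is taken from the comment and would almost certainly need retuning on real data, and the labelling rule at the end (use the longest trimmed spelling in each cluster) is a hypothetical choice, not something the commenter proposed.

```r
# Sample data from the question
employee <- c("a", "b", "c", "d", "e")
Rol      <- c(" accounting assistant", "accou assist", "account.assistant",
              "healt aux", "auxiliary in healt")
DF <- data.frame(employee, Rol, stringsAsFactors = FALSE)

# Pairwise Levenshtein (edit) distances between the lower-cased roles
d <- adist(tolower(DF$Rol))

# Hierarchical clustering on those distances (hclust defaults to complete linkage)
tree <- hclust(as.dist(d))

# Cut the tree at height 16; the threshold comes from the comment and is
# an assumption that has to be tuned for other data
DF$cl <- cutree(tree, h = 16)

# Hypothetical labelling rule: within each cluster, keep the longest trimmed spelling
DF$Rol_clean <- ave(trimws(DF$Rol), DF$cl,
                    FUN = function(x) x[which.max(nchar(x))])

print(DF)
```

The cluster ids in cl only say which rows belong together. The final display name for each cluster still has to be chosen, either by a rule like the one above or by hand.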

Tags: r data-cleaning

【Solution 1】:

library(dplyr)
# your dataframe
employee <- c("a", "b", "c", "d", "e")
Rol      <- c(" accounting assistant", "accou assist", "account.assistant", 
              "healt aux", "auxiliary in healt")
DF <- data.frame(employee, Rol)

# save the terms you want to unify in vectors
vector_accounting <- c(" accounting assistant", "accou assist", "account.assistant")
vector_healt <- c("healt aux", "auxiliary in healt")

# recode each group with %in%; values that match neither vector are kept unchanged
DF1 <- DF %>% 
  mutate(Rol = case_when(Rol %in% vector_accounting ~ "accounting assistant",
                         Rol %in% vector_healt ~ "Healt auxiliary",
                         TRUE ~ as.character(Rol)))

【Comments】:

I don't think this is what the OP wants. As far as I can tell, they need some kind of fuzzy join.
OK, I see. Thanks.
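
One way to read the "fuzzy join" suggestion is to match every raw role to the closest entry in a short list of canonical names. The sketch below does that in base R with adist. The canonical vector is an assumption (it simply reuses the two labels from the desired output in the question), and picking the nearest name with no distance cutoff is a simplification that may mislabel roles that belong to neither group.

```r
# Sample data from the question
employee <- c("a", "b", "c", "d", "e")
Rol      <- c(" accounting assistant", "accou assist", "account.assistant",
              "healt aux", "auxiliary in healt")
DF <- data.frame(employee, Rol, stringsAsFactors = FALSE)

# Hypothetical list of canonical role names (an assumption; supply your own)
canonical <- c("accounting assistant", "Healt auxiliary")

# Edit distance from every trimmed, lower-cased raw role (rows)
# to every lower-cased canonical name (columns)
d <- adist(tolower(trimws(DF$Rol)), tolower(canonical))

# For each row, take the canonical name with the smallest distance
DF$ROL <- canonical[apply(d, 1, which.min)]

print(DF[, c("employee", "ROL")])
```

With this approach only the canonical list grows as new roles appear, not the list of every misspelling. On larger data it would be worth checking the minimum distance per row and reviewing the rows where it is large.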