【Question title】: Data cleaning: aggregate similar fields into one [duplicate]
【Posted】: 2021-02-24 16:37:15
【Question description】:

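How can I group the elements of a variable by common patterns? For example, I have a database with a field called company role, and I want to collapse the variants of each common role into a single value.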

employee <- c("a", "b", "c", "d", "e")
Rol      <- c(" accounting assistant", "accou assist", "account.assistant", 
              "healt aux", "auxiliary in healt")
DF <- data.frame(employee, Rol)

I would like to turn it into something like this:

Employee ROL
A accounting assistant
B accounting assistant
C accounting assistant
D Healt auxiliary
E Healt auxiliary

At the moment I identify the patterns by hand, but the task gets harder as the data grows. Any help would be appreciated. Thanks!

【Comments】:

  • Is there a key/value pair for Healt auxiliary?
  • Try cbind(DF, cl=cutree(hclust(as.dist(adist(tolower(DF$Rol)))), h=16))
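
The clustering idea in the comment above can be sketched step by step as follows. This is a minimal sketch rather than a tested recipe: the normalisation step and the choice of k = 2 groups are assumptions added here (the comment instead cuts the tree at h = 16 on the lower-cased strings), and only the column name cl is taken from the comment.

```r
# Sample data from the question
employee <- c("a", "b", "c", "d", "e")
Rol <- c(" accounting assistant", "accou assist", "account.assistant",
         "healt aux", "auxiliary in healt")
DF <- data.frame(employee, Rol)

# Assumed normalisation: lower-case, punctuation to spaces,
# collapse repeated whitespace, trim the ends
clean <- trimws(gsub("\\s+", " ", gsub("[[:punct:]]", " ", tolower(DF$Rol))))

# Pairwise Levenshtein (edit) distances between the cleaned strings
d <- as.dist(adist(clean))

# Hierarchical clustering (hclust defaults to complete linkage)
tree <- hclust(d)

# Cut the tree into an assumed number of groups; k is a tuning choice
DF$cl <- cutree(tree, k = 2)
print(DF)
```

Plain edit distance copes well with truncations such as "accou assist" but is weaker on reordered words such as "auxiliary in healt", so it is worth looking at plot(tree) and the resulting groups before trusting the cut.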

Tags: r data-cleaning


【Solution 1】:
library(dplyr)
# your dataframe
employee <- c("a", "b", "c", "d", "e")
Rol      <- c(" accounting assistant", "accou assist", "account.assistant", 
              "healt aux", "auxiliary in healt")
DF <- data.frame(employee, Rol)

# save the terms you want to unify in vectors
vector_accounting <- c(" accounting assistant", "accou assist", "account.assistant")
vector_healt <- c("healt aux", "auxiliary in healt")

# recode with %in%; values found in neither vector are kept as they are
DF1 <- DF %>% 
  mutate(Rol = case_when(Rol %in% vector_accounting ~ "accounting assistant",
                         Rol %in% vector_healt ~ "Healt auxiliary",
                         TRUE ~ as.character(Rol)))

【Comments】:

  • I don't think this is what the OP wants. As far as I can tell, they need some kind of fuzzy join.
  • OK, I see. Thanks.
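
One small step from exact matching toward the looser matching asked for in the comment is to match on a keyword stem per role instead of on whole strings. The sketch below is an illustration only: the stems lookup and the canonical_role helper are hypothetical names introduced here, not part of the answer, and stem matching is a simpler substitute for a true fuzzy join.

```r
# Sample data from the question
employee <- c("a", "b", "c", "d", "e")
Rol <- c(" accounting assistant", "accou assist", "account.assistant",
         "healt aux", "auxiliary in healt")
DF <- data.frame(employee, Rol)

# Hypothetical lookup: names are keyword stems, values are canonical roles
stems <- c(acc = "accounting assistant", healt = "Healt auxiliary")

# Hypothetical helper: return the canonical role of the first stem
# found in each string, or NA when no stem matches
canonical_role <- function(x, stems) {
  x_low <- tolower(x)
  out <- rep(NA_character_, length(x))
  for (s in names(stems)) {
    hit <- is.na(out) & grepl(s, x_low, fixed = TRUE)
    out[hit] <- stems[[s]]
  }
  out
}

DF$Rol_clean <- canonical_role(DF$Rol, stems)
print(DF)
```

Rows left as NA are the ones no stem covers, which gives a short list to review by hand as new spellings show up in the data.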