【Posted】: 2015-11-18 05:06:05
【Question】:
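I am trying to clean up about 2 million entries in a database of job titles. Many of them contain abbreviations that I want to change to a single consistent, more easily searchable form. So far I have simply been running through the column with individual mapply(gsub, ...) calls. But I have about 80 changes to make, so the whole thing takes about 30 minutes to run.
There has to be a better way. I'm new to string searching; I found the *$ trick, which helped. Is there a way to do more than one search in a single mapply? I imagine that might be faster?
Any help would be great. Thanks.
Some of the code is below. test is a column of 2 million individual job titles.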
test <- mapply(gsub, " Admin ", " Administrator ", test)
test <- mapply(gsub, "Admin ", "Administrator ", test)
test <- mapply(gsub, " Admin*$", " Administrator", test)
test <- mapply(gsub, "Acc ", " Accounting ", test)
test <- mapply(gsub, " Admstr ", " Administrator ", test)
test <- mapply(gsub, " Anlyst ", " Analyst ", test)
test <- mapply(gsub, "Anlyst ", "Analyst ", test)
test <- mapply(gsub, " Asst ", " Assistant ", test)
test <- mapply(gsub, "Asst ", "Assistant ", test)
test <- mapply(gsub, " Assoc ", " Associate ", test)
test <- mapply(gsub, "Assoc ", "Associate ", test)
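For context on the timing: mapply(gsub, pattern, replacement, test) calls gsub once per element, so each of the 80 lines above makes roughly 2 million separate calls. gsub is already vectorized over its x argument, so one direct call per pattern can process the whole column. The sketch below illustrates that idea under some assumptions: the sample vector and the abbrev lookup are hypothetical stand-ins, and the word-boundary anchor \\b is swapped in for the three separate leading-space, trailing-space, and end-of-string patterns. (In " Admin*$" the * applies only to the final n, so " Admin$" would express the end-of-string case.)

```r
# Hypothetical sample standing in for the 2-million-row column `test`.
test <- c("Admin Asst", "Sr Admin", "Assoc Anlyst II", "Administrator")

# Hypothetical lookup: names are the abbreviations, values the replacements.
abbrev <- c(
  Admin  = "Administrator",
  Admstr = "Administrator",
  Anlyst = "Analyst",
  Asst   = "Assistant",
  Assoc  = "Associate"
)

# gsub() is vectorized over x, so each call handles the whole column;
# only the (short) list of patterns is looped over.
for (abbr in names(abbrev)) {
  # \\b is a word boundary: it covers the start, middle, and end of a
  # title, and leaves longer words such as "Administrator" untouched.
  pattern <- paste0("\\b", abbr, "\\b")
  test <- gsub(pattern, abbrev[[abbr]], test, perl = TRUE)
}

print(test)
```

With 80 abbreviations this makes 80 vectorized calls in total rather than 80 × 2 million. Whether word boundaries are the right rule for every abbreviation depends on the data, so it is worth checking on a sample of titles first.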
【Comments】:
Tags: regex r performance data-cleaning