【Posted】: 2015-05-24 03:47:09
【Question】:
I have a list of owner names, all in upper case, that I'd like to convert to proper case:
owner1
1: DXXXXX JOSEPH V JR
2: MIRNA NXXXXX
3: ADRIAN TXXXX
4: CUTLER PXXXXXXXXX LLC
5: GVM PXXXXXXXXX LLC
6: EARLENA RXXXXXXX
7: NATHANIEL TXXXXX
8: DXXXXXX DONNA
9: LXXXX ELAINE E TR
10: SXXXXXX KIMBERLY
(For copy-and-paste purposes:
owner1<-c("DXXXXX JOSEPH V JR","MIRNA NXXXXX","ADRIAN TXXXX",
"CUTLER PXXXXXXXXX LLC","GVM PXXXXXXXXX LLC",
"EARLENA RXXXXXXX","NATHANIEL TXXXXX","DXXXXXX DONNA",
"LXXXX ELAINE E TR","SXXXXXX KIMBERLY")
)
Desired output:
owner1
1: Dxxxxx Joseph V. Jr
2: Mirna Nxxxxx
3: Adrian Txxxx
4: Cutler Pxxxxxxxxx LLC
5: GVM Pxxxxxxxxx LLC
6: Earlena Rxxxxxxx
7: Nathaniel Txxxxx
8: Dxxxxxx Donna
9: Lxxxx Elaine E. TR
10: Sxxxxxx Kimberly
A big first step is a version of the .simpleCap function mentioned in ?chartr:
.simpleCap <- function(x) {
  s <- strsplit(tolower(x), " ")[[1]]
  paste(toupper(substring(s, 1, 1)), substring(s, 2),
        sep = "", collapse = " ")
}
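This takes care of a large part of the problem, but it fails on observations 4, 5 and 9. I could supplement it to handle key phrases (LLC, TR, etc.) separately, but that still leaves cases like observation 5.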
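A minimal sketch of that supplement, assuming a hand-maintained keep list. The names keep_upper and simple_cap_keep are hypothetical and not part of ?chartr; the list itself is only illustrative.

```r
# Hypothetical keep list: tokens that should stay upper-case.
# This is an assumption for illustration, not an exhaustive list.
keep_upper <- c("LLC", "TR")

# Vectorised variant of .simpleCap that leaves tokens in `keep` untouched.
simple_cap_keep <- function(x, keep = keep_upper) {
  vapply(strsplit(x, " ", fixed = TRUE), function(tokens) {
    skip <- tokens %in% keep
    capped <- paste0(toupper(substring(tokens, 1, 1)),
                     tolower(substring(tokens, 2)))
    paste(ifelse(skip, tokens, capped), collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

simple_cap_keep(c("CUTLER PXXXXXXXXX LLC", "GVM PXXXXXXXXX LLC"))
```

With this list, LLC and TR survive, but GVM in observation 5 still comes back as Gvm because it is not on the list.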
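Here is the function I have so far (it was sped up considerably by @eipi10's solution below, which vectorizes the .simpleCap function and so lets the whole function be applied to a vector):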
to.proper <- function(strings) {
  # vectorized version of .simpleCap;
  #   I've also built in that I know `strings` is all caps
  res <- gsub("\\b([A-Z])([A-Z]+)*", "\\U\\1\\L\\2", strings, perl = TRUE)
  # In my data, some Irish/Scottish names separated the MC prefix
  #   Also, re-capitalize following a hyphen
  res <- gsub("\\bMc\\s", "Mc", gsub("(-.)", "\\U\\1", res, perl = TRUE))
  for (init in c("[A-Z]", "Inc", "Assoc", "Co",
                 "Jr", "Sr", "Tr", "Bros")) {
    # Add a period after common abbreviations
    res <- gsub(paste0("\\b(", init, ")\\b"), "\\1.", res)
  }
  for (abbr in c("[B-DF-HJ-NP-TV-XZ][b-df-hj-np-tv-xz]{2,}",
                 "Pa", "Ii", "Iii", "Iv", "Lp", "Tj",
                 "Xiv", "Ll", "Yml", "Us")) {
    # Re-capitalize any string of >=3 consonants (excluding
    #   Y for such names as LYNN and WYNN), as well as
    #   some other common phrases that need upper-casing
    res <- gsub(paste0("\\b(", abbr, ")\\b"), "\\U\\1", res, perl = TRUE)
  }
  # Re-capitalize post-Mc letters, e.g. in Mcmahon
  gsub("\\bMc([a-z])", "Mc\\U\\1", res, perl = TRUE)
}
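To see the consonant-run rule in isolation, here is a small sketch that applies just that pattern. The input strings are made up for illustration ("Properties" is not from the data), and consonant_run is a hypothetical name.

```r
# The consonant-run pattern from to.proper, wrapped in word boundaries.
# Y is deliberately left out of both classes so Lynn and Wynn are not matched.
consonant_run <- "\\b([B-DF-HJ-NP-TV-XZ][b-df-hj-np-tv-xz]{2,})\\b"

# Made-up, already title-cased inputs (an assumption for illustration).
titled <- c("Gvm Properties Llc", "Lynn Wynn")

# \\U upper-cases the whole captured token; this requires perl = TRUE.
out <- gsub(consonant_run, "\\U\\1", titled, perl = TRUE)
out
```

Note that X counts as a consonant here, so the anonymized placeholders in the sample data (Pxxxxxxxxx and the like) would also be upper-cased by this rule; real surnames with vowels would not.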
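Any ideas for a robust way to preserve abbreviations that can't be predicted in advance (especially uncommon ones like the one in observation 5) during this process?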
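【Comments】:
-
I think you may need a list of suffixes so that LLC, TR are excluded from the match and left out of the case conversion
-
Besides @akrun's suggestion, have you tried stri_trans_totitle() from the stringi package?
-
@lawyeR That should give the same problem. I tried it :-)
-
@lawyeR Is that in the development version of stringi? I don't see it in the documentation
-
Yes, have a look at page 132 of the PDF documentation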