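[Posted]: 2011-10-05 16:32:27
[Question]:
This question is related to an earlier one on how to replace accented strings such as México with the equivalent LaTeX code M\'{e}xico.
My problem is slightly different. I am using a third-party database whose string variables contain Spanish accents, like the one above. However, the encoding seems odd, because this is the behaviour I get: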
> grep("México",temp$dest_nom_ent)
integer(0)
> grep("Mexico",temp$dest_nom_ent)
integer(0)
> grep("xico",temp$dest_nom_ent)
[1] 18 19 20
> temp$dest_nom_ent[grep("xico",temp$dest_nom_ent)]
[1] "México" "México" "México"
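where temp$dest_nom_ent is a variable holding Mexican state names.
My question, then, is how to convert the string variable from the third-party database into an encoding that standard R functions can recognise. Note that: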
> Encoding(temp$dest_nom_ent)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[8] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[15] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[22] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[29] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[36] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[43] "unknown" "unknown"
For more information, I am using Windows 7 64-bit. Also note:
> charToRaw(temp$dest_nom_ent[18])
[1] 4d e9 78 69 63 6f
which, according to this source, is consistent with the Windows Spanish (Traditional Sort) locale:
M=4d
é=e9
x=78
i=69
c=63
o=6f
Also note:
> charToRaw("México")
[1] 4d c3 a9 78 69 63 6f
> Encoding("México")
[1] "latin1"
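The two byte dumps above differ only in how é is stored: e9 in the database string versus c3 a9 in the typed literal. The following is a minimal sketch, not part of the original post, that rebuilds the database bytes by hand (so it does not depend on the console's locale) and shows the difference between declaring an encoding and converting to one. It assumes the database bytes really are Latin-1/Windows-1252; the variable names are made up for illustration.

```r
# Rebuild the six bytes reported by charToRaw(temp$dest_nom_ent[18]).
db_bytes <- as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f))
x <- rawToChar(db_bytes)

# Declaring an encoding only sets a marker on the string;
# the underlying bytes are left untouched.
Encoding(x) <- "latin1"
marked_bytes <- charToRaw(x)

# enc2utf8() re-encodes a latin1-marked string, so the single
# byte e9 becomes the two-byte UTF-8 sequence c3 a9.
utf8_bytes <- charToRaw(enc2utf8(x))

print(marked_bytes)  # 4d e9 78 69 63 6f
print(utf8_bytes)    # 4d c3 a9 78 69 63 6f
```

If the assumption about the source encoding is wrong, the converted bytes will differ from the c3 a9 sequence shown for the typed literal.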
I tried the following without success (meaning that, for example, grep("é",temp$dest_nom_ent) still returns an empty vector):
Encoding(temp$dest_nom_ent)<-"latin1"
temp$dest_nom_ent <- iconv(temp$dest_nom_ent,"","latin1")
temp$dest_nom_ent <- enc2utf8(temp$dest_nom_ent)
...
I checked the supported character sets with iconvlist(), and "WINDOWS-1252" is supported. However, the following does not work:
> temp1 <- temp$dest_nom_ent[grep("xico",temp$dest_nom_ent)]
> temp1
[1] "México" "México" "México"
> Encoding(temp1)<-"WINDOWS-1252"
> temp1 <- iconv(temp1,"WINDOWS-1252","latin1")
> temp1
[1] "México" "México" "México"
> Encoding(temp1)
[1] "latin1" "latin1" "latin1"
> charToRaw(temp1[1])
[1] 4d e9 78 69 63 6f
> grep("é",temp1)
integer(0)
Compare:
> temp2 <- c("México","México","México")
> temp2
[1] "México" "México" "México"
> Encoding(temp2)
[1] "latin1" "latin1" "latin1"
> charToRaw(temp2[1])
[1] 4d c3 a9 78 69 63 6f
> grep("é",temp2)
[1] 1 2 3
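The comparison above suggests that grep only matches when the pattern and the data store é with the same bytes. One way to make that explicit is to bring both sides to UTF-8 before matching. This is a hedged sketch under the assumption that the database column is Latin-1/Windows-1252; the vector `states` and the entry "Jalisco" are made-up sample data, and the pattern is written as the escape "\u00e9" so that it does not depend on how the console encodes typed input.

```r
# One database-style value: "M", 0xe9, "xico" as single-byte Latin-1.
raw_name <- rawToChar(as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f)))

# Hypothetical sample column mixing accented and plain ASCII values.
states <- c(raw_name, "Jalisco", raw_name)

# Convert the data from the assumed source encoding to UTF-8.
states_utf8 <- iconv(states, from = "latin1", to = "UTF-8")

# "\u00e9" is the letter e with acute accent, stored by R as UTF-8.
pattern <- "\u00e9"

# fixed = TRUE does a literal match rather than a regular-expression match.
hits <- grep(pattern, states_utf8, fixed = TRUE)
print(hits)
```

With the sample data above, only the first and third elements contain the accented letter.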
I tried to find the encoding by brute force, e.g.:
try(for(i in 1:length(iconvlist())){
temp1 <- temp$dest_nom_ent[grep("xico",temp$dest_nom_ent)]
Encoding(temp1)<-iconvlist()[i]
temp1 <- iconv(temp1,iconvlist()[i],"latin1")
print(grep("é",temp1))
print(i)
},silent=FALSE)
I am not familiar with the try function, but it still stops at the error rather than ignoring it, so I cannot check the whole list:
...
[1] 17
integer(0)
[1] 18
integer(0)
[1] 19
integer(0)
[1] 20
Error in iconv(temp1, iconvlist()[i], "latin1") :
unsupported conversion from 'CP-GR' to 'latin1' in codepage 1252
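The loop stops because try() wraps the whole for loop: the error is caught once, but by then the loop has already been abandoned. A possible variation, sketched here rather than taken from the original post, is to handle the error inside each iteration with tryCatch() so that an unsupported encoding is recorded and skipped. The candidate list is deliberately short and includes a made-up name ("NOT-A-REAL-ENCODING") to trigger the error path; in practice it could be replaced by iconvlist().

```r
# Database-style bytes for "México" in a single-byte encoding.
x <- rawToChar(as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f)))

# Short illustrative candidate list; the middle entry is intentionally bogus.
candidates <- c("latin1", "NOT-A-REAL-ENCODING", "WINDOWS-1252")

# For each candidate, convert to UTF-8 and test for e-acute.
# TRUE/FALSE = conversion ran, NA = iconv() refused the encoding name.
matches <- vapply(candidates, function(enc) {
  tryCatch({
    y <- iconv(x, from = enc, to = "UTF-8")
    !is.na(y) && grepl("\u00e9", y, fixed = TRUE)
  }, error = function(e) NA)
}, logical(1))

print(matches)
```

Because the handler sits inside the function applied to each candidate, one failing conversion no longer prevents the remaining encodings from being checked.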
Finally:
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> d<-c("México","México")
> for(i in 1:7){d1 <- str_sub(d[1],i,i); print(d1)}
[1] "M"
[1] "Ã"
[1] "©"
[1] "x"
[1] "i"
[1] "c"
[1] "o"
> print(grep("é",d))
[1] 1 2
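It seems I have to change my computer's regional settings, as suggested here. See also here.
PS: In case you are wondering how I typed d<-c("México","México") under the English_United States.1252 locale: I set up a secondary Spanish (Traditional Sort) keyboard via Control Panel > Clock, Language and Region > Region and Language > Keyboards and Languages > Change Keyboards, clicking Add under installed services and navigating to Spanish (Traditional Sort). Then, under advanced key settings, you can create a shortcut to switch keyboards; in my case Shift+Alt. So to type ñ under the default English locale, I press Shift+Alt, then ;, then Shift+Alt again to return to the English keyboard.
[Discussion]:
Tags: string r locale diacritics xtable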