【Question title】:How to determine the encoding of accents?
【Posted】:2011-10-05 16:32:27
【Question description】:

This question is related to a previous one on how to replace accented strings such as México with the equivalent LaTeX code M\'{e}xico.

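My problem is slightly different. I am working with a third-party database whose string variables contain Spanish accents, like the one above. However, the encoding seems odd, because this is the behavior I get: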

> grep("México",temp$dest_nom_ent)
integer(0)
> grep("Mexico",temp$dest_nom_ent)
integer(0)
> grep("xico",temp$dest_nom_ent)
[1] 18 19 20
> temp$dest_nom_ent[grep("xico",temp$dest_nom_ent)]
[1] "México" "México" "México"

where temp$dest_nom_ent is a variable holding the names of the Mexican states.

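My question, then, is how to convert the string variables from the third-party database into an encoding that standard R functions can recognize. Note that: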

> Encoding(temp$dest_nom_ent)
 [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
 [8] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[15] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[22] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[29] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[36] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
[43] "unknown" "unknown"

For more information, I am using Windows 7 64-bit. Also note:

> charToRaw(temp$dest_nom_ent[18])
[1] 4d e9 78 69 63 6f

which, according to this source, is consistent with the Windows Spanish (Traditional Sort) locale:

M=4d
é=e9
x=78
i=69
c=63
o=6f

Also note:

> charToRaw("México")
[1] 4d c3 a9 78 69 63 6f
> Encoding("México")
[1] "latin1"
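
The two byte dumps above differ only in how "é" is stored: one byte (e9) in the database string, two bytes (c3 a9) in the string typed at the console. A minimal sketch of what that implies, assuming the database bytes are Latin-1 (0xe9 is "é" in Latin-1 and in Windows-1252; the thread itself never confirms the source encoding). The names db_bytes, db_string and as_utf8 are illustrative only.

```r
# Bytes reported by charToRaw(temp$dest_nom_ent[18]) in the question.
db_bytes <- as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f))
db_string <- rawToChar(db_bytes)

# Assumption: the bytes are Latin-1, so convert from that encoding to UTF-8.
as_utf8 <- iconv(db_string, from = "latin1", to = "UTF-8")

# UTF-8 stores "é" as the two bytes c3 a9, which is the form shown by
# charToRaw("México") for the string typed at the console.
utf8_bytes <- charToRaw(as_utf8)
print(utf8_bytes)

# After conversion, a pattern written with a Unicode escape can be searched for.
found <- grepl("\u00e9", as_utf8, fixed = TRUE)
print(found)
```

Writing the pattern as "\u00e9" rather than a literal "é" keeps the example independent of the encoding of the script file or console.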

I tried the following without success (meaning, for example, that grep("é",temp$dest_nom_ent) still returns an empty vector):

Encoding(temp$dest_nom_ent)<-"latin1"
temp$dest_nom_ent <- iconv(temp$dest_nom_ent,"","latin1")
temp$dest_nom_ent  <- enc2utf8(temp$dest_nom_ent)
...

I checked the supported character sets with iconvlist(), and "WINDOWS-1252" is supported. However, the following does not work:

> temp1 <- temp$dest_nom_ent[grep("xico",temp$dest_nom_ent)]
> temp1
[1] "México" "México" "México"
> Encoding(temp1)<-"WINDOWS-1252"
> temp1 <- iconv(temp1,"WINDOWS-1252","latin1")
> temp1
[1] "México" "México" "México"
> Encoding(temp1)
[1] "latin1" "latin1" "latin1"
> charToRaw(temp1[1])
[1] 4d e9 78 69 63 6f
> grep("é",temp1)
integer(0)

Compare:

> temp2 <- c("México","México","México")
> temp2
[1] "México" "México" "México"
> Encoding(temp2)
[1] "latin1" "latin1" "latin1"
> charToRaw(temp2[1])
[1] 4d c3 a9 78 69 63 6f
> grep("é",temp2)
[1] 1 2 3

I tried to find the encoding by brute force, for example:

try(for(i in 1:length(iconvlist())){
    temp1 <- temp$dest_nom_ent[grep("xico",temp$dest_nom_ent)]
    Encoding(temp1)<-iconvlist()[i]
    temp1 <- iconv(temp1,iconvlist()[i],"latin1")
    print(grep("é",temp1))
    print(i)
        },silent=FALSE)

I am not familiar with the try function, but it still stops on the error instead of ignoring it, so I cannot check the whole list:

...
[1] 17
integer(0)
[1] 18
integer(0)
[1] 19
integer(0)
[1] 20
Error in iconv(temp1, iconvlist()[i], "latin1") : 
  unsupported conversion from 'CP-GR' to 'latin1' in codepage 1252
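
The loop stops because try() wraps the whole for loop, so the first failing iconv() call aborts every remaining iteration. One possible way around this, sketched under assumptions, is to catch the error inside each iteration. The helper decodes_to() is hypothetical (it is not from the thread), and the test string is rebuilt from the bytes shown earlier rather than read from the database.

```r
# Hypothetical helper: TRUE if reading `bytes_string` as encoding `enc`
# yields the UTF-8 string `expected`; FALSE if it does not or if the
# conversion is unsupported.
decodes_to <- function(bytes_string, enc, expected) {
  converted <- tryCatch(
    iconv(bytes_string, from = enc, to = "UTF-8"),
    error = function(e) NA_character_   # unsupported conversion: skip it
  )
  !is.na(converted) &&
    identical(charToRaw(converted), charToRaw(enc2utf8(expected)))
}

# The six bytes from charToRaw(temp$dest_nom_ent[18]).
x <- rawToChar(as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f)))

# Try every encoding name iconv knows about and keep the ones that work.
encodings <- iconvlist()
hits <- encodings[vapply(encodings, decodes_to, logical(1),
                         bytes_string = x, expected = "M\u00e9xico")]
print(head(hits))
```

Several encodings usually qualify, because many single-byte code pages place "é" at 0xe9; the list narrows the candidates rather than identifying one encoding with certainty.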

Finally:

> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> d<-c("México","México")
> for(i in 1:7){d1 <- str_sub(d[1],i,i); print(d1)}
[1] "M"
[1] "Ã"
[1] "©"
[1] "x"
[1] "i"
[1] "c"
[1] "o"
> print(grep("é",d))
[1] 1 2

It seems I have to change my computer's regional settings, as suggested here. See also here.

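PS: In case you are wondering how I typed d<-c("México","México") under the English_United States.1252 locale: I set up a secondary Spanish (Traditional Sort) keyboard via Control Panel > Clock, Language and Region > Region and Language > Keyboards and Languages > Change Keyboards, clicking Add under installed services and navigating to Spanish (Traditional Sort). Then, under advanced key settings, you can create a shortcut to switch keyboards; in my case, Shift+Alt. So if I want to type ñ under the default English locale, I press Shift+Alt, then ;, and then Shift+Alt again to return to the English keyboard.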

【Comments】:

    Tags: string r locale diacritics xtable


    【Solution 1】:

    Check what the encodings of temp$dest_nom_ent and "México" are, using Encoding(x). You may need to convert with enc2native or enc2utf8.

    【Comments】:
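
A minimal sketch of that check-then-convert sequence, assuming the bytes are Latin-1 (an assumption, not something the thread establishes). Note that Encoding(x) <- only declares how the existing bytes should be read and does not change them; enc2utf8() then re-encodes based on that declaration.

```r
# The six bytes shown by charToRaw() in the question.
x <- rawToChar(as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f)))

before <- Encoding(x)      # no encoding is declared for these bytes yet
Encoding(x) <- "latin1"    # assumption: declare (not convert) them as Latin-1
x_utf8 <- enc2utf8(x)      # re-encode, using the declared encoding
after <- Encoding(x_utf8)

print(c(before = before, after = after))
print(charToRaw(x_utf8))
```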

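      【Solution 2】:

      Try setting the encoding of the strings to one of "ISO_8859-1" or "ISO_8859-15".

      Two more suggestions, and then I give up: "UTF-16" and "UTF-16LE". The second is little-endian UTF-16, which I believe, and have read, is what Windows 7 actually uses. You might also try "UTF-16BE". (Material from another Stack Exchange post: https://superuser.com/questions/221593/windows-7-utf-8-and-unicode)

      【Comments】: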
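
One way to check the UTF-16 suggestions against the byte dump in the question, sketched here as an illustration: UTF-16 uses two bytes per character for this word, so "México" would occupy twelve bytes, whereas charToRaw() reported six. The variable names are illustrative.

```r
# Encode "México" as UTF-16LE and return the raw bytes rather than a string.
utf16le <- iconv("M\u00e9xico", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
print(utf16le)

# The bytes reported by charToRaw(temp$dest_nom_ent[18]) in the question.
observed <- as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f))

n_bytes <- length(utf16le)
same <- identical(utf16le, observed)
print(n_bytes)
print(same)
```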

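      • The same goes for the other two suggestions.
      【Solution 3】:

      Well, I could not determine the encoding of the accents, but the following achieves what I wanted. The trick is to convert to UTF-8, set the sub() option useBytes=TRUE, and follow Joran's suggestion to use sanitize.text.function=function(x){x} when printing with xtable(). Here is sample code; it is easy to loop over all the accented vowels: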

      > temp1 <- unique(temp$dest_nom_ent)
      > temp1
       [1] "Aguascalientes"                  "Baja California"                
       [3] "Baja California Sur"             "Campeche"                       
       [5] "Coahuila de Zaragoza"            "Colima"                         
       [7] "Chiapas"                         "Guanajuato"                     
       [9] "Guerrero"                        "Hidalgo"                        
      [11] "Jalisco"                         "México"                         
      [13] "Michoacán de Ocampo"             "Morelos"                        
      [15] "Nayarit"                         "Oaxaca"                         
      [17] "Puebla"                          "Querétaro"                      
      [19] "Quintana Roo"                    "San Luis Potosí"                
      [21] "Sinaloa"                         "Tabasco"                        
      [23] "Tlaxcala"                        "Veracruz de Ignacio de la Llave"
      [25] "Zacatecas"                      
      > temp1 <- iconv(unique(temp1),"","UTF-8")
      > temp1
       [1] "Aguascalientes"                  "Baja California"                
       [3] "Baja California Sur"             "Campeche"                       
       [5] "Coahuila de Zaragoza"            "Colima"                         
       [7] "Chiapas"                         "Guanajuato"                     
       [9] "Guerrero"                        "Hidalgo"                        
      [11] "Jalisco"                         "México"                         
      [13] "Michoacán de Ocampo"             "Morelos"                        
      [15] "Nayarit"                         "Oaxaca"                         
      [17] "Puebla"                          "Querétaro"                      
      [19] "Quintana Roo"                    "San Luis Potosí"                
      [21] "Sinaloa"                         "Tabasco"                        
      [23] "Tlaxcala"                        "Veracruz de Ignacio de la Llave"
      [25] "Zacatecas"                      
      > Encoding(temp1)
       [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"
       [8] "unknown" "unknown" "unknown" "unknown" "UTF-8"   "UTF-8"   "unknown"
      [15] "unknown" "unknown" "unknown" "UTF-8"   "unknown" "UTF-8"   "unknown"
      [22] "unknown" "unknown" "unknown" "unknown"
      > temp2 <- sub("é", "\\\\'{e}", temp1, useBytes = TRUE)
      > temp2 <- data.frame(temp2)
      > print(xtable(temp2),sanitize.text.function=function(x){x})
      % latex table generated in R 2.13.1 by xtable 1.5-6 package
      % Fri Jul 15 13:52:44 2011
      \begin{table}[ht]
      \begin{center}
      \begin{tabular}{rl}
        \hline
       & temp2 \\ 
        \hline
      1 & Aguascalientes \\ 
        2 & Baja California \\ 
        3 & Baja California Sur \\ 
        4 & Campeche \\ 
        5 & Coahuila de Zaragoza \\ 
        6 & Colima \\ 
        7 & Chiapas \\ 
        8 & Guanajuato \\ 
        9 & Guerrero \\ 
        10 & Hidalgo \\ 
        11 & Jalisco \\ 
        12 & M\'{e}xico \\ 
        13 & Michoacán de Ocampo \\ 
        14 & Morelos \\ 
        15 & Nayarit \\ 
        16 & Oaxaca \\ 
        17 & Puebla \\ 
        18 & Quer\'{e}taro \\ 
        19 & Quintana Roo \\ 
        20 & San Luis Potosí \\ 
        21 & Sinaloa \\ 
        22 & Tabasco \\ 
        23 & Tlaxcala \\ 
        24 & Veracruz de Ignacio de la Llave \\ 
        25 & Zacatecas \\ 
         \hline
      \end{tabular}
      \end{center}
      \end{table}
      

      This is how it is actually implemented, in a loop:

      temp$dest_nom_ent <- iconv(
              temp$dest_nom_ent,"","UTF-8")
      temp$dest_nom_mun <- iconv(
              temp$dest_nom_mun,"","UTF-8")
      accents <-c("á","é","í","ó","ú")
      latex <-c("\\\\'{a}","\\\\'{e}","\\\\'{i}","\\\\'{o}","\\\\'{u}")
      for(i in 1:5){
          temp$dest_nom_ent<-sub(accents[i], latex[i], 
                  temp$dest_nom_ent, useBytes = TRUE)
          temp$dest_nom_mun<-sub(accents[i], latex[i], 
                  temp$dest_nom_mun, useBytes = TRUE)
      }
      capture.output(
              print(xtable(temp),sanitize.text.function=function(x){x}),
              file = "../paper/rTables.tex", append = FALSE)
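
The same idea can be sketched as a small function. This is an illustration under stated assumptions, not the answer's own method: the helper accents_to_latex() is hypothetical, the source encoding is assumed to be Latin-1, and gsub() is swapped in for sub() so that names with more than one occurrence of the same accented vowel are fully converted. Because both the patterns and the text are proper UTF-8 here, useBytes is left at its default.

```r
# Accented vowels written as Unicode escapes, with their LaTeX replacements.
# The replacement strings are the ones used in the answer above.
accents <- c("\u00e1", "\u00e9", "\u00ed", "\u00f3", "\u00fa")
latex <- c("\\\\'{a}", "\\\\'{e}", "\\\\'{i}", "\\\\'{o}", "\\\\'{u}")

# Hypothetical helper: convert from an assumed source encoding to UTF-8,
# then replace every accented vowel with its LaTeX escape.
accents_to_latex <- function(x, from = "latin1") {
  x <- iconv(x, from = from, to = "UTF-8")
  for (i in seq_along(accents)) {
    x <- gsub(accents[i], latex[i], x)
  }
  x
}

# "México" as stored in the database, rebuilt from the bytes in the question.
raw_name <- rawToChar(as.raw(c(0x4d, 0xe9, 0x78, 0x69, 0x63, 0x6f)))
result <- accents_to_latex(raw_name)
cat(result, "\n")
```

The function expects the raw database bytes; passing a string that is already UTF-8 with from = "latin1" would double-encode it.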
      

      The answer is not complete, though, since I cannot explain what exactly is going on. I found it by trial and error.

      【Comments】:
