【Question title】: R encoding - Saved as UTF-8 with wrong characters (I think)
【Posted】: 2018-01-16 05:48:09
【Question description】:

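I have a file that explicitly declares itself as UTF-8, and the unix command file -i reports it as UTF-8, but when I load it into R (using readr with UTF-8 encoding) I can clearly see that the multibyte characters are wrong. When I specify "Windows-1252" as the encoding instead (based on this chart, I'm fairly sure that is what it was originally encoded as), I get even more incorrect characters.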

I think what happened is that someone saved these incorrect characters as UTF-8. Is there a way to recover the original text?
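
To make that hypothesis concrete, here is a minimal sketch of how a single character could end up double-encoded. It assumes the intended character is "í" (U+00ED) and that the misreading step used Windows-1252; the variable names are illustrative only.

```r
# The intended character and its two UTF-8 bytes (c3 ad)
intended <- "\u00ED"
bytes <- charToRaw(enc2utf8(intended))

# Misread those bytes as Windows-1252: each byte becomes its own character
misread <- iconv(rawToChar(bytes), from = "Windows-1252", to = "UTF-8")

# Code points of the result: 195 ("Ã") and 173 (soft hyphen, invisible)
utf8ToInt(misread)

# Saved as UTF-8 again, the one character now occupies four bytes
charToRaw(misread)
```

If the file's bytes for the accented letter look like the four-byte sequence above, the text was most likely converted once too often rather than saved in a different encoding.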

Here are my attempts to fix it by specifying the encoding:

library(curl)
library(readr)
#> 
#> Attaching package: 'readr'
#> The following object is masked from 'package:curl':
#> 
#>     parse_date

text_file <- tempfile()
curl_download("https://dl.dropboxusercontent.com/s/7syikmmiduubsqv/test.txt", text_file)


# Default is UTF-8, other specifications add extra characters
read_lines(text_file)
#> [1] "{ProvÃ­ncia}"
# (mojibake: "Ã" plus an invisible soft hyphen, U+00AD, where "í" should be)
# read_lines(text_file, locale = locale(encoding = "UTF-8")) # same
read_lines(text_file, locale = locale(encoding = "Windows-1252"))
#> [1] "{ProvÃ<U+0083>­ncia}"
read_lines(text_file, locale = locale(encoding = "latin1"))
#> [1] "{ProvÃ<U+0083>­ncia}"

# Same as equivalent readr code
# readLines(text_file)
# readLines(text_file, encoding = "UTF-8")
# readLines(text_file, encoding = "UTF-8-BOM")
# readLines(text_file, encoding = "Windows-1252")

# Desired text: "{Prov\u00EDncia}"

Update

Reversing the encoding (as in, for example, the Stat545 example) does not work:

iconv(read_lines(text_file), from = "UTF-8", to = "Latin1")
#> [1] "{ProvÃ­ncia}"
iconv(read_lines(text_file), from = "UTF-8", to = "Windows-1252")
#> [1] "{ProvÃ­ncia}"

【Question comments】:

    Tags: r encoding utf-8 character-encoding


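    【Solution 1】:

    Well, I imagine there is a better way to solve this, but until someone posts one, here is a solution that builds the lookup table from the website and uses it to replace the mangled text.

    (Requires stringr)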


    # Create the Debugging table from http://www.i18nqa.com/debug/utf8-debug.html
    # UTF-8 characters were interpreted as Windows-1252 and then saved
    # as UTF-8
    create_utf_crosswalk <- function() {
      # Affects Windows-1252 0x80 - 0xFF (but a few code points aren't in
      # the spec, so remove them: 0x81, 0x8D, 0x8F, 0x90 and 0x9D)
      hex_codes <- sprintf("%x", seq(strtoi("0x80"), strtoi("0xFF")))
      hex_codes <- hex_codes[!hex_codes %in% c("81", "8d", "8f", "90", "9d")]
    
      # Build each single-byte string in the session's native encoding
      # (this appears to assume a Windows-1252 locale)
      actual_chars_locale <- vapply(hex_codes, FUN.VALUE = character(1), function(x) {
        parse(text = paste0("'\\x", x, "'"))[[1]]
      })
    
      actual_chars_utf <- iconv(actual_chars_locale, to = "UTF-8")
    
      # A value other than "latin1", "UTF-8" or "bytes" clears the encoding
      # mark, so the UTF-8 bytes are then read in the native encoding
      mangled_chars_utf <- vapply(actual_chars_utf, FUN.VALUE = character(1),
                                  function(x) {
                                    Encoding(x) <- "Windows-1252"
                                    x
                                  })
    
      out <- actual_chars_utf
      names(out) <- mangled_chars_utf
      out
    }
    
    text_file <- tempfile()
    curl::curl_download("https://dl.dropboxusercontent.com/s/7syikmmiduubsqv/test.txt", text_file)
    test_text <- readr::read_lines(text_file)
    
    utf_fix <- create_utf_crosswalk()
    
    stringr::str_replace_all(test_text, utf_fix)
    #> [1] "{Província}"
    

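    Update

    I came up with a direct solution that works on the sample text but not on the full file (perhaps I'm not specifying exactly the right file encoding).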

    text <- readLines("https://dl.dropboxusercontent.com/s/7syikmmiduubsqv/test.txt")
    
    fixed <- iconv(text, from = "UTF-8", to = "Windows-1252")
    Encoding(fixed) <- "UTF-8"
    
    fixed
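
One possible reason the direct approach breaks on a larger file is that iconv() returns NA for any string containing a character that does not exist in the target encoding, and strings that were never mangled do not survive the round trip as valid UTF-8. The sketch below wraps the same two steps in a hypothetical helper, fix_mojibake() (not part of any package), that repairs only the elements where the round trip succeeds and leaves the rest untouched. It is an illustration under those assumptions, not a tested fix for the full file.

```r
# Hypothetical helper: undo one round of "UTF-8 bytes misread as Windows-1252"
fix_mojibake <- function(x, wrong_encoding = "Windows-1252") {
  x <- enc2utf8(x)
  # Map each character back to the single byte it was misread from;
  # elements with characters outside wrong_encoding come back as NA
  candidate <- iconv(x, from = "UTF-8", to = wrong_encoding)
  # Relabel the recovered bytes as UTF-8 without changing them
  Encoding(candidate) <- "UTF-8"
  # Keep the repair only where conversion worked and the bytes are valid UTF-8
  ok <- !is.na(candidate) & validUTF8(candidate)
  x[ok] <- candidate[ok]
  x
}

# Mangled text, plain ASCII, correctly encoded text, and text that
# cannot be represented in Windows-1252 at all
mixed <- c("{Prov\u00C3\u00ADncia}", "plain ascii", "caf\u00E9", "\u4E2D\u6587")
fix_mojibake(mixed)
```

Only the first element should change; the other three are returned as they were, because their round trip either fails or does not produce valid UTF-8.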
    

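    【Comments】:

    • The last code block finally solved my problem!
    【Solution 2】:

    I'm not allowed to comment because of my low reputation, but your function helped me. I'm posting because the function as posted contained a couple of errors (a misplaced bracket, and actual_chars_current is undefined).

    Edited version: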

    create_utf_crosswalk <- function() {
      # Affects Windows-1252 0x80 - 0xFF (but a few code points aren't in
      # the spec, so remove them)
      hex_codes <- sprintf("%x", seq(strtoi("0x80"), strtoi("0xFF")))
      hex_codes <- hex_codes[!hex_codes %in% c("81", "8d", "8f", "90", "9d")]
    
      actual_chars_locale <- vapply(hex_codes, FUN.VALUE = character(1), function(x) {
        parse(text = paste0("'\\x", x, "'"))[[1]]
      })
    
      actual_chars_utf <- iconv(actual_chars_locale, to = "UTF-8")
    
      mangled_chars_utf <- vapply(actual_chars_utf, FUN.VALUE = character(1),
                                  function(x) {
                                    Encoding(x) <- "Windows-1252"
                                    x
                                  })
    
      out <- actual_chars_utf
      names(out) <- mangled_chars_utf
      out
    }
    

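    【Comments】:

    • Great, glad you were able to get it working, and thanks for posting!
    • My condolences as well on running into encoding errors; these things are really painful.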