【Posted】: 2018-01-16 05:48:09
【Question】:
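I have a file that explicitly declares itself to be UTF-8, and the Unix command file -i reports it as UTF-8 encoded, but when I load it into R (using readr with UTF-8 encoding) I can still clearly tell that the multibyte characters are wrong. When I specify "Windows-1252" as the encoding (based on this chart, I'm fairly sure that is what it was originally encoded as), I get even more incorrect characters.
I think what happened is that someone saved these already-incorrect characters as UTF-8. Is there a way to recover the original text?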
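To make that hypothesis concrete, here is a minimal base-R sketch of how such damage could arise. It assumes the wrong encoding was ISO-8859-1 (it may equally have been Windows-1252), it uses a made-up string rather than the real test.txt, and misread() is a hypothetical helper name. The point is only that iconv() works on the raw bytes of a string, so treating UTF-8 bytes as ISO-8859-1 and converting "back" to UTF-8 makes each non-ASCII character grow.

```r
# Hypothetical clean input; this does not read the original test.txt.
clean <- enc2utf8("Prov\u00EDncia")

# One round of damage: the UTF-8 bytes are treated as ISO-8859-1 and
# converted to UTF-8 again (assumed mechanism, not confirmed for the file).
misread <- function(x) iconv(x, from = "ISO-8859-1", to = "UTF-8")

once  <- misread(clean)
twice <- misread(once)

# Compare bytes rather than printed text: some of the resulting characters
# (U+0083, U+00AD) are usually invisible when printed.
bytes_clean <- charToRaw(clean)
bytes_once  <- charToRaw(once)
bytes_twice <- charToRaw(twice)

print(bytes_clean)
print(bytes_once)
print(bytes_twice)
```

Each round turns every byte at or above 0x80 into a two-byte sequence, so the file stays valid UTF-8 (which is why file -i is satisfied) while the text gets further from the original.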
Here are my attempts to fix it by specifying the encoding:
library(curl)
library(readr)
#>
#> Attaching package: 'readr'
#> The following object is masked from 'package:curl':
#>
#> parse_date
text_file <- tempfile()
curl_download("https://dl.dropboxusercontent.com/s/7syikmmiduubsqv/test.txt", text_file)
# Default is UTF-8, other specifications add extra characters
read_lines(text_file)
#> [1] "{ProvÃÂncia}"
# read_lines(text_file, locale = locale(encoding = "UTF-8")) # same
read_lines(text_file, locale = locale(encoding = "Windows-1252"))
#> [1] "{ProvÃ<U+0083>ÂÂncia}"
read_lines(text_file, locale = locale(encoding = "latin1"))
#> [1] "{ProvÃ<U+0083>ÂÂncia}"
# Same as equivalent readr code
# readLines(text_file)
# readLines(text_file, encoding = "UTF-8")
# readLines(text_file, encoding = "UTF-8-BOM")
# readLines(text_file, encoding = "Windows-1252")
# Desired text: "{Prov\u00EDncia}"
Update
Reversing the encoding (as in, e.g., the Stat545 example) doesn't work:
iconv(read_lines(text_file), from = "UTF-8", to = "Latin1")
#> [1] "{ProvÃncia}"
iconv(read_lines(text_file), from = "UTF-8", to = "Windows-1252")
#> [1] "{ProvÃncia}"
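The output above is shorter than the input, which would be consistent with one layer of damage having been removed and at least one more remaining. A minimal sketch of peeling layers repeatedly follows. It assumes each layer was an ISO-8859-1 misreading, it builds a stand-in for the damaged line from bytes instead of reading test.txt, and peel_layer() is a hypothetical helper, not part of readr or base R.

```r
# Hypothetical stand-in for the damaged line: "{Prov", then the bytes that
# U+00ED becomes after two rounds of ISO-8859-1 misreading, then "ncia}".
garbled <- rawToChar(as.raw(c(
  0x7b, 0x50, 0x72, 0x6f, 0x76,
  0xc3, 0x83, 0xc2, 0x83, 0xc3, 0x82, 0xc2, 0xad,
  0x6e, 0x63, 0x69, 0x61, 0x7d
)))
Encoding(garbled) <- "UTF-8"

# Undo one round: convert UTF-8 to ISO-8859-1, then relabel the resulting
# bytes as UTF-8 so the next pass (or printing) interprets them that way.
peel_layer <- function(x) {
  out <- iconv(x, from = "UTF-8", to = "ISO-8859-1")
  Encoding(out) <- "UTF-8"
  out
}

repaired <- peel_layer(peel_layer(garbled))

# Check against the desired text at the byte level.
desired <- enc2utf8("{Prov\u00EDncia}")
print(identical(charToRaw(repaired), charToRaw(desired)))
```

How many passes a real file needs is not known in advance. By default iconv() returns NA for an element it cannot convert, so a pass that yields NA is a sign that the previous result should be kept. If the misreading was Windows-1252 rather than ISO-8859-1, the to argument would need to change accordingly.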
【Comments】:
Tags: r encoding utf-8 character-encoding