Recoding a data.frame object from latin1 to utf-8: answers

【Question title】: Recoding data.frame object from latin1 to utf-8
【Posted】: 2016-04-02 11:14:41
【Question description】:

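I am on Windows 7 and work with data that contain accented characters (my system locale: "LC_COLLATE=French_France.1252").
My data are ANSI-encoded, which lets me view them correctly in the RStudio viewer tab.

My problem: when I create a googleVis page (encoded in utf-8), the accented characters are not displayed correctly.

What I expect: I would like to convert my latin1 data.frames to utf-8 in R before creating the googleVis page. I am out of ideas. The stringi package seems to work only on raw data.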

fr <- data.frame(âge = c(15,20), prénom = c("Adélia", "Adão"), row.names = c("I1", "I2"))

print (fr)

library (googleVis)

test <- gvisTable(fr)
plot(test)

Real data: https://drive.google.com/open?id=0B91cr4hfMXV4OEkzWk1aWlhvR0E

# importing (historical data)
test_ansi<-read.table("databig_ansi.csv",
                header=TRUE, sep=",",
                na.strings="",
                quote = "\"",
                dec=".") 

# subsetting 
library (dplyr)
test_ansi <- 
   test_ansi %>%
   count(ownera)

# encoding detection
library (stringi)

stri_enc_detect(test_ansi$ownera)

# visualisation
library (googleVis)

testvis <- gvisTable(test_ansi)
plot(testvis)

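【Question comments】:

Tags: r utf-8 character-encoding latin1 googlevis

【Solution 1】:

Several packages provide built-in functions for this, such as stringi, stringr, SoundexBR and tau, and base R has its own character conversion function, which can be used as: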

text2 <- iconv(text, from = "latin1", to = "UTF-8")
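As a quick check of what that call does, here is a minimal sketch in base R. The sample string is built from raw bytes (an assumption made only so the example does not depend on how this text itself is encoded); it is not part of the original answer.

```r
# "Adélia" in latin1: the byte e9 is "é". Built from raw bytes so the
# example does not depend on the encoding of the script file.
x <- rawToChar(as.raw(c(0x41, 0x64, 0xe9, 0x6c, 0x69, 0x61)))
Encoding(x) <- "latin1"   # declare how the bytes should be interpreted

# Re-encode the bytes from latin1 to UTF-8
y <- iconv(x, from = "latin1", to = "UTF-8")

Encoding(y)    # declared encoding of the converted string
charToRaw(y)   # "é" is now stored as the two bytes c3 a9
```

Comparing `charToRaw()` before and after is a simple way to confirm that the bytes really changed, independently of how the console happens to display them.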

You may also want a more specific function that includes some checks for factors, as follows:

.fromto <- function (x, from, to)
{
    if (is.list(x)) {
        # Lists (including data frames): recurse, then restore attributes
        xattr <- attributes(x)
        x <- lapply(x, .fromto, from, to)
        attributes(x) <- xattr
    } else {
        if (is.factor(x)) {
            # Factors: only the levels hold text
            levels(x) <- iconv(levels(x), from, to, sub = "byte")
        } else {
            if (is.character(x))
                x <- iconv(x, from, to, sub = "byte")
        }
        # Convert a "label" attribute if one is present
        lb <- attr(x, "label")
        if (length(lb) > 0) {
            attr(x, "label") <- iconv(attr(x, "label"), from, to, sub = "byte")
        }
    }
    x
}

# This will convert a vector from any encoding into UTF-8
Latin2UTF8 <- function (x, from = "WINDOWS-1252")
{
    .fromto(x, from, "UTF-8")
}

Then you simply use it as:

Latin2UTF8(fr)
 âge prénom
I1  15 Adélia
I2  20   Adão
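The helper above restores the saved attributes of the data frame unchanged, so column names such as `âge` are not themselves re-encoded. The following is a simpler, flat variant in base R that also converts the column names. The name `df_to_utf8`, the `latin1()` helper and the sample data are hypothetical, introduced only for illustration.

```r
# Hypothetical helper (not from the answer): convert character columns,
# factor levels and column names of a data frame to UTF-8.
df_to_utf8 <- function(df, from = "latin1") {
  conv <- function(v) iconv(v, from = from, to = "UTF-8")
  for (j in seq_along(df)) {
    if (is.character(df[[j]])) {
      df[[j]] <- conv(df[[j]])
    } else if (is.factor(df[[j]])) {
      levels(df[[j]]) <- conv(levels(df[[j]]))
    }
  }
  names(df) <- conv(names(df))
  df
}

# Build latin1 test strings from raw bytes so the script encoding is irrelevant
latin1 <- function(bytes) {
  s <- rawToChar(as.raw(bytes))
  Encoding(s) <- "latin1"
  s
}
adelia <- latin1(c(0x41, 0x64, 0xe9, 0x6c, 0x69, 0x61))  # "Adélia"
adao   <- latin1(c(0x41, 0x64, 0xe3, 0x6f))              # "Adão"

fr <- data.frame(age = c(15, 20), name = c(adelia, adao),
                 stringsAsFactors = FALSE)
fr_utf8 <- df_to_utf8(fr)

Encoding(fr_utf8$name)   # declared encoding of each converted value
```

Unlike `.fromto`, this sketch does not recurse into nested lists or handle `label` attributes; it only covers the plain data frame case from the question.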

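Additional edit after the extra information and data

This is how my R is set up. By default, my R runs in a UTF-8 locale, in English. Whenever my system environment differs from the encoding of the supplied file, I use fileEncoding = "LATIN1". That's all.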

> Sys.getlocale()
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"


test_ansi<-read.table(file.choose(),
                       header=TRUE, sep=",",
                       na.strings="",
                        quote = "\"",
                        dec=".", fileEncoding = "LATIN1")
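As far as I understand it, `fileEncoding` re-encodes the input to the session's native encoding, which is UTF-8 in the setup shown here but presumably still code page 1252 on the asker's Windows session. A related sketch that does not depend on the session locale uses the `encoding` argument of `read.table` to mark the strings as latin1 and then converts them explicitly with `enc2utf8()`. This is a different technique from the one in the answer; the temporary file and its contents are made up for the example.

```r
# Write a tiny latin1 CSV from raw bytes (hypothetical sample data)
tmp <- tempfile(fileext = ".csv")
writeBin(as.raw(c(
  0x6f, 0x77, 0x6e, 0x65, 0x72, 0x61, 0x0a,   # header line: "ownera"
  0x41, 0x64, 0xe9, 0x6c, 0x69, 0x61, 0x0a    # data line: "Adélia" (e9 = é)
)), tmp)

# encoding = "latin1" marks the strings as latin1 without re-encoding them
d <- read.table(tmp, header = TRUE, sep = ",", quote = "\"",
                stringsAsFactors = FALSE, encoding = "latin1")

# Explicit conversion of the marked strings to UTF-8
d$ownera <- enc2utf8(d$ownera)

Encoding(d$ownera)   # declared encoding after conversion
unlink(tmp)          # remove the temporary file
```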

> test_ansi2 <- 
+     test_ansi %>%
+     count(ownera)
> test_ansi2
Source: local data frame [6,482 x 2]

                ownera n
1       Abautret (Vve) 1
2              Abazuza 1
3            Abernathy 1
4  Abrahamsen, Heerman 1
5  Abrahamsen, Hereman 6
6   Abrahamsz, Heerman 2
7         Abram, Ralph 8
8      Abrams, Heerman 2
9            Abranches 1
10               Abreu 1
..                 ... .

# visualisation
library (googleVis)


testvis <- gvisTable(test_ansi)
plot(testvis)

Link to the table created

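【Comments】:

@Wilcar What exactly is not working? What is the error message? It works fine for me. Perhaps you can read the data in the desired format directly, or temporarily change your system's locale encoding. I will add some extra material to my answer.
Here are 3 screenshots: RStudio displays the ANSI data well, but googleVis does not (look at the names with accents). Maybe I don't know how to use your function? Screenshots: import ansi (i.stack.imgur.com/F5JHI.jpg), Rstudio, GoogleVis (i.stack.imgur.com/gYr16.jpg)
I posted 3 screenshots in my last comment. Maybe I am not using your function correctly: after recoding my data frame, the data still display fine in RStudio (system: Windows-1252).
This would be awkward, but maybe the problem is not how R reads and stores the strings, but the system locale that googleVis inherits from it.
Note that in my last edit I did not even use the function I proposed. Since I run in a UTF-8 locale, I simply load the data with the correct encoding. What do you get if you run library(stringi); stri_enc_detect(test_ansi); stri_enc_isutf8(test_ansi)? You may be hitting an encoding round trip, which could confuse googleVis.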
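For readers without stringi installed, a rough base-R counterpart of the check suggested in the last comment is sketched below. It uses `validUTF8()` and `Encoding()` on a character vector (for a data frame, apply it to one column such as `test_ansi$ownera`). The sample strings are assumptions built from raw bytes.

```r
# "Adélia" in latin1, built from raw bytes
x_latin1 <- rawToChar(as.raw(c(0x41, 0x64, 0xe9, 0x6c, 0x69, 0x61)))
Encoding(x_latin1) <- "latin1"

# The same text converted to UTF-8
x_utf8 <- enc2utf8(x_latin1)

v <- c(x_latin1, x_utf8, "Abreu")

validUTF8(v)   # are the bytes valid when read as UTF-8?
Encoding(v)    # declared encoding of each element
```

Plain ASCII names such as "Abreu" are valid in both encodings, so only the accented values reveal which encoding a column actually carries.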