Why do I get garbled characters? Answers

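【Question title】: Why do I get garbled characters?
【Posted】: 2012-09-02 03:13:49
【Question description】:

Why do I get garbled characters when parsing a web page?

I have already tried encoding="big-5\\IGNORE" to get normal characters, but it does not work.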

require(XML)
url="http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
options(encoding="big-5")
data=htmlParse(url,isURL=TRUE,encoding="big-5\\IGNORE")
tdata=xpathApply(data,"//table[@class='table_grey_border']")
stock <- readHTMLTable(tdata[[1]], header=TRUE, stringsAsFactors=FALSE)

How should I modify my code so that the garbled characters display correctly?

@MartinMorgan (below) suggested using

htmlParse(url,isURL=TRUE,encoding="big-5")

Here is an example:

require(XML)
url="http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
options(encoding="big-5")
data=htmlParse(url,isURL=TRUE,encoding="big-5")
tdata=xpathApply(data,"//table[@class='table_grey_border']")
stock <- readHTMLTable(tdata[[1]], header=TRUE, stringsAsFactors=FALSE)
stock

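The total number of records should be 1335. The example above returns only 309, so many records appear to have been lost.

This is a complicated problem, with several separate issues:

Malformed HTML file

The page is not standard: it is not a well-formed HTML file. Let me demonstrate.
Please run: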

url="http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
txt=download.file(url,destfile="stockbig-5",quiet = TRUE)

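How does the downloaded file stockbig-5 look when you open it with Firefox?

Problem with the iconv function in R
If the HTML file is well-formed, you can use

data=readLines(file)  # file: path to the downloaded HTML file
datachange=iconv(data,from="source encoding",to="target encoding//IGNORE")  # placeholders for the real encoding names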
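
As a minimal sketch of the readLines() plus iconv() pattern on well-formed input, the snippet below writes the two Big5-encoded characters 中文 to a temporary file and converts them to UTF-8. The temporary file and its contents are made up for illustration. The sketch also assumes options(encoding=...) is left at its default, so that readLines() does not try to re-encode while reading.

```r
# Made-up test file: "中文" in Big5 is the byte sequence A4 A4 A4 E5.
tmp <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0xa4, 0xa4, 0xa4, 0xe5, 0x0a)), tmp)

# Read the lines as they are; no decoding is requested here.
data <- readLines(tmp, warn = FALSE)

# Convert from Big5 to UTF-8. With well-formed input no //IGNORE is needed.
datachange <- iconv(data, from = "BIG5", to = "UTF-8")

print(datachange)
```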

When the HTML file is not well-formed, the same approach fails. In this example,
please run:

data=readLines("stockbig-5")

A warning is raised:

1: In readLines("stockbig-5") :  
  invalid input found on input connection 'stockbig-5'

You cannot use the iconv function in R to change the encoding of a malformed HTML file.

You can, however, do it in the shell.
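
For reference, R's iconv() returns NA by default when an input string contains bytes that are invalid in the source encoding, and its sub argument can replace such bytes instead. The sketch below illustrates this on a made-up byte string, not on data from the HKEX page. Whether this alone is enough for the page in question has not been verified here.

```r
# "中文" in Big5 is A4 A4 A4 E5; 0xFF is inserted as a deliberately invalid byte.
bad <- rawToChar(as.raw(c(0xa4, 0xa4, 0xff, 0xa4, 0xe5)))

# Default (sub = NA): the whole element becomes NA when conversion fails.
strict <- iconv(bad, from = "BIG5", to = "UTF-8")

# sub = "": bytes that cannot be converted are dropped,
# similar in spirit to the //IGNORE suffix of command-line iconv.
lenient <- iconv(bad, from = "BIG5", to = "UTF-8", sub = "")

print(is.na(strict))
print(lenient)
```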

【Comments on the question】:

The XML contains <meta http-equiv="Content-Type" content="text/html; charset=big5" />. Why are you parsing it as though it had the gb2312 character set?
I tried big-5, and the output is still garbled.
The output of sessionInfo() would be a useful addition to the question.
> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: i486-pc-linux-gnu (32-bit)
locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
 [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=C LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] XML_3.9-4

Tags: r parsing encode

【Solution 1】:

I spent a whole night solving this myself; it was hard.
System: debian6 (locale utf-8) + R 2.15 (locale utf-8) + gnome terminal (locale utf-8).
The code is as follows:

require(XML)
url="http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
txt=download.file(url,destfile="stockbig-5",quiet = TRUE)
system('iconv -f big-5  -t  UTF-8//IGNORE    stockbig-5  > stockutf-8')
data=htmlParse("stockutf-8",isURL=FALSE,encoding="utf-8\\IGNORE")
tdata=xpathApply(data,"//table[@class='table_grey_border']")
stock <- readHTMLTable(tdata[[1]], header=TRUE, stringsAsFactors=FALSE)
stock

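I would like my code to be more elegant; a shell command inside R code is rather ugly:

system('iconv -f big5 -t UTF-8//IGNORE stockbig-5 > stockutf-8')

I tried to replace it with pure R code and failed. How can it be replaced with pure R code? You can use the code above to reproduce the result on your own computer. The job is half done and half successful; I will keep trying.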
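
One possible pure-R stand-in for the shell command is sketched below. It reads the file as raw bytes, converts with iconv() and sub = "" to drop invalid bytes, and writes the result back out. The helper name convert_big5_file is hypothetical and not from the original post, the test file is made up, and the approach has not been checked against the HKEX page itself, so treat it as a starting point rather than a confirmed fix.

```r
# Hypothetical helper, not from the original post: a pure-R stand-in for
#   iconv -f big-5 -t UTF-8//IGNORE infile > outfile
convert_big5_file <- function(infile, outfile) {
  # Read raw bytes so that no decoding happens while reading.
  bytes <- readBin(infile, what = "raw", n = file.info(infile)$size)
  # Assumes the file has no embedded NUL bytes; rawToChar() errors on them.
  txt <- rawToChar(bytes)
  # sub = "" drops bytes that are not valid Big5.
  utf8 <- iconv(txt, from = "BIG5", to = "UTF-8", sub = "")
  # useBytes = TRUE writes the UTF-8 bytes unchanged, whatever the locale.
  writeLines(utf8, outfile, useBytes = TRUE)
  invisible(utf8)
}

# Self-contained check on a tiny made-up file with one invalid byte (0xFF).
infile <- tempfile()
outfile <- tempfile()
writeBin(as.raw(c(0xa4, 0xa4, 0xff, 0xa4, 0xe5, 0x0a)), infile)
converted <- convert_big5_file(infile, outfile)

# With the XML package, the result could then be parsed along these lines:
# doc <- XML::htmlParse(converted, asText = TRUE, encoding = "UTF-8")
```

For the file from the question, the call would be convert_big5_file("stockbig-5", "stockutf-8"), after which the existing htmlParse("stockutf-8", ...) step could stay as it is.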

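【Comments】:

This data=htmlParse(url, isURL=TRUE, encoding="big5") (no separate file download, just the big5 encoding) works for me, with a similar sessionInfo().
If you use data=htmlParse(url, isURL=TRUE, encoding="big5"), you lose many records: you get only some of the companies, and many are missing.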