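Read xml file without .xml extension into R

【Question title】: Read xml file without .xml extension into R
【Posted】: 2014-08-01 12:54:49
【Question description】:

I am having trouble reading an xml file into R. The problem is that this xml file does not have an .xml extension.

I would normally follow the approach described below: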

library(XML)

xml.url <- "http://www.w3schools.com/xml/plant_catalog.xml"

Parse the xml file using the xmlTreeParse and readLines functions:

xmlfile <- xmlTreeParse(readLines(xml.url))

However, I do not know how to parse the content of the web page below. It does not have an .xml extension.

my_file <- 
  paste0("http://ec.europa.eu/public_opinion/cf/",
         "exp_feed.cfm?keyID=1&nationID=",
         "11,1,27,28,17,2,16,18,13,32,6,3,4,",
         "22,33,7,8,20,21,9,23,31,34,24,12,19,",
         "35,29,26,25,5,14,10,30,15,",
         "&startdate=1973.09&enddate=",
         "2014.06")

my_xml_file <- xmlTreeParse(readLines(my_file))

I get this error:

Input is not proper UTF-8, indicate encoding !
Bytes: 0xE7 0x6F 0x6E 0x20
Error: 1: Input is not proper UTF-8, indicate encoding !
Bytes: 0xE7 0x6F 0x6E 0x20

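So the web page has no extension, and parsing throws an encoding-related error. I tried my luck with the encoding arguments of the functions above, without success.

【Question comments】:

Tags: xml r

【Solution 1】: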

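This has nothing to do with the missing .xml extension. The parser looks at the content, not at the name of the file or URL.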
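To illustrate that the extension plays no role, here is a minimal sketch that writes a small XML document to a temporary file ending in .cfm and parses it with xmlTreeParse. The file contents and the names sample_xml and tmp are made up for this illustration; only the XML package from the question is assumed.

```r
library(XML)

# Hypothetical sample content, loosely modelled on the plant catalog
sample_xml <- c(
  '<?xml version="1.0" encoding="UTF-8"?>',
  "<CATALOG>",
  "  <PLANT><COMMON>Bloodroot</COMMON></PLANT>",
  "</CATALOG>"
)

# Save it under a name that does not end in .xml
tmp <- tempfile(fileext = ".cfm")
writeLines(sample_xml, tmp)

# The parser only inspects the content, so this parses normally
doc <- xmlTreeParse(tmp)
root_name <- xmlName(xmlRoot(doc))
print(root_name)

unlink(tmp)
```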

The problem seems to be the encoding of the file. Things start to go wrong around this region:

xx <- readLines(my_file)
xx[114633:114646]

The XML parser does not accept this as valid UTF-8.

You can convert the data in R:

yy <- iconv(xx, to="UTF-8")
my_xml_file <- xmlTreeParse(yy)

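Note: this takes out the lines containing bad bytes (iconv returns NA for them), which means you will lose data. The lost lines are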

which(is.na(yy))
# [1] 114637 114643 114685 114755 114776 114832 
# [7] 114881 114895 114902 115422 115429 115436
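As a locale-independent alternative for locating such lines, base R's validUTF8() can flag strings that are not valid UTF-8 without converting anything. The sketch below uses a made-up three-line vector (the name lines_in and its contents are illustrative, not taken from the feed) in which the second element contains the byte 0xE7 from the error message.

```r
# Hypothetical input: element 2 holds a lone 0xE7 byte, which is not valid UTF-8
lines_in <- c(
  "<a>ok</a>",
  rawToChar(as.raw(c(0x3C, 0x62, 0x3E, 0xE7, 0x3C, 0x2F, 0x62, 0x3E))),
  "<c>fine</c>"
)

# Indices of the lines an XML parser would reject as UTF-8
is_valid <- validUTF8(lines_in)
bad_idx <- which(!is_valid)
print(bad_idx)

# Logical indexing keeps every line when nothing is bad,
# whereas x[-which(cond)] returns an empty vector in that case
kept <- lines_in[is_valid]
```

Using a logical index rather than a negative which() avoids silently dropping everything when no line is flagged.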

So with

my_xml_file <- xmlTreeParse(xx[-which(is.na(yy))])

luckily your file can still be parsed without the missing lines.
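If losing lines is not acceptable, one option is to tell iconv the source encoding explicitly. The bytes in the error message (0xE7 0x6F 0x6E 0x20) read as "çon " in ISO-8859-1 (Latin-1), which suggests, but does not prove, that the feed is Latin-1 encoded. The sketch below works on that assumption with a made-up line (raw_line) built around those four bytes.

```r
# The four bytes reported by the parser
err_bytes <- as.raw(c(0xE7, 0x6F, 0x6E, 0x20))

# Hypothetical line containing them; the surrounding markup is invented
raw_line <- rawToChar(c(charToRaw("<Name>gar"), err_bytes, charToRaw("</Name>")))

# Not valid UTF-8 as it stands
before_ok <- validUTF8(raw_line)

# Assumption: the source is Latin-1, so re-encode instead of dropping the line
fixed <- iconv(raw_line, from = "latin1", to = "UTF-8")
after_ok <- validUTF8(fixed)

print(c(before = before_ok, after = after_ok))
```

Applied to the whole vector, the same call would be iconv(xx, from = "latin1", to = "UTF-8"), and the result could then be passed to xmlTreeParse. If the feed actually uses a different encoding, the converted text will contain wrong characters, so it is worth checking a few known strings.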

【Comments】:

【Solution 2】:

Try importing it into R with httr first, then let the content function convert it into a more usable format:

library('httr')
library('XML')  # provides xmlToList
my_file <- 
  paste0("http://ec.europa.eu/public_opinion/cf/",
         "exp_feed.cfm?keyID=1&nationID=",
         "11,1,27,28,17,2,16,18,13,32,6,3,4,",
         "22,33,7,8,20,21,9,23,31,34,24,12,19,",
         "35,29,26,25,5,14,10,30,15,",
         "&startdate=1973.09&enddate=",
         "2014.06")
x <- GET(my_file)
z <- xmlToList(content(x))

Result:

> str(z, 3)
List of 1
 $ Table:List of 2
  ..$ Grid       :List of 35
  .. ..$ AxisZ:List of 2
  .. ..$ AxisZ:List of 2
  .. ..$ AxisZ:List of 2
  .. ..$ AxisZ:List of 2
  .. ..$ AxisZ:List of 2
  .. ..$ AxisZ:List of 2
  .. ..$ AxisZ:List of 2
  .. ..$ AxisZ:List of 2
  .. ..$ AxisZ:List of 2
  .. ..$ AxisZ:List of 2
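To show what xmlToList does without depending on the network, here is a small sketch that parses an inline XML string and converts it to a nested list. The string txt is invented for illustration; it only borrows the element names Table, Grid and AxisZ from the output above, and the Name and Value children are hypothetical.

```r
library(XML)

# Hypothetical miniature of the feed structure
txt <- paste0(
  "<Table><Grid>",
  "<AxisZ><Name>a</Name><Value>1</Value></AxisZ>",
  "<AxisZ><Name>b</Name><Value>2</Value></AxisZ>",
  "</Grid></Table>"
)

# Parse from text rather than from a file or URL
doc <- xmlParse(txt, asText = TRUE)

# Convert the tree into nested R lists; leaf elements become strings
z <- xmlToList(doc)
str(z, 3)

# Individual values are reached by name or position
first_name <- z$Grid[[1]]$Name
```

Depending on the httr version, content(x) may return an xml2 document rather than an XML-package object. In that case, one option is to fetch the body as text with content(x, as = "text") and pass it to xmlParse(..., asText = TRUE) as above.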

【Comments】: