【Question title】: R - How to convert XML to dataframe in R with the correct structure?
【Posted】: 2015-12-14 17:30:46
【Question description】:

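I want to convert an XML file to a data frame. I found some functions that let me read the XML data, but I cannot get a data frame with the same structure as the original XML file (that is, the structure you get when you open the XML file in Excel).

Here is my original XML: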

<Data>
<Frame timestamp='17/09/2014  20:55:00.902' timecode='75299902' >
<Object type='Taxi' DISTANCE='3037' VOLUME='1668' id='15593' code='0' />
<Object type='Taxi' DISTANCE='3605' VOLUME='931' id='15603' code='4' />
<Object type='Bus' DISTANCE='3563' VOLUME='488' id='15604' code='9' />
<Object type='Taxi' DISTANCE='4942' VOLUME='57' id='15624' code='1' />
<Object type='Taxi' DISTANCE='784' VOLUME='47' id='15625' code='10' />
<Object type='Taxi' DISTANCE='3301' VOLUME='2041' id='15626' code='42' />
<Object type='Bus' DISTANCE='2040' VOLUME='2945' id='15630' code='27' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
<Frame timestamp='17/09/2014 20:54:59.771' timecode='75299771' >
<Object type='Taxi' DISTANCE='4941' VOLUME='51' id='15624' code='1' />
<Object type='Taxi' DISTANCE='789' VOLUME='47' id='15625' code='10' />
<Object type='Taxi' DISTANCE='3300' VOLUME='2069' id='15626' code='42' />
<Object type='Bus' DISTANCE='2027' VOLUME='2947' id='15630' code='27' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
</Data>

This already gets me the data as a list:

library(XML)

# Convert xml data to R
data <- xmlTreeParse(file="c:/R/CL/filename.xml", useInternalNodes=TRUE)
# Create a list of the data
xl <- xmlToList(data)

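Ideally, I would like a data frame built from this XML that looks the same as when the XML is loaded into Excel. However, when I look at the output of xl, I see that it is organised separately by object and by time. When I open the XML file in Excel, this information is linked: every object row also has columns holding the time information.

Here is xl: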

$Frame$Object
     type         DISTANCE         VOLUME        id       code 
"Taxi"    "3037"    "1668"   "15593"       "0" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "3605"   "931" "15603"     "4" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
  "Bus"  "3563"   "488" "15604"     "9" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "2161"  "1592" "15615"    "21" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "4942"    "57" "15624"     "1" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"   "784"    "47" "15625"    "10" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "3301"  "2041" "15626"    "42" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
  "Bus"  "2040"  "2945" "15630"    "27" 


$Frame$Object
  type      DISTANCE      VOLUME      Z 
"Airplane" "2865" "2722"    "0" 

$Frame$Time
                timestamp                  timecode 
"17/09/2014 20:54:59.902"                "75299902"

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "4941"    "51" "15624"     "1" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"   "789"    "47" "15625"    "10" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
 "Taxi"  "3300"  "2069" "15626"    "42" 

$Frame$Object
   type       DISTANCE       VOLUME      id     code 
  "Bus"  "2027"  "2947" "15630"    "27" 

$Frame$Object
  type      DISTANCE      VOLUME      Z 
"Airplane" "2865" "2722"    "0" 

$Frame$Time
                timestamp                  timecode 
"17/09/2014 20:54:59.771"                "75299771"

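This list contains two table-like structures: Frame$Object and Frame$Time. I would like to merge them into a single table, repeating the timestamp and timecode columns for every object.

The desired output is shown below (the same structure you get when loading the XML file into Excel):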

type      DISTANCE  VOLUME  id     code  z  timestamp                timecode
Taxi      3037      1668    15593  0        17/09/2014 20:54:59.902  75299902
Taxi      3605      931     15603  4        17/09/2014 20:54:59.902  75299902
Bus       3563      488     15604  9        17/09/2014 20:54:59.900  75299902
Taxi      4942      57      15624  1        17/09/2014 20:54:59.900  75299902
Taxi      784       47      15625  10       17/09/2014 20:54:59.900  75299902
Taxi      3301      2041    15626  42       17/09/2014 20:54:59.900  75299902
Bus       2040      2945    15630  27       17/09/2014 20:54:59.900  75299902
Airplane  2865      2722                 0  17/09/2014 20:54:59.900  75299902
Taxi      4941      51      15624  1        17/09/2014 20:54:59.771  75299771
Taxi      789       47      15625  10       17/09/2014 20:54:59.771  75299771
Taxi      3300      2069    15626  42       17/09/2014 20:54:59.771  75299771
Bus       2027      2947    15630  27       17/09/2014 20:54:59.771  75299771
Airplane  2865      2722                 0  17/09/2014 20:54:59.771  75299771

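Which functions can produce this result? Thanks in advance for your help!

【Comments】:

  • Have you tried xmlParse together with getNodeSet/xpathApply? Once you understand how it works, you can use apply to combine all the objects into a single data frame.

Tags: xml r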


【Solution 1】:

You can use xml2 and dplyr for a quick conversion:

library(xml2)
library(dplyr)

dat <- "<Data>
<Frame timestamp='17/09/2014  20:55:00.902' timecode='75299902' >
<Object type='Taxi' DISTANCE='3037' VOLUME='1668' id='15593' code='0' />
<Object type='Taxi' DISTANCE='3605' VOLUME='931' id='15603' code='4' />
<Object type='Bus' DISTANCE='3563' VOLUME='488' id='15604' code='9' />
<Object type='Taxi' DISTANCE='4942' VOLUME='57' id='15624' code='1' />
<Object type='Taxi' DISTANCE='784' VOLUME='47' id='15625' code='10' />
<Object type='Taxi' DISTANCE='3301' VOLUME='2041' id='15626' code='42' />
<Object type='Bus' DISTANCE='2040' VOLUME='2945' id='15630' code='27' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
<Frame timestamp='17/09/2014 20:54:59.771' timecode='75299771' >
<Object type='Taxi' DISTANCE='4941' VOLUME='51' id='15624' code='1' />
<Object type='Taxi' DISTANCE='789' VOLUME='47' id='15625' code='10' />
<Object type='Taxi' DISTANCE='3300' VOLUME='2069' id='15626' code='42' />
<Object type='Bus' DISTANCE='2027' VOLUME='2947' id='15630' code='27' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
</Data>"

doc <- read_xml(dat)

# bind the data.frames built in the iterator together
bind_rows(lapply(xml_find_all(doc, "//Frame"), function(x) {

  # extract the attributes from the parent tag as a data.frame
  parent <- data.frame(as.list(xml_attrs(x)), stringsAsFactors=FALSE)

  # make a data.frame out of the attributes of the kids
  kids <- bind_rows(lapply(xml_children(x), function(x) as.list(xml_attrs(x))))

  # combine them
  cbind.data.frame(parent, kids, stringsAsFactors=FALSE)

}))

## Source: local data frame [13 x 8]
## 
##                   timestamp timecode     type DISTANCE VOLUME    id  code     Z
##                       (chr)    (chr)    (chr)    (chr)  (chr) (chr) (chr) (chr)
## 1  17/09/2014  20:55:00.902 75299902     Taxi     3037   1668 15593     0    NA
## 2  17/09/2014  20:55:00.902 75299902     Taxi     3605    931 15603     4    NA
## 3  17/09/2014  20:55:00.902 75299902      Bus     3563    488 15604     9    NA
## 4  17/09/2014  20:55:00.902 75299902     Taxi     4942     57 15624     1    NA
## 5  17/09/2014  20:55:00.902 75299902     Taxi      784     47 15625    10    NA
## 6  17/09/2014  20:55:00.902 75299902     Taxi     3301   2041 15626    42    NA
## 7  17/09/2014  20:55:00.902 75299902      Bus     2040   2945 15630    27    NA
## 8  17/09/2014  20:55:00.902 75299902 Airplane     2865   2722    NA    NA     0
## 9   17/09/2014 20:54:59.771 75299771     Taxi     4941     51 15624     1    NA
## 10  17/09/2014 20:54:59.771 75299771     Taxi      789     47 15625    10    NA
## 11  17/09/2014 20:54:59.771 75299771     Taxi     3300   2069 15626    42    NA
## 12  17/09/2014 20:54:59.771 75299771      Bus     2027   2947 15630    27    NA
## 13  17/09/2014 20:54:59.771 75299771 Airplane     2865   2722    NA    NA     0

Every column comes back as character, so you will need to convert the types as needed.
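As a rough illustration of that conversion step, the sketch below uses base R on a two-row stand-in for the result above. The data frame `df`, the choice of integer for the numeric columns, and the `UTC` time zone are assumptions made for the example; they are not part of the original answer.

```r
# Hypothetical two-row stand-in for the all-character result shown above.
df <- data.frame(
  timestamp = c("17/09/2014  20:55:00.902", "17/09/2014 20:54:59.771"),
  timecode  = c("75299902", "75299771"),
  type      = c("Taxi", "Airplane"),
  DISTANCE  = c("3037", "2865"),
  VOLUME    = c("1668", "2722"),
  id        = c("15593", NA),
  code      = c("0", NA),
  Z         = c(NA, "0"),
  stringsAsFactors = FALSE
)

# Columns assumed to hold whole numbers; missing attributes stay NA.
num_cols <- c("timecode", "DISTANCE", "VOLUME", "id", "code", "Z")
typed <- df
typed[num_cols] <- lapply(typed[num_cols], as.integer)

# Day/month/year with fractional seconds; whitespace in the format also
# matches the double space in the first timestamp. The time zone is assumed.
typed$timestamp <- as.POSIXct(typed$timestamp,
                              format = "%d/%m/%Y %H:%M:%OS", tz = "UTC")

str(typed)
```

To display the milliseconds when printing, something like `options(digits.secs = 3)` can be set beforehand.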

If you are stuck with the XML package, you can do something similar:

library(XML)

doc <- xmlParse(dat, asText=TRUE)

bind_rows(xpathApply(doc, "//Frame", function(x) {
  parent <- data.frame(as.list(xmlAttrs(x)), stringsAsFactors=FALSE)
  kids <- bind_rows(lapply(xmlChildren(x), function(x) as.list(xmlAttrs(x))))
  cbind.data.frame(parent, kids, stringsAsFactors=FALSE)
}))

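【Comments】:

  • This worked as-is on XML with the same structure; I only had to change the tag names. Fantastic!
  • Nice solution. The only tweak might be to use purrr::map_df instead of bind_rows(lapply(, which would be more modern tidyverse style.
【Solution 2】: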

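Try

data <- xmlParse(file="c:/R/CL/filename.xml")

And something like:

sapply(getNodeSet(data, "//Frame/Object[@type]"), xmlGetAttr, "type")

It should give you a vector with the type attribute of every Object node under a Frame node (xmlValue would return the text content, which is empty for these attribute-only elements). More here: http://www.w3schools.com/xsl/xpath_syntax.asp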
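A small self-contained sketch of that XPath approach follows. The inline `xml_text` sample is an assumption, trimmed from the question's data so the example runs without a file on disk; the second query shows how an attribute predicate narrows the node set.

```r
library(XML)

# Assumed inline sample, trimmed from the question's data.
xml_text <- "<Data>
<Frame timestamp='17/09/2014 20:54:59.771' timecode='75299771'>
<Object type='Taxi' DISTANCE='4941' VOLUME='51' id='15624' code='1' />
<Object type='Bus' DISTANCE='2027' VOLUME='2947' id='15630' code='27' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
</Data>"
doc <- xmlParse(xml_text, asText = TRUE)

# The type attribute of every Object under a Frame, in document order.
all_types <- xpathSApply(doc, "//Frame/Object", xmlGetAttr, "type")

# An attribute predicate restricts the match to Bus objects only.
bus_distance <- xpathSApply(doc, "//Frame/Object[@type='Bus']",
                            xmlGetAttr, "DISTANCE")

print(all_types)
print(bus_distance)
```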

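【Comments】:

    【Solution 3】:

    Consider the XML library's xpathSApply() route, with a workaround to retrieve timestamp and timecode for each child and to handle the missing id and code attributes: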

    library(XML)
    
    doc <- xmlParse("C:/Path/To/XML/File.xml")
    
    # RETRIEVE FRAME ATTRS DATA FOR EACH OBJECT CHILD
    # (//Object)[i] IS THE i-TH OBJECT IN DOCUMENT ORDER; WITHOUT THE PARENTHESES,
    # //Object[i] MATCHES THE i-TH OBJECT INSIDE EVERY FRAME AND MISALIGNS THE ROWS
    timestamp <- c()
    timecode <- c()
    numberofobjs <- length(xpathSApply(doc, "//Object"))
    for (i in seq_len(numberofobjs)) {
        timestamp <- c(timestamp, xpathSApply(doc, sprintf("(//Object)[%s]/ancestor::Frame", i), 
                                              xmlGetAttr, "timestamp"))
        timecode <- c(timecode, xpathSApply(doc, sprintf("(//Object)[%s]/ancestor::Frame", i), 
                                            xmlGetAttr, "timecode"))
    }
    
    # XPATH TO EACH ATTRIBUTE
    type <- xpathSApply(doc, "//Object", xmlGetAttr, "type")
    distance <- xpathSApply(doc, "//Object", xmlGetAttr,"DISTANCE")
    volume <- xpathSApply(doc, "//Object", xmlGetAttr, "VOLUME")
    id <- xpathSApply(doc, "//Object", xmlGetAttr, "id")
    id <- sapply(id, function(x) ifelse(is.null(x), NA, x))     # REMOVE NULLS
    code <- xpathSApply(doc, "//Object", xmlGetAttr, "code")
    code <- sapply(code, function(x) ifelse(is.null(x), NA, x))   # REMOVE NULLS
    
    # COMBINE LISTS INTO DATA FRAME
    xmldf <- data.frame(timecode = unlist(timecode), 
                        timestamp = unlist(timestamp), 
                        type = unlist(type), 
                        distance = unlist(distance), 
                        volume = unlist(volume), 
                        id = unlist(id), 
                        code = unlist(code))
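
A possible loop-free variant of the same idea is sketched below; it is not the answerer's code. It reads each Object's parent Frame with `xmlParent()` and uses the `default` argument of `xmlGetAttr()` to fill missing attributes with NA. The helper names `obj_attr` and `frame_attr` and the inline sample are hypothetical, introduced only for illustration.

```r
library(XML)

# Assumed inline sample in the same shape as the question's file.
xml_text <- "<Data>
<Frame timestamp='17/09/2014 20:54:59.771' timecode='75299771'>
<Object type='Taxi' DISTANCE='4941' VOLUME='51' id='15624' code='1' />
<Object type='Airplane' DISTANCE='2865' VOLUME='2722' Z='0' />
</Frame>
</Data>"
doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: one attribute per Object, NA when it is absent.
obj_attr <- function(name) {
  unname(unlist(xpathSApply(doc, "//Object", function(node) {
    xmlGetAttr(node, name, default = NA_character_)
  })))
}

# Hypothetical helper: one attribute of each Object's parent Frame.
frame_attr <- function(name) {
  unname(unlist(xpathSApply(doc, "//Object", function(node) {
    xmlGetAttr(xmlParent(node), name, default = NA_character_)
  })))
}

xmldf <- data.frame(
  timestamp = frame_attr("timestamp"),
  timecode  = frame_attr("timecode"),
  type      = obj_attr("type"),
  distance  = obj_attr("DISTANCE"),
  volume    = obj_attr("VOLUME"),
  id        = obj_attr("id"),
  code      = obj_attr("code"),
  stringsAsFactors = FALSE
)
print(xmldf)
```

Because every helper walks the same `//Object` node set in document order, the columns stay aligned row by row without indexing into the XPath.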
    

    【Comments】:
