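[Posted]: 2015-02-15 13:01:04
[Question]:
The following code parses XML so that node, parent, type and similar information can be extracted into a data frame. It works for small XML files, but with files of more than 25,000 lines the processing takes several minutes. I therefore want to optimize the code so that it runs faster. The purpose of the function is to read any XML file and generate data in the layout the data frame requires.
Sample XML: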
<?xml version="1.0" encoding="UTF-8"?>
<CATALOG>
<PLANT id="1" required="false">
<COMMON Source="NLM">Bloodroot</COMMON>
<BOTANICAL>Aquilegia canadensis</BOTANICAL>
<DATE>
<Year>2013</Year>
</DATE>
</PLANT>
<PLANT id="2" required="true">
<COMMON Source="LNP">Columbine</COMMON>
<BOTANICAL>Aquilegia canadensis</BOTANICAL>
<DATE>
<Year>2014</Year>
</DATE>
</PLANT>
</CATALOG>
Output:
   path                    node      value                parent  type
1  CATALOG                 CATALOG   NULL                 NULL    element
2  CATALOG/PLANT           PLANT     NULL                 CATALOG element
3  CATALOG/PLANT           id        1                    PLANT   attribute
4  CATALOG/PLANT           required  false                PLANT   attribute
5  CATALOG/PLANT/COMMON    COMMON    Bloodroot            PLANT   text
6  CATALOG/PLANT/COMMON    Source    NLM                  COMMON  attribute
7  CATALOG/PLANT/BOTANICAL BOTANICAL Aquilegia canadensis PLANT   text
8  CATALOG/PLANT/DATE      DATE      NULL                 PLANT   element
9  CATALOG/PLANT/DATE/Year Year      2013                 DATE    text
10 CATALOG/PLANT           PLANT     NULL                 CATALOG element
11 CATALOG/PLANT           id        2                    PLANT   attribute
12 CATALOG/PLANT           required  true                 PLANT   attribute
13 CATALOG/PLANT/COMMON    COMMON    Columbine            PLANT   text
14 CATALOG/PLANT/COMMON    Source    LNP                  COMMON  attribute
15 CATALOG/PLANT/BOTANICAL BOTANICAL Aquilegia canadensis PLANT   text
16 CATALOG/PLANT/DATE      DATE      NULL                 PLANT   element
17 CATALOG/PLANT/DATE/Year Year      2014                 DATE    text
Code snippet:
library(XML)
library(plyr)

## helper function of xPathApply
getValues <- function(x) {
  List <- list()
  # find all ancestors of a given node
  ancestorNames <- character()
  ancestorNamesList <- xmlAncestors(x, fun = function(y) {
    ancestorNames <- c(ancestorNames, xmlName(y))})
  pathName <- paste(ancestorNamesList, collapse = "/")
  # find the parent of a given node
  parentNode <- xmlParent(x)
  parentName <- "NULL"
  if(!is.null(parentNode)) {
    parentName <- xmlName(parentNode)
  }
  if(inherits(x, "XMLInternalElementNode")) {
    # check if the value of the given node exists i.e. text
    if(length(xmlValue(x, recursive=FALSE)) != 0) {
      List <- append(List, list(path = pathName, node = xmlName(x), value = xmlValue(x, recursive=FALSE), parent = parentName, type = "text"))
    } else {
      List <- append(List, list(path = pathName, node = xmlName(x), value = "NULL", parent = parentName, type = "element"))
    }
  }
  ## attributes
  if(!is.null(xmlAttrs(x))) {
    num.attributes = xmlSize(xmlAttrs(x))
    for (i in seq_len(num.attributes)) {
      # get the attribute name
      attributeName <- names(xmlAttrs(x)[i])
      # get the attribute value
      attributeValue <- xmlAttrs(x)[[i]]
      List <- append(List, list(path = pathName, node = attributeName, value = attributeValue, parent = parentName, type = "attribute"))
    }
  }
  return(List)
}

## recursive function
visitNode <- function(node, xpath) {
  if (is.null(node)) {
    return()
  }
  # number of children of a node
  num.children <- xmlSize(node)
  bypass <- function(n = num.children) {
    if(num.children == 0) {
      xpathSApply(node, path = xpath, getValues)
    } else {
      return(num.children)
    }
  }
  # recursive call to visitNode
  for (i in seq_len(num.children)) {
    visitNode(node[[i]], xpath)
  }
  # add list type result to data frame
  if(is.list(result <- bypass())) {
    dt <<- do.call(rbind.fill, lapply(result, data.frame))
  }
}

# read XML data from the given file
xtree <- xmlParse("test.xml")
# retrieve the root of the XML
root <- xmlRoot(xtree)
# define data frame which is to hold the data interpreted from XML
dt <- data.frame(path = NA, node = NA, value = NA, parent = NA, type = NA)
# call to recursive function
visitNode(root, xpath <- "//node()")
dt
[Comments]:
-
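The main inefficiency I see is that the `List` object is not pre-dimensioned. Extending a list with c() can be very inefficient, and using sapply will not cure this pathology. See whether `List <- list(xmlSize(xmlAttrs(x)) )` followed by simply indexing the list by `i` makes things go faster.
-
`List[[length(List)+1]]` is wrong. It should be `List[[i]]`. I tried this on a sample XML and it returned an empty list.
-
Please, Richard, let people learn about pre-dimensioning.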
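
The pre-dimensioning idea discussed in these comments can be sketched in base R as follows. This is only an illustration under stated assumptions: the `attrs` vector is a hypothetical stand-in for what `xmlAttrs(x)` might return for one node, and `rows` and `result` are names chosen for this example. Note that `list(n)` creates a one-element list containing `n`, whereas `vector("list", n)` creates an empty list with `n` slots, which is the usual way to pre-allocate.

```r
# Hypothetical stand-in for the attributes of a single node
# (in the question's code this would come from xmlAttrs(x))
attrs <- c(id = "1", required = "false")

# Pre-allocate one slot per attribute instead of growing the list with append()
rows <- vector("list", length(attrs))

for (i in seq_along(attrs)) {
  # Fill slot i in place; the list is never copied and extended
  rows[[i]] <- data.frame(
    path = "CATALOG/PLANT",
    node = names(attrs)[i],
    value = attrs[[i]],
    parent = "PLANT",
    type = "attribute",
    stringsAsFactors = FALSE
  )
}

# Bind all rows in one call rather than once per row
result <- do.call(rbind, rows)
```

Filling by index and binding once at the end avoids repeatedly copying a growing object, which is the cost the first comment points at.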
-
Yes, but it is hard to tell how many attributes an XML document may have.
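
One possible way around this, sketched here as an assumption rather than as the approach used in the thread, is to count the attributes in a single preliminary pass and size the list from that count. The inline `xml_text` is a shortened version of the sample XML from the question, and `attr_counts`, `total_attrs` and `rows` are names invented for this example.

```r
library(XML)

# Shortened inline document modelled on the sample XML in the question
xml_text <- '<CATALOG>
  <PLANT id="1" required="false">
    <COMMON Source="NLM">Bloodroot</COMMON>
  </PLANT>
  <PLANT id="2" required="true">
    <COMMON Source="LNP">Columbine</COMMON>
  </PLANT>
</CATALOG>'
doc <- xmlParse(xml_text, asText = TRUE)

# Number of attributes on each element;
# xmlAttrs() returns NULL for an element without attributes, so length() is 0
attr_counts <- xpathSApply(doc, "//*", function(n) length(xmlAttrs(n)))

# Total number of attribute rows the result will need
total_attrs <- sum(attr_counts)

# Pre-allocated container sized from the count
rows <- vector("list", total_attrs)
```

Whether the extra pass pays off depends on the document; collecting one list per node and binding once at the end is another option that needs no count at all.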
-
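I have to say I have wanted to help with your XML questions, but they are so unclear that I get frustrated and give up. This one is much the same, because you have not shown any sample data or the desired result. There have to be some rules when working with XML, because many nodes are completely different, so the result of this function may not be what you want. If you could add more context to this question, that would be great.
Tags: xml r xml-parsing xml-attribute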