Storing specific XML node values with R's xmlEventParse

【Question title】: Storing specific XML node values with R's xmlEventParse
【Posted】: 2011-11-24 02:30:53
【Question】:

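I have a large XML file that I need to parse with xmlEventParse in R. Unfortunately, the examples online are more complex than I need. I just want to flag a matching node tag so that I can store the matched node's text (not its attributes), with the text for each tag in a separate list. See the comments in the code below: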

library(XML)
z <- xmlEventParse(
    "my.xml", 
    handlers = list(
        startDocument   =   function() 
        {
                cat("Starting document\n")
        },  
        startElement    =   function(name,attr) 
        {
                if ( name == "myNodeToMatch1" ){
                    cat("FLAG Matched element 1\n")
                }
                if ( name == "myNodeToMatch2" ){
                    cat("FLAG Matched element 2\n")
                }
        },
        text            =   function(text) {
                if ( # Matched element 1 .... )
                    # Store text in element 1 list
                if ( # Matched element 2 .... )
                    # Store text in element 2 list
        },
        endDocument     =   function() 
        {
                cat("ending document\n")
        }
    ),
    addContext = FALSE,
    useTagName = FALSE,
    ignoreBlanks = TRUE,
    trim = TRUE)
z$ ... # show lists ??

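My question is: how do I implement this flag in R (in a professional way :)? Also: what is the best way to evaluate N arbitrary nodes to match ... if name = "myNodeToMatchN" ... while avoiding case matching on the nodes?

my.xml could be just a simple XML file such as: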

<A>
  <myNodeToMatch1>Text in NodeToMatch1</myNodeToMatch1>
  <B>
    <myNodeToMatch2>Text in NodeToMatch2</myNodeToMatch2>
    ...
  </B>
</A>

【Comments】:

It would be nice to have "my.xml" handy to try this out...

Tags: r xml-parsing sax

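【Solution 1】:

I'll use fileName from example(xmlEventParse) as a reproducible example. It has record tags, each with an id attribute and text that we'd like to extract. Rather than using handlers, I'll use the branches argument. A branch function is like a handler, but it has access to the full node rather than just the element. The idea is to write a closure that has a place to keep the data we accumulate, plus a function to process each branch of the XML document that we are interested in. So let's start by defining the closure, which for our purposes is a function that returns a list of functions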

ourBranches <- function() {

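We need a place to store the results we accumulate. We choose an environment so that insertion time is constant (rather than a list, which we would have to append to and which would be memory inefficient)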

    store <- new.env()

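The event parser expects a list of functions to invoke when a matching tag is discovered. We're interested in the record tag. The function we write receives a node of the XML document. We extract the node's id attribute and use it as the key under which we store the node's (text) value. We add these to our store.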

    record <- function(x, ...) {
        key <- xmlAttrs(x)[["id"]]
        value <- xmlValue(x)
        store[[key]] <- value
    }

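Once the document is processed, we'd like a convenient way to retrieve our results, so we add a function for our own purposes, independent of the nodes in the document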

    getStore <- function() as.list(store)

and then finish the closure by returning a list of functions

    list(record=record, getStore=getStore)
}

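The tricky concept here is that the environment in which a function is defined is part of the function, so each time we call ourBranches() we get a list of functions and a new environment store to keep our results. To use it, invoke xmlEventParse on our file with an empty set of event handlers, then access our accumulated store.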

> branches <- ourBranches()
> xmlEventParse(fileName, list(), branches=branches)
list()
> head(branches$getStore(), 2)
$`Hornet Sportabout`
[1] "18.7   8 360.0 175 3.15 3.440 17.02  0  0    3 "

$`Toyota Corolla`
[1] "33.9   4  71.1  65 4.22 1.835 19.90  1  1    4 "
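The point about each call getting its own environment can be checked without any XML at all. The following is a minimal base R sketch; the names makeStore and add are hypothetical and only mirror the shape of ourBranches() above.

```r
# Hypothetical stand-in for ourBranches(): a closure with a private store.
makeStore <- function() {
  store <- new.env()                               # fresh environment per call
  add <- function(key, value) store[[key]] <- value
  getStore <- function() as.list(store)
  list(add = add, getStore = getStore)
}

first  <- makeStore()
second <- makeStore()

first$add("a", "some text")

# Only the first closure has seen the value; the second store is still empty.
nFirst  <- length(first$getStore())
nSecond <- length(second$getStore())
```

Because the two stores are independent, the same branch functions can be reused for several documents without results leaking from one parse into the next.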

【Discussion】:

【Solution 2】:

For others who may be trying to learn from M.Morgan's answer, here is the complete code

library(XML)

fileName = system.file("exampleData", "mtcars.xml", package = "XML")

ourBranches <- function() {
  store <- new.env() 
  record <- function(x, ...) {
    key <- xmlAttrs(x)[["id"]]
    value <- xmlValue(x)
    store[[key]] <- value
  }
  getStore <- function() as.list(store)
  list(record=record, getStore=getStore)
}

branches <- ourBranches()
xmlEventParse(fileName, list(), branches=branches)
head(branches$getStore(), 2)
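To get back to the original question (N arbitrary tag names, each with its own list of text values), the same closure idea might be generalized by building the branches list from a character vector of tag names. This is only a sketch: collectTags and makeBranch are hypothetical names, and the temporary file simply reproduces the sample my.xml from the question without its elided "..." line.

```r
library(XML)

# Hypothetical helper: one branch function per tag name, all sharing a store.
collectTags <- function(tags) {
  store <- new.env()
  makeBranch <- function(tag) {
    force(tag)                      # fix the tag name for this branch function
    function(x, ...) {
      # Append the node's text to the values already seen for this tag.
      store[[tag]] <- c(store[[tag]], xmlValue(x))
    }
  }
  branches <- setNames(lapply(tags, makeBranch), tags)
  getStore <- function() lapply(setNames(tags, tags), function(tag) store[[tag]])
  list(branches = branches, getStore = getStore)
}

# Recreate the question's sample document in a temporary file.
xmlFile <- tempfile(fileext = ".xml")
writeLines(c("<A>",
             "  <myNodeToMatch1>Text in NodeToMatch1</myNodeToMatch1>",
             "  <B>",
             "    <myNodeToMatch2>Text in NodeToMatch2</myNodeToMatch2>",
             "  </B>",
             "</A>"), xmlFile)

collector <- collectTags(c("myNodeToMatch1", "myNodeToMatch2"))
invisible(xmlEventParse(xmlFile, handlers = list(), branches = collector$branches))
result <- collector$getStore()
```

Here result is a named list with one character vector per requested tag, so no chain of if (name == ...) tests is needed.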

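【Discussion】:

【Solution 3】:

The branches method does not preserve the order of events. In other words, the order of the "record" entries returned by branches$getStore() differs from their order in the original XML file. The handler method, on the other hand, can preserve the order. Here is the code: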

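library(XML)

fileName <- system.file("exampleData", "mtcars.xml", package="XML")
records <- list()          # text of each <record>, keyed by its id attribute
variable <- character()    # text of each <variable> element
tag.open <- character()    # stack of currently open tags
nvar <- 0                  # number of <variable> elements seen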
xmlEventParse(fileName, list(startElement = function (name, attrs) {
  tagName <<- name
  tag.open <<- c(name, tag.open)
  if (length(attrs)) {
    attributes(tagName) <<- as.list(attrs)
  }
}, text = function (x) {
  if (nchar(x) > 0) {
    if (tagName == "record") {
      record <- list()
      record[[attributes(tagName)$id]] <- x
      records <<- c(records, record)
    } else {
      if( tagName == 'variable') {
        v <- x
        variable <<- c( variable, v)
        nvar <<- nvar + 1
      }
    }
  }
}, endElement = function (name) {
  if( name == 'record') {
    print(paste(tag.open, collapse='>'))
  }
  tag.open <<- tag.open[-1]
}))

head(records,2)
$`Mazda RX4`
[1] "21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4"

$`Mazda RX4 Wag`
[1] "21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4"

variable
[1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"

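Another benefit of using handlers is that the hierarchy can be captured. In other words, it is also possible to save the ancestors. One key point of this process is the use of global variables, which can be used with "

【Discussion】: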
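The handler approach does not have to rely on variables in the global environment. The state can live in a closure instead, as in the branches answer. Below is a rough sketch under that assumption, applied to the sample my.xml from the question. The name makeHandlers is hypothetical, and the text is buffered until endElement on the assumption that a SAX parser may deliver one element's text in several pieces.

```r
library(XML)

# Hypothetical helper (not part of the XML package): builds SAX handlers
# that collect the text of every element whose tag name is in `tags`.
makeHandlers <- function(tags) {
  openTags <- character()  # stack of currently open tags, innermost first
  buffer   <- ""           # text gathered for the matched element being read
  values   <- list()       # one character vector per matched tag

  startElement <- function(name, attrs, ...) {
    openTags <<- c(name, openTags)
    if (name %in% tags) buffer <<- ""
  }
  text <- function(x, ...) {
    # The "flag": keep text only when the innermost open tag is one we want.
    if (length(openTags) > 0 && openTags[1] %in% tags) {
      buffer <<- paste0(buffer, x)
    }
  }
  endElement <- function(name, ...) {
    if (name %in% tags) values[[name]] <<- c(values[[name]], buffer)
    openTags <<- openTags[-1]
  }

  list(handlers = list(startElement = startElement, text = text,
                       endElement = endElement),
       getValues = function() values)
}

# Recreate the question's sample document in a temporary file.
xmlFile <- tempfile(fileext = ".xml")
writeLines(c("<A>",
             "  <myNodeToMatch1>Text in NodeToMatch1</myNodeToMatch1>",
             "  <B>",
             "    <myNodeToMatch2>Text in NodeToMatch2</myNodeToMatch2>",
             "  </B>",
             "</A>"), xmlFile)

h <- makeHandlers(c("myNodeToMatch1", "myNodeToMatch2"))
invisible(xmlEventParse(xmlFile, handlers = h$handlers,
                        addContext = FALSE, ignoreBlanks = TRUE, trim = TRUE))
values <- h$getValues()
```

Values are appended in the order their closing tags are encountered, so document order is kept within each tag's vector, and the openTags stack is available if the ancestors of a match are also needed.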