【Question title】: Parse XML with getNodeSet - Identify missing tags
【Posted】: 2013-08-20 12:52:52
【Question】:

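I am parsing an XML file with getNodeSet(). Suppose I have an XML file from a bookstore that lists 4 different books, but for one of the books the "authors" tag is missing.

If I parse the XML for the tag "authors" using data.nodes.2 <- getNodeSet(data,'//*/authors'), R returns a list with 3 elements.

However, this is not what I want. How can I make getNodeSet() return a list with 4 elements instead of 3, i.e. with one element holding a missing value for the book where the "authors" tag does not exist?

Thanks for your help.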

library(XML)

file <- "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\r\n<!-- Edited by XMLSpy® -->\r\n<bookstore>\r\n<book category=\"cooking\">\r\n<title lang=\"en\">Everyday Italian</title>\r\n<authors>\r\n<author>Giada De Laurentiis</author>\r\n</authors>\r\n<year>2005</year>\r\n<price>30.00</price>\r\n</book>\r\n<book category=\"children\">\r\n<title lang=\"en\">Harry Potter</title>\r\n<authors>\r\n<author>J K. Rowling</author>\r\n</authors>\r\n<year>2005</year>\r\n<price>29.99</price>\r\n</book>\r\n<book category=\"web\">\r\n<title lang=\"en\">XQuery Kick Start</title>\r\n<authors>\r\n<author>James McGovern</author>\r\n<author>Per Bothner</author>\r\n<author>Kurt Cagle</author>\r\n<author>James Linn</author>\r\n<author>Vaidyanathan Nagarajan</author>\r\n</authors>\r\n<year>2003</year>\r\n<price>49.99</price>\r\n</book>\r\n<book category=\"web\" cover=\"paperback\">\r\n<title lang=\"en\">Learning XML</title>\r\n\r\n<year>2003</year>\r\n<price>39.95</price>\r\n</book>\r\n</bookstore>"

data <- xmlParse(file)

data.nodes.1 <- getNodeSet(data,'//*/book')

data.nodes.2 <- getNodeSet(data,'//*/authors')


# Data

# <?xml version="1.0" encoding="ISO-8859-1"?>
# <!-- Edited by XMLSpy® -->
# <bookstore>
#   <book category="cooking">
#     <title lang="en">Everyday Italian</title>
#     <authors>
#       <author>Giada De Laurentiis</author>
#     </authors>
#     <year>2005</year>
#     <price>30.00</price>
#   </book>
#   <book category="children">
#     <title lang="en">Harry Potter</title>
#     <authors>
#       <author>J K. Rowling</author>
#     </authors>
#     <year>2005</year>
#     <price>29.99</price>
#   </book>
#   <book category="web">
#     <title lang="en">XQuery Kick Start</title>
#     <authors>
#       <author>James McGovern</author>
#       <author>Per Bothner</author>
#       <author>Kurt Cagle</author>
#       <author>James Linn</author>
#       <author>Vaidyanathan Nagarajan</author>
#     </authors>
#     <year>2003</year>
#     <price>49.99</price>
#   </book>
#   <book category="web" cover="paperback">
#     <title lang="en">Learning XML</title>
#     <year>2003</year>
#     <price>39.95</price>
#   </book>
# </bookstore>

【Comments】:

    Tags: xml r parsing xml-parsing


    【Solution 1】:

    One option is to use R's list processing to extract the authors from each book node

    books <- getNodeSet(data, "//book")   # `data` is the parsed document from the question
    authors <- lapply(books, xpathSApply, ".//author", xmlValue)
    authors[sapply(authors, is.list)] <- NA
    

    and then to pull book-level information (here the title) so that every author, or NA, can be paired with its book

    title <- sapply(books, xpathSApply, "string(.//title/text())")
    

    which gives

    >     data.frame(Title=rep(title, sapply(authors, length)),
    +                Author=unlist(authors))
                  Title                 Author
    1  Everyday Italian    Giada De Laurentiis
    2      Harry Potter           J K. Rowling
    3 XQuery Kick Start         James McGovern
    4 XQuery Kick Start            Per Bothner
    5 XQuery Kick Start             Kurt Cagle
    6 XQuery Kick Start             James Linn
    7 XQuery Kick Start Vaidyanathan Nagarajan
    8      Learning XML                   <NA>
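
If what you need is literally a list with one entry per book, as the question asks, the same per-book idea can be sketched as follows. This is only an illustration under assumptions: the two-book document and the names `xml_string`, `authors_nodes` and `has_authors` are made up for the example and are not part of the original answer.

```r
library(XML)

# Small stand-in document: the second book has no <authors> element.
xml_string <- "<bookstore>
  <book><title>Everyday Italian</title><authors><author>Giada De Laurentiis</author></authors></book>
  <book><title>Learning XML</title></book>
</bookstore>"

doc   <- xmlParse(xml_string, asText = TRUE)
books <- getNodeSet(doc, "//book")

# One entry per <book>: the <authors> node if present, otherwise NA.
authors_nodes <- lapply(books, function(book) {
  hit <- getNodeSet(book, "./authors")   # XPath evaluated relative to this book
  if (length(hit) == 0) NA else hit[[1]]
})

# TRUE where a real node was found, FALSE where the placeholder NA was stored.
has_authors <- sapply(authors_nodes, inherits, what = "XMLInternalNode")
```

Because the XPath is evaluated once per book rather than once over the whole document, the result always has the same length as `books`, and the missing tag shows up as an explicit NA instead of silently shortening the list.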
    

    【Comments】:

    • Thanks. Exactly what I needed.
    【Solution 2】:

    Here is an xml2 approach.

    The code is readable and therefore easy to maintain.

    Code

    library( xml2 )
    library( dplyr )   # provides %>% and left_join()
    
    #read the xml (here `file` holds the XML as a string, see the sample data)
    data <- xml2::read_xml( file )
    
    #get all book-titles and store them in a data.frame
    books <- data.frame( 
      title = xml_find_all( data, ".//book/title" ) %>% xml_text(),
      stringsAsFactors = FALSE
      )
    
    #find all author-nodes
    authors      <- xml_find_all( data, ".//author" )
    
    #create a data.frame with all authors and the book they wrote
    authors <- data.frame( 
      #for each author-node, get the title from its ancestor-node (i.e. book)
      title  = xml_find_first( authors, "ancestor::book/title") %>% xml_text(),
      #get the text from the author-node
      author = xml_text( authors ),
      stringsAsFactors = FALSE
      )
    
    #left_join the books with the authors; books without authors get NA
    left_join( books, authors, by = "title")
    

    Output

    #               title                 author
    # 1  Everyday Italian    Giada De Laurentiis
    # 2      Harry Potter           J K. Rowling
    # 3 XQuery Kick Start         James McGovern
    # 4 XQuery Kick Start            Per Bothner
    # 5 XQuery Kick Start             Kurt Cagle
    # 6 XQuery Kick Start             James Linn
    # 7 XQuery Kick Start Vaidyanathan Nagarajan
    # 8      Learning XML                   <NA>
    

    Sample data

    file <- '<?xml version="1.0" encoding="ISO-8859-1"?>
    <!-- Edited by XMLSpy® -->
    <bookstore>
      <book category="cooking">
        <title lang="en">Everyday Italian</title>
        <authors>
          <author>Giada De Laurentiis</author>
        </authors>
        <year>2005</year>
        <price>30.00</price>
      </book>
      <book category="children">
        <title lang="en">Harry Potter</title>
        <authors>
          <author>J K. Rowling</author>
        </authors>
        <year>2005</year>
        <price>29.99</price>
      </book>
      <book category="web">
        <title lang="en">XQuery Kick Start</title>
        <authors>
          <author>James McGovern</author>
          <author>Per Bothner</author>
          <author>Kurt Cagle</author>
          <author>James Linn</author>
          <author>Vaidyanathan Nagarajan</author>
        </authors>
        <year>2003</year>
        <price>49.99</price>
      </book>
      <book category="web" cover="paperback">
        <title lang="en">Learning XML</title>
        <year>2003</year>
        <price>39.95</price>
      </book>
    </bookstore>'
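
If a join feels like more machinery than needed, xml2 can also report the missing tag directly. The sketch below relies on the documented behaviour that `xml_find_first()` returns one result per input node and `xml_text()` gives NA for a missing node. The two-book document and the names `xml_string`, `doc` and `first_author` are assumptions made for this example, and only the first author of each book is extracted.

```r
library(xml2)

# Small stand-in document: the second book has no <authors> element.
xml_string <- "<bookstore>
  <book><title>Everyday Italian</title><authors><author>Giada De Laurentiis</author></authors></book>
  <book><title>Learning XML</title></book>
</bookstore>"

doc   <- read_xml(xml_string)
books <- xml_find_all(doc, "//book")

# xml_find_first() keeps the output the same length as `books`, inserting
# a missing node where nothing matches; xml_text() turns that into NA.
first_author <- xml_text(xml_find_first(books, "./authors/author"))

data.frame(
  title        = xml_text(xml_find_first(books, "./title")),
  first_author = first_author,
  stringsAsFactors = FALSE
)
```

This gives one row per book without any join, at the cost of dropping co-authors; for the full author list the join shown above remains the more complete route.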
    

    【Comments】:

      【Solution 3】:

      You can use plyr

      library(plyr)
      > ldply(xpathApply(data, '//book', getChildrenStrings), rbind)
                    title                                                             authors year price
      1  Everyday Italian                                                 Giada De Laurentiis 2005 30.00
      2      Harry Potter                                                        J K. Rowling 2005 29.99
      3 XQuery Kick Start James McGovernPer BothnerKurt CagleJames LinnVaidyanathan Nagarajan 2003 49.99
      4      Learning XML                                                                <NA> 2003 39.95
      

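      【Comments】:

      • Thanks for your help. Unfortunately I cannot use your solution, for one reason: the authors are not separated properly, i.e. the last letter of the first author's name runs directly into the first letter of the second author's name.
      【Solution 4】:

      You could also try xmlToDataFrame from the XML package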

      x <- xmlToDataFrame(data)   # `data` is the parsed document from the question
      

      If you don't like the authors being mashed together, you can sometimes fix that with pattern matching

      x$authors <- gsub("([a-z]{2})([A-Z])", "\\1, \\2", x$authors)
      x
                    title                                                                     authors year price
      1  Everyday Italian                                                         Giada De Laurentiis 2005 30.00
      2      Harry Potter                                                                J K. Rowling 2005 29.99
      3 XQuery Kick Start James McGovern, Per Bothner, Kurt Cagle, James Linn, Vaidyanathan Nagarajan 2003 49.99
      4      Learning XML                                                                        <NA> 2003 39.95
      

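      Other options are to loop through the book nodes (see ?getNodeSet on creating and freeing sub-nodes) or to follow Martin's answer (the one that builds `authors` and `title`). If you want 4 rows, one per book, try this with the objects from that answer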

      authors <- sapply(authors, paste, collapse=",")
      data.frame(title, authors)
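
The pattern-matching fix is fragile (a surname such as "MacDonald" would also be split), so a self-contained version of the per-book collapse may be the safer route. The following is a minimal sketch, not the original answer's code: the shortened two-book document and the names `xml_string`, `collapse_authors` and `result` are hypothetical.

```r
library(XML)

# Small stand-in document: the second book has no <authors> element.
xml_string <- "<bookstore>
  <book><title>XQuery Kick Start</title><authors><author>James McGovern</author><author>Per Bothner</author></authors></book>
  <book><title>Learning XML</title></book>
</bookstore>"

doc   <- xmlParse(xml_string, asText = TRUE)
books <- getNodeSet(doc, "//book")

# Join the authors of one book with an explicit separator;
# return NA when the book has no <author> elements at all.
collapse_authors <- function(book) {
  a <- xpathSApply(book, "./authors/author", xmlValue)
  if (length(a) == 0) NA_character_ else paste(a, collapse = ", ")
}

result <- data.frame(
  title   = sapply(books, function(b) xpathSApply(b, "string(./title)")),
  authors = sapply(books, collapse_authors),
  stringsAsFactors = FALSE
)
result
```

Since the separator is inserted between actual `<author>` nodes, no guessing about capitalisation is needed, and the book without authors keeps its own row with NA.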
      

      【Comments】:
