Parsing XML files in R: Complex Structure

【Question title】: parsing XML files in R: Complex Structure
【Posted】: 2015-08-26 14:41:52
【Question description】:

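I need to parse an XML file into a simple data frame. The problem is that the file is complex: several authors can be associated with one patent, a patent can carry several tags, and the author-institution association is not necessarily one-to-one. What else? Names are split into two fields, given name and surname, and I need to retrieve them without mixing up names between authors.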

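Unlike the problem described here (How to transform XML data into a data.frame?), the association is three-way: authors/inventors with patents, patents with tags, and authors with institutions. Second, the institutions are not always reported the same way. If all authors are affiliated with a single institution, that institution appears only once, in the "aff" tag (see the example below). The usual problem of multiple authors per item remains.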

The challenge is also similar to the one posed here (Parsing XML file with known structure and repeating elements), but as you can see, the layout is completely different, to put it mildly.

Synthetic data sample

<patents>
      <patent patno="101103062330">
        <office coden="EPO"  short="Eur. Pat. Office"> European Office    </office>
        <volume>80    </volume>
        <issue printdate="2009-12-00">6    </issue>
        <numpages>13    </numpages>
        <section code="A-2D"> Filtering    </section>
        <patno>101103062330    </patno>
        <title> trapping plastic waste    </title>
        <authgrp>
          <author>
            <givenname>Endo    </givenname>
            <surname>Wake    </surname>
          </author>
          <author>
            <givenname>C.    </givenname>
            <surname>Morde    </surname>
          </author>
          <aff> University of M, USA    </aff>
        </authgrp>
        <history>
          <received date="2009-07-01"/>
          <published date="2009-07-30"/>
        </history>
        <tag tagyr="2009">
          <tagcode>B1.C2.B5    </tagcode>
          <tagcode>F4.65.F6    </tagcode>
        </tag>
        <assignment>
          <assigndate date="2009"/>
          <rightholder> university of M    </rightholder>
        </assignment>
      </patent>
      <patent patno="101103062514">
        <office coden="EPO"  short="Eur. Pat. Office"> European Office    </office>
        <issue printdate="2009-12-00">6    </issue>
        <numpages>15    </numpages>
        <section code="A-3D"> structure and dynamics    </section>
        <patno>101103062514    </patno>
        <title> separation of cascades and photon emission    </title>
        <authgrp>
          <author affref="a1 a2">
            <givenname>L.    </givenname>
            <surname>Slabsky    </surname>
          </author>
          <author affref="a1">
            <givenname>D.    </givenname>
            <surname>Volosvyev    </surname>
          </author>
          <author affref="a3">
            <givenname>G.    </givenname>
            <surname>Nonpl    </surname>
          </author>
          <aff affid="a1"> Institute of Physics,Russia    </aff>
          <aff affid="a2"> Physics Institute, St. Petersburg     </aff>
          <aff affid="a3">Technische Universiteit, Dresden    </aff>
        </authgrp>
        <history>
          <received date="2009-01-11"/>
          <published date="2009-01-31"/>
        </history>
        <tag tagyr="2009">
          <tagcode>A1.B2.C3    </tagcode>
        </tag>
        <assignment>
          <assigndate date="2009"/>
          <rightholder> Physics Inst    </rightholder>
        </assignment>
      </patent>
</patents>

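I would like to get three tables from this XML file.

The first simply matches authors/inventors with their patents, the second matches patents with tags, and the third matches inventors/authors with institutions: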

Table 1 example

    Patent        Author1     Author2       Author3
    101103062330  Endo Wake   C. Morde
    101103062514  L. Slabsky  D. Volosvyev  G. Nonpl

Long format would also be fine.

    Patent        Author
    101103062330  Endo Wake
    101103062330  C. Morde
    101103062514  L. Slabsky
    101103062514  D. Volosvyev
    101103062514  G. Nonpl

Table 2 example

    Patent        Tag
    101103062330  B1.C2.B5
    101103062330  F4.65.F6
    101103062514  A1.B2.C3

Table 3 example

    Author        Institution
    Endo Wake     University of M
    C. Morde      University of M
    L. Slabsky    Institute of Physics,Russia
    D. Volosvyev  Physics Institute, St. Petersburg
    G. Nonpl      Technische Universiteit, Dresden

I tried:

xmlfile     <- xmlInternalTreeParse("filename.xml", useInternal = T)
nodes     <- getNodeSet(xmlfile, "//patent")
authors     <- lapply(nodes, xpathSApply, ".//author", xmlValue)
patent     <- sapply(nodes, xpathSApply, ".//patent", xmlValue)

No luck. It does not separate the author names within the group.

I also tried:

dt1 <- ldply(xmlToList(xmlfile), data.table)

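No luck. I got a table whose first column is "patent" and whose second column is an assortment of data.

I am new to the XML package, so I would appreciate some help.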

【Question discussion】:

And the code you have tried so far is…?
@hrbmstr see the question update

Tags: xml r

【Solution 1】:

Try this.

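Table 1

library(XML)

# `patents` is assumed to be the parsed document; "filename.xml" is the
# file name used in the question.
patents <- xmlParse("filename.xml")

# One data frame per patent: the patent number plus one row per author,
# with givenname and surname as separate columns.
lapply(
  getNodeSet(patents, "//patent"),
  function(patent) {
    data.frame(
      patent = xmlAttrs(patent)[["patno"]],
      xmlToDataFrame(
        # matches any element whose local name contains 'author'
        nodes = getNodeSet(patent, ".//*[contains(local-name(), 'author')]")
      ),
      stringsAsFactors = FALSE
    )
  }
)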
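The call above returns a list with one data frame per patent, and the name fields still carry the trailing spaces found in the file. The following is a minimal sketch of stacking such per-patent pieces into the long Table 1 format from the question. The inline XML is a cut-down copy of the question's sample, and `child_text` is a hypothetical helper introduced here for illustration, not a function of the XML package.

```r
library(XML)

# Cut-down stand-in for the question's file (names padded with spaces).
xml_text <- '<patents>
  <patent patno="101103062330">
    <authgrp>
      <author><givenname>Endo    </givenname><surname>Wake    </surname></author>
      <author><givenname>C.    </givenname><surname>Morde    </surname></author>
    </authgrp>
  </patent>
  <patent patno="101103062514">
    <authgrp>
      <author><givenname>L.    </givenname><surname>Slabsky    </surname></author>
    </authgrp>
  </patent>
</patents>'
patents <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: trimmed text of the first child called `name`,
# or NA when the author has no such child.
child_text <- function(node, name) {
  value <- xpathSApply(node, paste0("./", name), xmlValue)
  if (length(value) == 0) NA_character_ else trimws(value[[1]])
}

per_patent <- lapply(
  getNodeSet(patents, "//patent"),
  function(patent) {
    authors <- getNodeSet(patent, ".//author")
    # Both name fields are read from the same author node, so names
    # cannot get mixed up between authors.
    given   <- vapply(authors, child_text, character(1), name = "givenname")
    surname <- vapply(authors, child_text, character(1), name = "surname")
    data.frame(
      patent = rep(xmlGetAttr(patent, "patno"), length(authors)),
      author = paste(given, surname),
      stringsAsFactors = FALSE
    )
  }
)

# Stack the per-patent data frames into one long table.
table1 <- do.call(rbind, per_patent)
```

Reading the two name fields per author node, rather than collecting all given names and all surnames separately, keeps the pairs aligned even if one author lacks a field.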

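Table 2

# `patents` is assumed to be the parsed document, e.g. xmlParse("filename.xml").
# One data frame per patent: the patent number repeated for each tag code.
lapply(
  getNodeSet(patents, "//patent"),
  function(patent) {
    data.frame(
      patent = xmlAttrs(patent)[["patno"]],
      tag = xpathSApply(patent, ".//tagcode", xmlValue),
      stringsAsFactors = FALSE
    )
  }
)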
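As an alternative sketch for Table 2, the tag codes can be collected in a single pass over the document, reading the patent number from each node's ancestors. It assumes every `tagcode` sits at `patent/tag/tagcode`, as in the question's sample; the inline XML is a cut-down copy of that sample.

```r
library(XML)

# Cut-down stand-in for the question's file.
xml_text <- '<patents>
  <patent patno="101103062330">
    <tag tagyr="2009">
      <tagcode>B1.C2.B5    </tagcode>
      <tagcode>F4.65.F6    </tagcode>
    </tag>
  </patent>
  <patent patno="101103062514">
    <tag tagyr="2009">
      <tagcode>A1.B2.C3    </tagcode>
    </tag>
  </patent>
</patents>'
patents <- xmlParse(xml_text, asText = TRUE)

tag_nodes <- getNodeSet(patents, "//patent/tag/tagcode")

table2 <- data.frame(
  # tagcode -> tag -> patent: climb two levels to reach the patno attribute.
  patent = vapply(
    tag_nodes,
    function(n) xmlGetAttr(xmlParent(xmlParent(n)), "patno"),
    character(1)
  ),
  # Strip the padding around each tag code.
  tag = vapply(tag_nodes, function(n) trimws(xmlValue(n)), character(1)),
  stringsAsFactors = FALSE
)
```

This yields one already-stacked data frame, at the cost of hard-coding the nesting depth between `tagcode` and `patent`.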

Table 3

I will leave the join to you.

# `patents` is assumed to be the parsed document, e.g. xmlParse("filename.xml").
# For each author group, return list(affiliations, authors).
lapply(
  getNodeSet(patents, "//authgrp"),
  function(autg) {
    # One row per <aff>; aff_id is NA when the element has no affid attribute.
    aff_df <- do.call(
      rbind,
      xpathApply(
        autg,
        ".//aff",
        function(aff) {
          data.frame(
            aff_id = xmlGetAttr(aff, "affid", NA_character_),
            institution = xmlValue(aff),
            stringsAsFactors = FALSE
          )
        }
      )
    )

    authors <- getNodeSet(autg, "./author")
    aut_df <- xmlToDataFrame(nodes = authors)
    # affref may hold several space-separated ids (e.g. "a1 a2"); NA when absent.
    aut_df$aff_id <- sapply(authors, xmlGetAttr, "affref", NA_character_)

    list(aff_df, aut_df)
  }
)
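One possible way to do the join is sketched below. It rests on two assumptions read off the question's sample rather than any documented schema: `affref` is a whitespace-separated list of `affid` values, and an author without `affref` belongs to every `aff` in the same `authgrp`. It also assumes each author has both a `givenname` and a `surname`. `link_group` is a hypothetical helper name, and the inline XML is a cut-down copy of the sample.

```r
library(XML)

# Cut-down stand-in for the question's file.
xml_text <- '<patents>
  <patent patno="101103062330">
    <authgrp>
      <author><givenname>Endo    </givenname><surname>Wake    </surname></author>
      <author><givenname>C.    </givenname><surname>Morde    </surname></author>
      <aff> University of M, USA    </aff>
    </authgrp>
  </patent>
  <patent patno="101103062514">
    <authgrp>
      <author affref="a1 a2"><givenname>L.    </givenname><surname>Slabsky    </surname></author>
      <author affref="a3"><givenname>G.    </givenname><surname>Nonpl    </surname></author>
      <aff affid="a1"> Institute of Physics,Russia    </aff>
      <aff affid="a2"> Physics Institute, St. Petersburg     </aff>
      <aff affid="a3">Technische Universiteit, Dresden    </aff>
    </authgrp>
  </patent>
</patents>'
patents <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: author/institution pairs for one <authgrp>.
link_group <- function(autg) {
  affs    <- getNodeSet(autg, "./aff")
  aff_id  <- vapply(affs, xmlGetAttr, character(1),
                    name = "affid", default = NA_character_)
  aff_txt <- vapply(affs, function(a) trimws(xmlValue(a)), character(1))

  rows <- lapply(getNodeSet(autg, "./author"), function(a) {
    name <- paste(trimws(xpathSApply(a, "./givenname", xmlValue)),
                  trimws(xpathSApply(a, "./surname", xmlValue)))
    ref <- xmlGetAttr(a, "affref", NA_character_)
    inst <- if (is.na(ref)) {
      aff_txt                      # assumption: no affref means every <aff> applies
    } else {
      ids <- strsplit(trimws(ref), "\\s+")[[1]]
      aff_txt[match(ids, aff_id)]  # NA when an id has no matching <aff>
    }
    data.frame(author = rep(name, length(inst)), institution = inst,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

table3 <- do.call(rbind, lapply(getNodeSet(patents, "//authgrp"), link_group))
```

An author who references several affiliations, such as `affref="a1 a2"`, gets one row per institution, which keeps the table in long format.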

【Discussion】: