[Question title]: Parse XML to a data frame including children and attributes in R
[Posted]: 2019-10-14 12:04:07
[Question]:

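I am trying to create a data frame from the attached XML: https://1drv.ms/u/s!Am7buNMZi-gwgeBmbk6A-NRIRarjYw?e=Pcgm7c

For every player, I need the player's own columns together with the information about their team (the parent node).

XML sample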

<SoccerFeed timestamp="20190519T183022+0000">
  <SoccerDocument Type="SQUADS Latest" competition_code="ES_PL" competition_id="23" competition_name="Spanish La Liga" season_id="2018" season_name="Season 2018/2019">
    <Team country="Spain" country_id="4" country_iso="ES" official_club_name="Deportivo Alavés S.A.D." region_id="17" region_name="Europe" short_club_name="Alavés" uID="t173">
      <Founded>1921</Founded>
      <Name>Alavés</Name>
      <Player uID="p91406">
        <Name>Fernando Pacheco</Name>
        <Position>Goalkeeper</Position>
        <Stat Type="first_name">Fernando</Stat>
        <Stat Type="last_name">Pacheco</Stat>
        <Stat Type="birth_date">1992-05-18</Stat>
        <Stat Type="birth_place">Badajoz</Stat>
        <Stat Type="first_nationality">Spain</Stat>
        <Stat Type="preferred_foot">Left</Stat>
        <Stat Type="weight">81</Stat>
        <Stat Type="height">186</Stat>
        <Stat Type="jersey_num">1</Stat>
        <Stat Type="real_position">Goalkeeper</Stat>
        <Stat Type="real_position_side">Unknown</Stat>
        <Stat Type="join_date">2015-08-07</Stat>
        <Stat Type="country">Spain</Stat>
      </Player>
      <Player uID="p176245">
        <Name>Antonio Sivera</Name>
        <Position>Goalkeeper</Position>
        <Stat Type="first_name">Antonio</Stat>
        <Stat Type="last_name">Sivera</Stat>
        <Stat Type="birth_date">1996-08-11</Stat>
        <Stat Type="birth_place">Jávea</Stat>
        <Stat Type="first_nationality">Spain</Stat>
        <Stat Type="preferred_foot">Right</Stat>
        <Stat Type="weight">75</Stat>
        <Stat Type="height">184</Stat>
        <Stat Type="jersey_num">13</Stat>
        <Stat Type="real_position">Goalkeeper</Stat>
        <Stat Type="real_position_side">Unknown</Stat>
        <Stat Type="join_date">2017-07-19</Stat>
        <Stat Type="country">Spain</Stat>
      </Player>
     </Team>
     <Team city="Madrid" country="Spain" country_id="4" country_iso="ES" official_club_name="Club Atlético de Madrid S.A.D" postal_code="28005" region_id="17" region_name="Europe" short_club_name="Atlético" street="Paseo Virgen del Puerto, 67" uID="t175" web_address="www.clubatleticodemadrid.com/">
      <Founded>1903</Founded>
      <Name>Atlético de Madrid</Name>
      <Player uID="p59981">
        <Name>Antonio Adán</Name>
        <Position>Goalkeeper</Position>
        <Stat Type="first_name">Antonio</Stat>
        <Stat Type="last_name">Adán</Stat>
        <Stat Type="birth_date">1987-05-13</Stat>
        <Stat Type="birth_place">Madrid</Stat>
        <Stat Type="first_nationality">Spain</Stat>
        <Stat Type="preferred_foot">Left</Stat>
        <Stat Type="weight">92</Stat>
        <Stat Type="height">190</Stat>
        <Stat Type="jersey_num">1</Stat>
        <Stat Type="real_position">Goalkeeper</Stat>
        <Stat Type="real_position_side">Unknown</Stat>
        <Stat Type="join_date">2018-07-10</Stat>
        <Stat Type="country">Spain</Stat>
      </Player>
      <Player uID="p81352">
        <Name>Jan Oblak</Name>
        <Position>Goalkeeper</Position>
        <Stat Type="first_name">Jan</Stat>
        <Stat Type="last_name">Oblak</Stat>
        <Stat Type="birth_date">1993-01-07</Stat>
        <Stat Type="birth_place">Skojfa Loka</Stat>
        <Stat Type="first_nationality">Slovenia</Stat>
        <Stat Type="preferred_foot">Right</Stat>
        <Stat Type="weight">87</Stat>
        <Stat Type="height">188</Stat>
        <Stat Type="jersey_num">13</Stat>
        <Stat Type="real_position">Goalkeeper</Stat>
        <Stat Type="real_position_side">Unknown</Stat>
        <Stat Type="join_date">2014-07-16</Stat>
        <Stat Type="country">Slovenia</Stat>
      </Player>
     </Team>
   </SoccerDocument>
</SoccerFeed>



Columns I want

Team columns

  • country (SoccerFeed/SoccerDocument/Team attribute)
  • country_id (SoccerFeed/SoccerDocument/Team attribute)
  • country_iso (SoccerFeed/SoccerDocument/Team attribute)
  • official_club_name (SoccerFeed/SoccerDocument/Team attribute)
  • region_id (SoccerFeed/SoccerDocument/Team attribute)
  • region_name (SoccerFeed/SoccerDocument/Team attribute)
  • short_club_name (SoccerFeed/SoccerDocument/Team attribute)
  • team_uID (SoccerFeed/SoccerDocument/Team attribute uID)
  • team_name (SoccerFeed/SoccerDocument/Team/Name)
  • team_founded (SoccerFeed/SoccerDocument/Team/Founded)

Player columns

  • player_uID (/SoccerFeed/SoccerDocument/Team/Player attribute uID)
  • player_name (/SoccerFeed/SoccerDocument/Team/Player/Name)
  • player_position (/SoccerFeed/SoccerDocument/Team/Player/Position)
  • player_first_name (/SoccerFeed/SoccerDocument/Team/Player/Stat Type = first_name)
  • player_last_name (/SoccerFeed/SoccerDocument/Team/Player/Stat Type = last_name)
  • player_birth_place (/SoccerFeed/SoccerDocument/Team/Player/Stat Type = birth_place)
  • player_preferred_foot (/SoccerFeed/SoccerDocument/Team/Player/Stat Type = preferred_foot)
  • ... the other player stats (weight, height, jersey number, ..., country)

I am not interested in the Team and Player nodes that appear further down, in the /SoccerFeed/SoccerDocument/PlayerChanges section.

I started with xml2 together with the tidyverse to collect the player information, but I cannot get the parent team information or the players' individual stats.


library(xml2)
library(tidyverse)
library(plyr)



x <- read_xml("squads.xml")

players <- x %>% 
  xml_find_all('/SoccerFeed/SoccerDocument/Team/Player') %>% 
  map_df(~flatten(c(xml_attrs(.x), 
                    map(xml_children(.x), 
                        ~set_names(as.list(xml_text(.x)), xml_name(.x)))))) %>%
  type_convert()

[Comments]:

    Tags: r xml tidyverse xml2


    [Solution 1]:

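    Since you are using xml2 and need data from nodes at different nesting levels, consider XSLT, a special-purpose language (like SQL) designed to transform XML files. In R, the xslt package, a sister package to xml2, can run XSLT 1.0 scripts. XSLT's recursive, template-driven nature helps you avoid complex nested loops or maps at the application layer, which here is R. XSLT is also portable (again like SQL) and can be run outside R.

    Although this may be an entirely new concept with a learning curve, it cleanly flattens your XML into the two-dimensional structure a data set needs. You also separate the XML processing (XSLT) from the data processing (R). Specifically, only the Player level is retained, and the corresponding Team data is migrated down into each Player (see the demo).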
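
For comparison, the application-layer route mentioned above (mapping over nodes in R) might look roughly like the following. This is only a sketch using xml2 and dplyr, not part of the XSLT approach: the inline XML is a trimmed sample built from values in the question, and `player_row()` is a hypothetical helper name introduced here for illustration.

```r
library(xml2)

# Trimmed inline sample in the shape of the question's XML (assumption: most
# attributes and Stat nodes are omitted for brevity).
xml_txt <- '<SoccerFeed>
  <SoccerDocument>
    <Team country="Spain" uID="t173">
      <Founded>1921</Founded>
      <Player uID="p91406">
        <Name>Fernando Pacheco</Name>
        <Position>Goalkeeper</Position>
        <Stat Type="weight">81</Stat>
        <Stat Type="height">186</Stat>
      </Player>
    </Team>
    <Team country="Spain" uID="t175">
      <Founded>1903</Founded>
      <Player uID="p81352">
        <Name>Jan Oblak</Name>
        <Position>Goalkeeper</Position>
        <Stat Type="weight">87</Stat>
        <Stat Type="height">188</Stat>
      </Player>
    </Team>
  </SoccerDocument>
</SoccerFeed>'
doc <- read_xml(xml_txt)

# Hypothetical helper: build a one-row data frame for a single Player node,
# pulling the team-level values from the parent Team node.
player_row <- function(p) {
  team  <- xml_parent(p)
  stats <- xml_find_all(p, "./Stat")

  team_attrs <- xml_attrs(team)                     # named character vector
  names(team_attrs) <- paste0("team_", names(team_attrs))

  vals <- c(
    team_attrs,
    team_founded    = xml_text(xml_find_first(team, "./Founded")),
    player_uID      = xml_attr(p, "uID"),
    player_name     = xml_text(xml_find_first(p, "./Name")),
    player_position = xml_text(xml_find_first(p, "./Position")),
    # One column per Stat, named after its Type attribute
    setNames(xml_text(stats), paste0("player_", xml_attr(stats, "Type")))
  )
  as.data.frame(as.list(vals), stringsAsFactors = FALSE)
}

# The absolute path only reaches Players directly under SoccerDocument/Team.
players   <- xml_find_all(doc, "/SoccerFeed/SoccerDocument/Team/Player")
player_df <- dplyr::bind_rows(lapply(players, player_row))
```

`dplyr::bind_rows()` is used so that players with differing sets of Stat nodes would still bind, with missing values filled as NA. All columns come back as character.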

    XSLT (save as .xsl, a special kind of .xml file)

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output indent="yes"/>
      <xsl:strip-space elements="*"/>
    
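      <!-- Descend only into SoccerDocument and its Team children, so Team/Player nodes under PlayerChanges are skipped -->
      <xsl:template match="/SoccerFeed|SoccerDocument">
          <xsl:apply-templates select="SoccerDocument|Team"/>
      </xsl:template>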
    
      <xsl:template match="Team">
          <xsl:apply-templates select="Player"/>
      </xsl:template>
    
      <xsl:template match="Team/@*">
        <xsl:element name="{concat('team_', name(.))}">
          <xsl:value-of select="."/>      
        </xsl:element>
      </xsl:template>
    
      <xsl:template match="Player">
        <xsl:copy>
          <xsl:apply-templates select="ancestor::Team/@*"/>
          <xsl:copy-of select="Name|Position"/>
          <xsl:apply-templates select="@*|Stat"/>
        </xsl:copy>
      </xsl:template>
    
      <xsl:template match="Player/@*">
        <xsl:element name="{name(.)}">
          <xsl:value-of select="."/>      
        </xsl:element>
      </xsl:template>
    
      <xsl:template match="Stat">
        <xsl:element name="{@Type}">
          <xsl:value-of select="text()"/>     
        </xsl:element>
      </xsl:template>
    </xsl:stylesheet>
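
To see what the Player, attribute, and Stat templates produce, here is a reduced sketch that runs a cut-down stylesheet on an inline sample with `xslt::xml_xslt()`. The `Players` wrapper element and the direct `select` path in the root template are assumptions added here so that the result has a single root; the sample XML is trimmed from the question.

```r
library(xml2)
library(xslt)

# Trimmed sample in the shape of the question's XML
doc <- read_xml('<SoccerFeed><SoccerDocument>
  <Team country="Spain" uID="t173">
    <Founded>1921</Founded>
    <Player uID="p91406">
      <Name>Fernando Pacheco</Name>
      <Stat Type="weight">81</Stat>
    </Player>
  </Team>
</SoccerDocument></SoccerFeed>')

# Cut-down stylesheet: same Player / attribute / Stat templates as above,
# with a <Players> wrapper (an assumption of this sketch) around the output.
style <- read_xml('<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:strip-space elements="*"/>
  <xsl:template match="/">
    <Players>
      <xsl:apply-templates select="SoccerFeed/SoccerDocument/Team/Player"/>
    </Players>
  </xsl:template>
  <xsl:template match="Player">
    <xsl:copy>
      <xsl:apply-templates select="ancestor::Team/@*"/>
      <xsl:copy-of select="Name"/>
      <xsl:apply-templates select="@*|Stat"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="Team/@*">
    <xsl:element name="{concat(\'team_\', name(.))}">
      <xsl:value-of select="."/>
    </xsl:element>
  </xsl:template>
  <xsl:template match="Player/@*">
    <xsl:element name="{name(.)}"><xsl:value-of select="."/></xsl:element>
  </xsl:template>
  <xsl:template match="Stat">
    <xsl:element name="{@Type}"><xsl:value-of select="."/></xsl:element>
  </xsl:template>
</xsl:stylesheet>')

flat   <- xml_xslt(doc, style)
player <- xml_find_first(flat, "//Player")

# Each Team attribute, Player attribute, and Stat is now a child element
child_names <- xml_name(xml_children(player))
child_text  <- xml_text(xml_children(player))
print(setNames(child_text, child_names))
```

Each Player ends up as a flat record whose children are named after the team attributes (prefixed with `team_`), the player attributes, and the `Type` of each Stat, which is the shape the R code further down binds into rows.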
    

    Online Demo

    R (produces a data frame in which every column is of character type)

    library(xml2)
    library(xslt)
    library(dplyr)
    
    # INPUT SOURCE
    doc <- read_xml("/path/to/Input.xml")
    style <- read_xml("/path/to/Style.xsl")
    
    # TRANSFORM 
    new_xml <- xml_xslt(doc, style)
    
    # RETRIEVE Player NODES
    recs <- xml_find_all(new_xml, "//Player")
    
    # BIND EACH CHILD TEXT AND NAME TO Player DFs
    df_list <- lapply(recs, function(r) 
        data.frame(rbind(setNames(xml_text(xml_children(r)), 
                                  xml_name(xml_children(r)))),
                   stringsAsFactors = FALSE)
    )
    
    # BIND ALL DFs TO SINGLE MASTER DF
    final_df <- bind_rows(df_list)
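
Because every column comes back as character, a follow-up conversion step is usually wanted. A minimal sketch, assuming base R's `type.convert()` is acceptable for guessing column types; the small data frame below is only a stand-in for `final_df`, with values taken from the question's sample.

```r
# Stand-in for the all-character final_df (assumption: only a few columns shown)
final_df <- data.frame(
  uID       = c("p91406", "p81352"),
  weight    = c("81", "87"),
  height    = c("186", "188"),
  join_date = c("2015-08-07", "2014-07-16"),
  stringsAsFactors = FALSE
)

# Guess numeric/logical columns; as.is = TRUE keeps the rest as character
typed_df <- utils::type.convert(final_df, as.is = TRUE)

# Dates are not detected automatically, so convert them explicitly
typed_df$join_date <- as.Date(typed_df$join_date)

str(typed_df)
```

`readr::type_convert()`, which the question already uses, is an alternative that also guesses date columns.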
    

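    [Discussion]:

    • This was the right solution for me; the XSLT approach is very elegant, thanks!
    • Great! Thanks for the opportunity to use XSLT. Few people like it as much as I do!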