【Question title】: XML file processing in R
【Posted】: 2021-03-16 17:33:09
【Question】:

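I am a relatively experienced R user, but I have never worked with XML files. I can load an XML file with xml2::read_xml, but I cannot get a usable list object out of it. The file is too large to paste here, so I am pasting a sample and a link instead. It is a hierarchical report: the grouping variables are the "partition" nodes, and the "field" nodes inside them hold the values of interest. My goal is to build a data frame (or tibble, in long format) from this data.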

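The value attribute of the partition nodes should become the row names. Yes, the partitions are nested, which means there are several such columns (name, sub-name, sub-sub-name and so on), and all of them matter. (The first or last three columns should be the name, startdate and enddate attributes of the universe node.) The type attribute of the field nodes should give the column names of my table, and the cells should hold the values of the field nodes.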
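To make the target layout concrete, here is a minimal sketch of the long and wide forms, assuming the tidyr package. The four numbers are copied from the "Short" and "Intermediate" partitions of the sample further down; the column names `region`, `currency`, `sector` and `maturity` are hypothetical labels for the four partition levels, not names that appear in the file.

```r
library(tidyr)

# Long format: one row per (partition path, field type) pair.
# The level names region/currency/sector/maturity are assumed for illustration.
long <- data.frame(
  region   = "Others",
  currency = "EUR",
  sector   = "Credit",
  maturity = c("Short", "Short", "Intermediate", "Intermediate"),
  field    = c("Total Return", "OAD", "Total Return", "OAD"),
  value    = c(0.01214342353634823, 0.32466544675466164,
               0.7722013521882509, 4.400948577468358)
)

# Wide format: one row per partition path, one column per field type.
wide <- pivot_wider(long, names_from = field, values_from = value)
print(wide)
```

In the real file some fields hold text (for example "Index Rating BOM"), so the value column would probably have to stay character in the long form and be converted after pivoting.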

Finally, here is a link to the whole file (7-zipped XML): https://drive.google.com/file/d/1LlnX6mePzd73rk9kitEBmi_yEjKQRs2i/view?usp=sharing

There is also a screenshot of the desired xlsx output format:

Example:

<?xml version="1.0" encoding="UTF-8" ?>
<results xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    reportname="IQ performance" startdate="11/30/2020" enddate="2/26/2021">
    <universe id="U-0001" dateid="D-0001" name="U-0001"
        startdate="11/30/2020" enddate="12/31/2020">
        <partition id="P-0001" type="6924700104024916117" value="All"
            bucketnum="-1">
            <field id="A-0008" type="Excess Return (MSR)">0.07676352900773953</field>
            <field id="A-0013" type="Total Return Lcl Ccy">0.04057259532874724</field>
            <field id="A-0006" type="Coupon Return">0.14286550062017067</field>
            <field id="A-0012" type="Total Return">-0.9958805536684379</field>
            <field id="A-0010" type="Prepay Penalty Return">0</field>
            <field id="A-0009" type="Paydown Return">0</field>
            <field id="A-0007" type="Currency Return">-1.0364531489970519</field>
            <field id="A-0011" type="Price Return">-0.10229316638965535</field>
            <field id="A-0015" type="S/A Yield">0.004483473613543104</field>
            <field id="A-0001" type="Number of Instruments">6453</field>
            <field id="A-0005" type="MktVal HLDS">19482017305.036118</field>
            <field id="A-0002" type="Index Rating BOM">AA2/AA3</field>
            <field id="A-0004" type="Market Value Previous Month">19427649084.733593</field>
            <field id="A-0014" type="OAD">3.7185477934428124</field>
            <field id="A-0016" type="OAS (To Worst)">0.20330610590838008</field>
            <field id="A-0003" type="Market Value BOM">19856386358.19814</field>
        </partition>
        <partition id="P-0001" type="6924700104024916117" value="Others"
            bucketnum="01_14">
            <partition id="P-0001" type="6924700104024916117" value="EUR"
                bucketnum="02_02">
                <partition id="P-0001" type="6924700104024916117"
                    value="Credit" bucketnum="11_3">
                    <partition id="P-0001" type="6924700104024916117"
                        value="Short" bucketnum="21_1">
                        <field id="A-0008" type="Excess Return (MSR)">0.07183258338108178</field>
                        <field id="A-0013" type="Total Return Lcl Ccy">0.01214342353634823</field>
                        <field id="A-0006" type="Coupon Return">0.31482269238130023</field>
                        <field id="A-0012" type="Total Return">0.01214342353634823</field>
                        <field id="A-0010" type="Prepay Penalty Return"></field>
                        <field id="A-0009" type="Paydown Return">0</field>
                        <field id="A-0007" type="Currency Return">0</field>
                        <field id="A-0011" type="Price Return">-0.3026896289808234</field>
                        <field id="A-0015" type="S/A Yield">-0.4928755405648736</field>
                        <field id="A-0001" type="Number of Instruments">2</field>
                        <field id="A-0005" type="MktVal HLDS">2548003.4519999996</field>
                        <field id="A-0002" type="Index Rating BOM">AA2/AA3</field>
                        <field id="A-0004" type="Market Value Previous Month">2544467.735</field>
                        <field id="A-0014" type="OAD">0.32466544675466164</field>
                        <field id="A-0016" type="OAS (To Worst)">0.22419945895458546</field>
                        <field id="A-0003" type="Market Value BOM">2547694.23</field>
                    </partition>
                    <partition id="P-0001" type="6924700104024916117"
                        value="Intermediate" bucketnum="22_2">
                        <field id="A-0008" type="Excess Return (MSR)">0.8805662483038379</field>
                        <field id="A-0013" type="Total Return Lcl Ccy">0.7722013521882509</field>
                        <field id="A-0006" type="Coupon Return">0.17896670856365482</field>
                        <field id="A-0012" type="Total Return">0.7722013521882509</field>
                        <field id="A-0010" type="Prepay Penalty Return"></field>
                        <field id="A-0009" type="Paydown Return">0</field>
                        <field id="A-0007" type="Currency Return">0</field>
                        <field id="A-0011" type="Price Return">0.5932365519324856</field>
                        <field id="A-0015" type="S/A Yield">0.345214292489727</field>
                        <field id="A-0001" type="Number of Instruments">5</field>
                        <field id="A-0005" type="MktVal HLDS">4722524.7425</field>
                        <field id="A-0002" type="Index Rating BOM">A1/A2</field>
                        <field id="A-0004" type="Market Value Previous Month">4652650.297499999</field>
                        <field id="A-0014" type="OAD">4.400948577468358</field>
                        <field id="A-0016" type="OAS (To Worst)">1.0808156019106423</field>
                        <field id="A-0003" type="Market Value BOM">4686336.749</field>
                    </partition>
                    <partition id="P-0001" type="6924700104024916117"
                        value="All" bucketnum="-1">
                        <field id="A-0008" type="Excess Return (MSR)">0.5957449505084811</field>
                        <field id="A-0013" type="Total Return Lcl Ccy">0.5045227640105621</field>
                        <field id="A-0006" type="Coupon Return">0.22681271683868687</field>
                        <field id="A-0012" type="Total Return">0.5045227640105621</field>
                        <field id="A-0010" type="Prepay Penalty Return"></field>
                        <field id="A-0009" type="Paydown Return">0</field>
                        <field id="A-0007" type="Currency Return">0</field>
                        <field id="A-0011" type="Price Return">0.2777076347568519</field>
                        <field id="A-0015" type="S/A Yield">0.051500310426545876</field>
                        <field id="A-0001" type="Number of Instruments">7</field>
                        <field id="A-0005" type="MktVal HLDS">7270528.194499999</field>
                        <field id="A-0002" type="Index Rating BOM">AA3/A1</field>
                        <field id="A-0004" type="Market Value Previous Month">7197118.032499999</field>
                        <field id="A-0014" type="OAD">2.9723888895704493</field>
                        <field id="A-0016" type="OAS (To Worst)">0.7806089551718131</field>
                        <field id="A-0003" type="Market Value BOM">7234030.979</field>
                    </partition>
                </partition>
                <partition id="P-0001" type="6924700104024916117" value="All"
                    bucketnum="-1">
                    <field id="A-0008" type="Excess Return (MSR)">0.5957449505084811</field>
                    <field id="A-0013" type="Total Return Lcl Ccy">0.5045227640105621</field>
                    <field id="A-0006" type="Coupon Return">0.22681271683868687</field>
                    <field id="A-0012" type="Total Return">0.5045227640105621</field>
                    <field id="A-0010" type="Prepay Penalty Return"></field>
                    <field id="A-0009" type="Paydown Return">0</field>
                    <field id="A-0007" type="Currency Return">0</field>
                    <field id="A-0011" type="Price Return">0.2777076347568519</field>
                    <field id="A-0015" type="S/A Yield">0.051500310426545876</field>
                    <field id="A-0001" type="Number of Instruments">7</field>
                    <field id="A-0005" type="MktVal HLDS">7270528.194499999</field>
                    <field id="A-0002" type="Index Rating BOM">AA3/A1</field>
                    <field id="A-0004" type="Market Value Previous Month">7197118.032499999</field>
                    <field id="A-0014" type="OAD">2.9723888895704493</field>
                    <field id="A-0016" type="OAS (To Worst)">0.7806089551718131</field>
                    <field id="A-0003" type="Market Value BOM">7234030.979</field>
                </partition>
            </partition>
            <partition id="P-0001" type="6924700104024916117" value="All"
                bucketnum="-1">
                <field id="A-0008" type="Excess Return (MSR)">0.5957449505084811</field>
                <field id="A-0013" type="Total Return Lcl Ccy">0.5045227640105621</field>
                <field id="A-0006" type="Coupon Return">0.22681271683868687</field>
                <field id="A-0012" type="Total Return">0.5045227640105621</field>
                <field id="A-0010" type="Prepay Penalty Return"></field>
                <field id="A-0009" type="Paydown Return">0</field>
                <field id="A-0007" type="Currency Return">0</field>
                <field id="A-0011" type="Price Return">0.2777076347568519</field>
                <field id="A-0015" type="S/A Yield">0.051500310426545876</field>
                <field id="A-0001" type="Number of Instruments">7</field>
                <field id="A-0005" type="MktVal HLDS">7270528.194499999</field>
                <field id="A-0002" type="Index Rating BOM">AA3/A1</field>
                <field id="A-0004" type="Market Value Previous Month">7197118.032499999</field>
                <field id="A-0014" type="OAD">2.9723888895704493</field>
                <field id="A-0016" type="OAS (To Worst)">0.7806089551718131</field>
                <field id="A-0003" type="Market Value BOM">7234030.979</field>
            </partition>
        </partition>

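【Comments】:

  • What exactly do you mean by "I cannot get a usable list object"? What do you need to do? The data you posted does not look like a data frame, so it is unclear what a data.frame of this data would look like. XML can be a highly nested format, whereas data.frames are mainly meant for rectangular data where every column has the same number of rows.
  • The list object does not contain the values, although the partitions and fields are there. The data is not simple rectangular data, but perhaps I can still solve my task in this case. It is like a pivot table in Excel, with summary rows, so I think it should fit a long data frame.
  • There are partition nodes inside partition nodes. Your description does not make clear which data you want to extract. Is it just the field nodes, without the partition information?
  • Thanks @Dave2e. I edited the original post: I added some clarifications and attached a screenshot.

Tags: r xml parsing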


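【Solution 1】:

I think I have found a solution. It has not been thoroughly tested and it could be more flexible, but I am sharing it as it stands: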

library(XML)
library(dplyr)

# Parse the report into an internal XML document.
mydoc <- xmlParse("file:///path/to/my/weird.xml")

# One name per attribute collected along results > universe > 4 x partition > field.
# NA entries tell tidyr::separate() to drop that piece.
newcols <- c("report", "rep_start_dt", "rep_end_dt", 
             "universe id", "dateid", NA,
             "startdate", "enddate",
             "partition id", NA, "region", "bucketnum1",
             NA, NA, "currency", "bucketnum2",
             NA, NA, "sector", "bucketnum3",
             NA, NA, "maturity", "bucketnum4",
             "field id", "field")

# Matches field nodes nested exactly four partition levels deep.
field_xpath <- "/results/universe/partition/partition/partition/partition/field"

rawdata <- bind_cols(
    # Join the attributes of every node from the root down to the field into one string.
    path = xpathSApply(mydoc, path = field_xpath,
                       function(y) paste(unlist(xmlAncestors(y, fun = xmlAttrs)), collapse = "|")),
    value = xpathSApply(mydoc, path = field_xpath, xmlValue)
) %>% 
    tidyr::separate(path, into = newcols, sep = "\\|", convert = FALSE) %>%
    # Parse the month/day/year strings into Date columns.
    mutate(across(c(rep_start_dt, rep_end_dt, startdate, enddate), lubridate::mdy))

# Write the table as SQL INSERT statements to a bzip2-compressed file.
DBI::sqlAppendTable(DBI::ANSI(), "iq_report", rawdata, row.names = FALSE) %>% write(file = bzfile("iq_report.sql.bz2"))
closeAllConnections()

If you have any comments, please share them. Peter
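The XPath in the answer is fixed at exactly four partition levels, so partitions at shallower depths (such as the value="All" summary directly under the universe) are not matched. As a minimal sketch of a depth-independent variant, assuming the xml2 package mentioned in the question, each field can be paired with the value attributes of all its enclosing partitions. The inline document below is invented here to mimic the shape of the sample (fewer attributes, rounded values); it is not the real file, and `partition_path` is a hypothetical helper name.

```r
library(xml2)

# Tiny stand-in for the report; for real data use read_xml("path/to/file.xml").
doc <- read_xml('
<results reportname="IQ performance" startdate="11/30/2020" enddate="2/26/2021">
  <universe id="U-0001" name="U-0001" startdate="11/30/2020" enddate="12/31/2020">
    <partition value="All">
      <field type="Total Return">-0.99</field>
    </partition>
    <partition value="Others">
      <partition value="EUR">
        <field type="Total Return">0.50</field>
        <field type="Index Rating BOM">AA3/A1</field>
      </partition>
    </partition>
  </universe>
</results>')

# Every field node, at any nesting depth.
fields <- xml_find_all(doc, "//field")

# Hypothetical helper: join the value attributes of all enclosing
# partition nodes of one field, outermost first.
partition_path <- function(node) {
  parts <- xml_find_all(node, "ancestor::partition")
  paste(xml_attr(parts, "value"), collapse = "/")
}

# Long format: one row per field node.
long <- data.frame(
  universe  = xml_attr(xml_find_first(fields, "ancestor::universe"), "name"),
  partition = vapply(seq_along(fields),
                     function(i) partition_path(fields[[i]]),
                     character(1)),
  field     = xml_attr(fields, "type"),
  value     = xml_text(fields),
  stringsAsFactors = FALSE
)
print(long)
```

The values stay character here because some fields hold text. From this long form, tidyr::separate() could split the partition path into one column per level, and tidyr::pivot_wider() could spread the field types into columns.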

【Comments】:
