【Question title】: R: load XML into a data frame, and include attributes
【Posted】: 2021-08-17 05:51:14
【Question】:

I am having trouble loading an XML file into an R data frame.

This is my XML structure [the data is made up]:

<?xml version="1.0" encoding="UTF-8"?>
<CancerExtract>
  <CancerRegRec>
    <Demographic>
      <PatientName>
        <PatSurname>Jones</PatSurname>
        <PatFirstName>John</PatFirstName>
        <PatSecondName>Peter</PatSecondName>
      </PatientName>
      <PatientDetail Sex="1" IndigStatus="12">
        <DOB>01012000</DOB>
        <MedicareNo>xxxx776xxx66xx</MedicareNo>
        <COB>1101</COB>
        <Language>1201</Language>
      </PatientDetail>
      <PatientAddress>
        <StreetAddr>1 Address Rd</StreetAddr>
        <Suburb>AwesomeCity</Suburb>
        <Postcode>ZZ304</Postcode>
      </PatientAddress>
    </Demographic>
    <Tumour>
      <TreatingDoctor>
        <TDSurname>Doctor</TDSurname>
        <TDFirstName>The Good</TDFirstName>
        <TDAddress>FixemUp ct</TDAddress>
        <TDMediProvidNo>DR0001</TDMediProvidNo>
      </TreatingDoctor>
      <HospitalEpisode>
        <HospitalName>FixMeUp</HospitalName>
        <CampusCode>0000</CampusCode>
        <URN>123456</URN>
        <AdmissionDate>01012020</AdmissionDate>
        <DischargeDate>03012020</DischargeDate>
      </HospitalEpisode>
      <TumourDetail Grade="1" ECOG="9">
        <DiagnosisDate>01012050</DiagnosisDate>
        <PrimarySite>C61</PrimarySite>
        <Morph>81403</Morph>
        <Investigations>8 8 7 10 3</Investigations>
        <AdditInfo>Some free text can be available here</AdditInfo>
      </TumourDetail>
      <CStage Stage="9" StagingSystem="99"/>
      <GP>
        <GPSurname>MyGP</GPSurname>
        <GPFirstName>Peter</GPFirstName>
        <GPAddress>100 GP street</GPAddress>
      </GP>
      <RegDetail>
        <RegName>Some name</RegName>
        <RegDate>05122021</RegDate>
      </RegDetail>
    </Tumour>
  </CancerRegRec>
  <CancerRegRec>
    <Demographic>
      <PatientName>
        <PatSurname>Pt2</PatSurname>
        <PatFirstName>Frits</PatFirstName>
        <PatSecondName/>
      </PatientName>
      <PatientDetail Sex="4" IndigStatus="22" SomeOtherVariable="random value">
        <DOB>12121834</DOB>
        <MedicareNo>xxxxxxxx00001</MedicareNo>
        <COB>1201</COB>
        <Language>1201</Language>
      </PatientDetail>
      <PatientAddress>
        <StreetAddr>1 church street</StreetAddr>
        <Suburb>Cityname Here</Suburb>
        <Postcode>7777YY</Postcode>
      </PatientAddress>
    </Demographic>
    <Tumour>
      <TreatingDoctor>
        <TDSurname>Jansen</TDSurname>
        <TDFirstName>Jan</TDFirstName>
        <TDAddress>Jansen rd</TDAddress>
        <TDMediProvidNo>DVR0001</TDMediProvidNo>
      </TreatingDoctor>
      <HospitalEpisode>
        <HospitalName>HospitalName two </HospitalName>
        <CampusCode>2166192</CampusCode>
        <URN>10REWR8XX640</URN>
        <AdmissionDate>23122025</AdmissionDate>
        <DischargeDate>23122027</DischargeDate>
      </HospitalEpisode>
      <TumourDetail EstDateFlag="1" PriorDiagFlag="Y" Laterality="8">
        <DiagnosisDate>01121812</DiagnosisDate>
        <WhereDiagnosed>At home</WhereDiagnosed>
        <PrimarySite>C9000</PrimarySite>
        <Morph>81403</Morph>
        <Investigations>7 3 1</Investigations>
        <MetSite>C792 C788</MetSite>
        <AdditInfo>This is a second record. </AdditInfo>
      </TumourDetail>
      <CStage Stage="9" StagingSystem="99"/>
      <GP>
        <GPSurname>Jones</GPSurname>
        <GPFirstName>John</GPFirstName>
        <GPAddress>Test street 12 Unit 1</GPAddress>
      </GP>
      <RegDetail>
        <RegName>Me Myself and I</RegName>
        <RegDate>01011801</RegDate>
      </RegDetail>
    </Tumour>
  </CancerRegRec>
</CancerExtract>

I wrote this R function to load the file and extract all the data:

load_XML_File <- function(file){
  
  load <-   tryCatch(expr    = { xml2::read_xml(file) }, 
  warning = function(warning_condition) {
    message(paste("\n\n\nWarning loading file: ", file))
    message("\nHere's the original warning message:\n")
    message(warning_condition)
    return(NA)
  }, 
  error   = function(error_condition) {
    message(paste("\n\n\nError loading file: ", file))
    message("\nHere's the original error message:\n")
    message(error_condition)
    return(NA)
  }, 
  finally = {
    message(paste0("\nLoaded file ", file))
    }
  )
  
  
  PerPt    <- xml2::xml_find_all(load, ".//CancerRegRec")
  tmp      <- xml2::as_list(PerPt)

  if(length(tmp) == 0){out <- NA}
  if(length(tmp) >= 1){
    
    for(i in 1:length(tmp)){
      
      tt <- data.frame(t(data.frame(unlist(tmp[i]))))
      rownames(tt) <- NULL
      if(i == 1){out <- tt}
      if(i >  1){out <- plyr::rbind.fill(out,  tt)}
    }
    
   
  }
  
  return(out)
}

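This works well and is fast enough for my purposes, but it does NOT extract the attributes. How would I change my function so that the attributes are included as well?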

> load_XML_File(file)

Loaded file H:/TMP/testFile.xml
  Demographic.PatientName.PatSurname Demographic.PatientName.PatFirstName Demographic.PatientName.PatSecondName Demographic.PatientDetail.DOB
1                              Jones                                 John                                 Peter                      01012000
2                                Pt2                                Frits                                  <NA>                      12121834
  Demographic.PatientDetail.MedicareNo Demographic.PatientDetail.COB Demographic.PatientDetail.Language Demographic.PatientAddress.StreetAddr
1                       xxxx776xxx66xx                          1101                               1201                          1 Address Rd
2                        xxxxxxxx00001                          1201                               1201                       1 church street
  Demographic.PatientAddress.Suburb Demographic.PatientAddress.Postcode Tumour.TreatingDoctor.TDSurname Tumour.TreatingDoctor.TDFirstName
1                       AwesomeCity                               ZZ304                          Doctor                          The Good
2                     Cityname Here                              7777YY                          Jansen                               Jan
  Tumour.TreatingDoctor.TDAddress Tumour.TreatingDoctor.TDMediProvidNo Tumour.HospitalEpisode.HospitalName Tumour.HospitalEpisode.CampusCode
1                      FixemUp ct                               DR0001                             FixMeUp                              0000
2                       Jansen rd                              DVR0001                   HospitalName two                            2166192
  Tumour.HospitalEpisode.URN Tumour.HospitalEpisode.AdmissionDate Tumour.HospitalEpisode.DischargeDate Tumour.TumourDetail.DiagnosisDate
1                     123456                             01012020                             03012020                          01012050
2               10REWR8XX640                             23122025                             23122027                          01121812
  Tumour.TumourDetail.PrimarySite Tumour.TumourDetail.Morph Tumour.TumourDetail.Investigations        Tumour.TumourDetail.AdditInfo Tumour.GP.GPSurname
1                             C61                     81403                         8 8 7 10 3 Some free text can be available here                MyGP
2                           C9000                     81403                              7 3 1            This is a second record.                Jones
  Tumour.GP.GPFirstName   Tumour.GP.GPAddress Tumour.RegDetail.RegName Tumour.RegDetail.RegDate Tumour.TumourDetail.WhereDiagnosed Tumour.TumourDetail.MetSite
1                 Peter         100 GP street                Some name                 05122021                               <NA>                        <NA>
2                  John Test street 12 Unit 1          Me Myself and I                 01011801                            At home                   C792 C788

【Comments】:

    Tags: r xml dataframe


    【Answer 1】:

    The attributes appear to be present in tmp:

      PerPt    <- xml2::xml_find_all(load, ".//CancerRegRec")
      tmp      <- xml2::as_list(PerPt)
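
One way to confirm this is to parse a small fragment and look at the R attributes that xml2::as_list() attaches to each list node. This is a minimal sketch, assuming the xml2 package is installed; the inline XML string is a cut-down, hypothetical stand-in for the file in the question.

```r
library(xml2)

# Cut-down stand-in for the question's file (hypothetical inline string)
doc <- read_xml('<CancerExtract><CancerRegRec><Demographic>
  <PatientDetail Sex="1" IndigStatus="12"><DOB>01012000</DOB></PatientDetail>
</Demographic></CancerRegRec></CancerExtract>')

tmp <- as_list(xml_find_all(doc, ".//CancerRegRec"))

# XML attributes are stored as R attributes on the matching list node,
# next to the usual "names" attribute
pd <- tmp[[1]]$Demographic$PatientDetail
attributes(pd)

# unlist() only walks list members, so the attributes do not show up here
flat <- unlist(tmp[[1]])
names(flat)
```

This is why the function in the question loses them: unlist() flattens members and ignores attributes.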
    

    This function recursively visits every element of the list and turns each element's attributes into members of that element.

    move_attr_to_member <- function(x) {
        ## capture names, and attributes but not names
        names <- names(x)
        attributes <- attributes(unname(x))
    
        ## recursive application
        if (is.list(x))
            x <- lapply(x, move_attr_to_member)
    
        ## return x (with attributes but not names removed) and attributes
        attributes(x) <- NULL
        names(x) <- names
        c(x, attributes)
    }
    

    It can be used like this:

    list_with_attrs_as_members <- move_attr_to_member(tmp)
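
To see the effect in isolation, the function can be applied to a small hand-built list that mimics what xml2::as_list() returns for a single element. This sketch uses base R only. The function is repeated here, with the recursive call spelled out, so that the block runs on its own; the `node` object is made up for illustration.

```r
move_attr_to_member <- function(x) {
    ## capture names, and all other attributes
    names <- names(x)
    attributes <- attributes(unname(x))

    ## recurse into child elements
    if (is.list(x))
        x <- lapply(x, move_attr_to_member)

    ## strip attributes, restore names, append attributes as members
    attributes(x) <- NULL
    names(x) <- names
    c(x, attributes)
}

# Hypothetical node: children are list members, XML attributes are R attributes
node <- structure(
    list(DOB = list("01012000"), COB = list("1101")),
    Sex = "1", IndigStatus = "12"
)

before <- unlist(node)                       # attributes are not included
after  <- unlist(move_attr_to_member(node))  # attributes are now members
names(before)
names(after)
```

After the call, `Sex` and `IndigStatus` sit alongside `DOB` and `COB`, so unlist() picks them up.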
    

    A tibble is then easy to create:

    dplyr::bind_rows(lapply(list_with_attrs_as_members, unlist))
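
Records do not always share the same fields or attributes (in the question, only the second record has WhereDiagnosed, MetSite and Laterality). A minimal sketch of how dplyr::bind_rows() treats named vectors as rows and fills the gaps with NA, assuming dplyr is installed; `rec1` and `rec2` are made-up stand-ins for the unlist() results.

```r
library(dplyr)

# Hypothetical flattened records with partly different fields
rec1 <- c(PatSurname = "Jones", Sex = "1", Grade = "1")
rec2 <- c(PatSurname = "Pt2",   Sex = "4", Laterality = "8")

# Each named vector becomes one row; columns are the union of all names
out <- bind_rows(list(rec1, rec2))
out
```

All columns come out as character, so dates and codes such as DOB keep their leading zeros until you convert them deliberately.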
    

    I would carefully check the output of move_attr_to_member() to make sure it does the right thing!
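
One possible way to do such a check is to pull the attributes straight from the parsed document with xml2::xml_attrs(), independently of as_list(), and compare them with the attribute columns of the final table. This is a sketch of an alternative extraction route, not part of the answer's method; it assumes xml2 is installed, and the inline XML and the helper `attrs_of_record()` are hypothetical.

```r
library(xml2)

# Cut-down stand-in for the question's file (hypothetical inline string)
doc <- read_xml('<CancerExtract>
  <CancerRegRec>
    <Demographic><PatientDetail Sex="1" IndigStatus="12"><DOB>01012000</DOB></PatientDetail></Demographic>
    <Tumour><CStage Stage="9" StagingSystem="99"/></Tumour>
  </CancerRegRec>
  <CancerRegRec>
    <Demographic><PatientDetail Sex="4" IndigStatus="22"><DOB>12121834</DOB></PatientDetail></Demographic>
    <Tumour><CStage Stage="9" StagingSystem="99"/></Tumour>
  </CancerRegRec>
</CancerExtract>')

records <- xml_find_all(doc, ".//CancerRegRec")

# Collect every attribute below one record as "Element.attribute" = value
attrs_of_record <- function(rec) {
  nodes <- xml_find_all(rec, ".//*[@*]")  # descendants with at least one attribute
  vals <- xml_attrs(nodes)                # list of named character vectors
  names(vals) <- xml_name(nodes)
  unlist(vals)
}

expected <- lapply(records, attrs_of_record)
expected[[2]]
```

The names here use only the element name rather than the full path, so compare values and counts against the tibble's attribute columns rather than expecting identical column names.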
