【Posted】: 2021-08-17 05:51:14
【Question】:
I am having trouble loading an XML file into an R data frame.
This is my XML structure [the data is made up]:
<?xml version="1.0" encoding="UTF-8"?>
<CancerExtract>
  <CancerRegRec>
    <Demographic>
      <PatientName>
        <PatSurname>Jones</PatSurname>
        <PatFirstName>John</PatFirstName>
        <PatSecondName>Peter</PatSecondName>
      </PatientName>
      <PatientDetail Sex="1" IndigStatus="12">
        <DOB>01012000</DOB>
        <MedicareNo>xxxx776xxx66xx</MedicareNo>
        <COB>1101</COB>
        <Language>1201</Language>
      </PatientDetail>
      <PatientAddress>
        <StreetAddr>1 Address Rd</StreetAddr>
        <Suburb>AwesomeCity</Suburb>
        <Postcode>ZZ304</Postcode>
      </PatientAddress>
    </Demographic>
    <Tumour>
      <TreatingDoctor>
        <TDSurname>Doctor</TDSurname>
        <TDFirstName>The Good</TDFirstName>
        <TDAddress>FixemUp ct</TDAddress>
        <TDMediProvidNo>DR0001</TDMediProvidNo>
      </TreatingDoctor>
      <HospitalEpisode>
        <HospitalName>FixMeUp</HospitalName>
        <CampusCode>0000</CampusCode>
        <URN>123456</URN>
        <AdmissionDate>01012020</AdmissionDate>
        <DischargeDate>03012020</DischargeDate>
      </HospitalEpisode>
      <TumourDetail Grade="1" ECOG="9">
        <DiagnosisDate>01012050</DiagnosisDate>
        <PrimarySite>C61</PrimarySite>
        <Morph>81403</Morph>
        <Investigations>8 8 7 10 3</Investigations>
        <AdditInfo>Some free text can be available here</AdditInfo>
      </TumourDetail>
      <CStage Stage="9" StagingSystem="99"/>
      <GP>
        <GPSurname>MyGP</GPSurname>
        <GPFirstName>Peter</GPFirstName>
        <GPAddress>100 GP street</GPAddress>
      </GP>
      <RegDetail>
        <RegName>Some name</RegName>
        <RegDate>05122021</RegDate>
      </RegDetail>
    </Tumour>
  </CancerRegRec>
  <CancerRegRec>
    <Demographic>
      <PatientName>
        <PatSurname>Pt2</PatSurname>
        <PatFirstName>Frits</PatFirstName>
        <PatSecondName/>
      </PatientName>
      <PatientDetail Sex="4" IndigStatus="22" SomeOtherVariable="random value">
        <DOB>12121834</DOB>
        <MedicareNo>xxxxxxxx00001</MedicareNo>
        <COB>1201</COB>
        <Language>1201</Language>
      </PatientDetail>
      <PatientAddress>
        <StreetAddr>1 church street</StreetAddr>
        <Suburb>Cityname Here</Suburb>
        <Postcode>7777YY</Postcode>
      </PatientAddress>
    </Demographic>
    <Tumour>
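      <TreatingDoctor>
        <TDSurname>Jansen</TDSurname>
        <TDFirstName>Jan</TDFirstName>
        <TDAddress>Jansen rd</TDAddress>
        <TDMediProvidNo>DVR0001</TDMediProvidNo>
      </TreatingDoctor>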
      <HospitalEpisode>
        <HospitalName>HospitalName two </HospitalName>
        <CampusCode>2166192</CampusCode>
        <URN>10REWR8XX640</URN>
        <AdmissionDate>23122025</AdmissionDate>
        <DischargeDate>23122027</DischargeDate>
      </HospitalEpisode>
      <TumourDetail EstDateFlag="1" PriorDiagFlag="Y" Laterality="8">
        <DiagnosisDate>01121812</DiagnosisDate>
        <WhereDiagnosed>At home</WhereDiagnosed>
        <PrimarySite>C9000</PrimarySite>
        <Morph>81403</Morph>
        <Investigations>7 3 1</Investigations>
        <MetSite>C792 C788</MetSite>
        <AdditInfo>This is a second record. </AdditInfo>
      </TumourDetail>
      <CStage Stage="9" StagingSystem="99"/>
      <GP>
        <GPSurname>Jones</GPSurname>
        <GPFirstName>John</GPFirstName>
        <GPAddress>Test street 12 Unit 1</GPAddress>
      </GP>
      <RegDetail>
        <RegName>Me Myself and I</RegName>
        <RegDate>01011801</RegDate>
      </RegDetail>
    </Tumour>
  </CancerRegRec>
</CancerExtract>
I created this R function to load the file and extract all the data:
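load_XML_File <- function(file) {
  # Read the file; on a warning or error, report it and fall back to NA
  load <- tryCatch(
    expr = { xml2::read_xml(file) },
    warning = function(warning_condition) {
      message(paste("\n\n\nWarning loading file: ", file))
      message("\nHere's the original warning message:\n")
      message(conditionMessage(warning_condition))
      return(NA)
    },
    error = function(error_condition) {
      message(paste("\n\n\nError loading file: ", file))
      message("\nHere's the original error message:\n")
      message(conditionMessage(error_condition))
      return(NA)
    },
    # Note: finally runs whether or not reading succeeded
    finally = {
      message(paste0("\nLoaded file ", file))
    }
  )
  # Stop here if reading failed; xml_find_all() cannot work on NA
  if (!inherits(load, "xml_document")) return(NA)

  # One node per patient record, converted to nested R lists
  PerPt <- xml2::xml_find_all(load, ".//CancerRegRec")
  tmp <- xml2::as_list(PerPt)
  if (length(tmp) == 0) return(NA)

  # Flatten each record to a one-row data frame and stack the rows,
  # filling columns that a record lacks with NA
  for (i in seq_along(tmp)) {
    tt <- data.frame(t(data.frame(unlist(tmp[i]))))
    rownames(tt) <- NULL
    if (i == 1) { out <- tt } else { out <- plyr::rbind.fill(out, tt) }
  }
  return(out)
}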
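This works well and is fast enough for my purposes, but it does NOT extract the attributes. How would I change my function so that the attributes are included as well?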
> load_XML_File(file)
Loaded file H:/TMP/testFile.xml
  Demographic.PatientName.PatSurname Demographic.PatientName.PatFirstName Demographic.PatientName.PatSecondName Demographic.PatientDetail.DOB
1                              Jones                                 John                                 Peter                      01012000
2                                Pt2                                Frits                                  <NA>                      12121834
  Demographic.PatientDetail.MedicareNo Demographic.PatientDetail.COB Demographic.PatientDetail.Language Demographic.PatientAddress.StreetAddr
1                       xxxx776xxx66xx                          1101                               1201                          1 Address Rd
2                        xxxxxxxx00001                          1201                               1201                       1 church street
  Demographic.PatientAddress.Suburb Demographic.PatientAddress.Postcode Tumour.TreatingDoctor.TDSurname Tumour.TreatingDoctor.TDFirstName
1                       AwesomeCity                               ZZ304                          Doctor                          The Good
2                     Cityname Here                              7777YY                          Jansen                               Jan
  Tumour.TreatingDoctor.TDAddress Tumour.TreatingDoctor.TDMediProvidNo Tumour.HospitalEpisode.HospitalName Tumour.HospitalEpisode.CampusCode
1                      FixemUp ct                               DR0001                             FixMeUp                              0000
2                       Jansen rd                              DVR0001                   HospitalName two                            2166192
  Tumour.HospitalEpisode.URN Tumour.HospitalEpisode.AdmissionDate Tumour.HospitalEpisode.DischargeDate Tumour.TumourDetail.DiagnosisDate
1                     123456                             01012020                             03012020                          01012050
2               10REWR8XX640                             23122025                             23122027                          01121812
  Tumour.TumourDetail.PrimarySite Tumour.TumourDetail.Morph Tumour.TumourDetail.Investigations        Tumour.TumourDetail.AdditInfo Tumour.GP.GPSurname
1                             C61                     81403                         8 8 7 10 3 Some free text can be available here                MyGP
2                           C9000                     81403                              7 3 1            This is a second record.                Jones
  Tumour.GP.GPFirstName   Tumour.GP.GPAddress Tumour.RegDetail.RegName Tumour.RegDetail.RegDate Tumour.TumourDetail.WhereDiagnosed Tumour.TumourDetail.MetSite
1                 Peter         100 GP street                Some name                 05122021                               <NA>                        <NA>
2                  John Test street 12 Unit 1          Me Myself and I                 01011801                            At home                   C792 C788
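One possible direction, sketched under assumptions rather than tested against the real file: as far as I understand, `xml2::as_list()` keeps XML attributes as R attributes on the list elements, and `unlist()` discards those, which would explain why they never reach the data frame. The sketch below skips `as_list()` and reads text with `xml2::xml_text()` and attributes with `xml2::xml_attrs()` directly from the nodes. The helper names `flatten_record` and `load_XML_File_attrs`, the `<path>.<attribute>` column naming, and the cut-down `xml_txt` sample are illustrative assumptions, not part of the original function. Missing columns are filled in base R here instead of with `plyr::rbind.fill`.

```r
library(xml2)

# Hypothetical helper (not from the original post): flatten one record node
# into a one-row data frame, adding attributes as "<path>.<attribute>" columns.
flatten_record <- function(rec) {
  nodes <- xml_find_all(rec, ".//*")  # all descendant elements
  # Column stem per node: path relative to the record, with "/" replaced by "."
  stems <- substring(xml_path(nodes), nchar(xml_path(rec)) + 2)
  stems <- gsub("/", ".", stems, fixed = TRUE)

  # Text of leaf elements (no child elements); empty elements are dropped.
  # Assumption: no element mixes text with child elements.
  is_leaf <- xml_length(nodes) == 0
  texts <- stats::setNames(xml_text(nodes[is_leaf]), stems[is_leaf])
  texts <- texts[nzchar(texts)]

  # Attributes: one named character vector per node, flattened by unlist()
  attrs <- unlist(stats::setNames(xml_attrs(nodes), stems))

  data.frame(as.list(c(texts, attrs)),
             stringsAsFactors = FALSE, check.names = FALSE)
}

# Hypothetical variant of load_XML_File() that also returns attributes
load_XML_File_attrs <- function(file) {
  doc  <- read_xml(file)
  recs <- xml_find_all(doc, ".//CancerRegRec")
  if (length(recs) == 0) return(NA)
  rows <- lapply(recs, flatten_record)
  # Give every row the same columns, filling the gaps with NA
  cols <- unique(unlist(lapply(rows, names)))
  rows <- lapply(rows, function(r) {
    for (m in setdiff(cols, names(r))) r[[m]] <- NA_character_
    r[cols]
  })
  do.call(rbind, rows)
}

# Cut-down version of the sample data, passed as a literal XML string
xml_txt <- '<CancerExtract>
  <CancerRegRec>
    <Demographic>
      <PatientName><PatSurname>Jones</PatSurname></PatientName>
      <PatientDetail Sex="1" IndigStatus="12"><DOB>01012000</DOB></PatientDetail>
    </Demographic>
    <Tumour><CStage Stage="9" StagingSystem="99"/></Tumour>
  </CancerRegRec>
  <CancerRegRec>
    <Demographic>
      <PatientName><PatSurname>Pt2</PatSurname></PatientName>
      <PatientDetail Sex="4" IndigStatus="22" SomeOtherVariable="random value"><DOB>12121834</DOB></PatientDetail>
    </Demographic>
    <Tumour><CStage Stage="9" StagingSystem="99"/></Tumour>
  </CancerRegRec>
</CancerExtract>'

out <- load_XML_File_attrs(xml_txt)
```

In this sketch an attribute that exists on only some records, such as `SomeOtherVariable`, becomes its own column with `NA` for the other records, mirroring how the original function handles elements that are missing from a record.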
【Discussion】: