【Question title】: How to get data out of nested xml structure?
【Posted】: 2020-10-22 06:43:58
【Question】:

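I am trying to use an API that returns its data as nested XML, and I would like to store the result as a data frame. My problem is that I don't know how to get the values out of this nested XML. Here is an example: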

# Sample data
library(xml2)
url <- "https://clinicaltrials.gov/api/query/full_studies?expr=neuro&min_rnk=1&max_rnk=20&fmt=xml"
download.file(url, destfile = "xml_data.xml")
fil <- "xml_data.xml"
dat <- xml2::read_xml(fil)

This gives a nested XML file, but I don't understand how to work with this structure.

<FullStudiesResponse>
  ....
  <FullStudyList>
    <FullStudy Rank="1">
      <Struct Name="Study">
        <Struct Name="ProtocolSection">
          <Struct Name="IdentificationModule">
            <Field Name="NCTId">NCT01843582</Field>

I can get to the FullStudyList with:

xml_find_all(x = dat, xpath = "//FullStudyList/FullStudy")

But if, for example, I want to get all the NCTId and Rank values, how do I reference them? So far I have tried

xml_find_all(x = dat, xpath = "//FullStudyList/FullStudy/NCTId")
xml_find_all(x = dat, xpath = "//FullStudyList/FullStudy/@NCTId")
xml_find_all(x = dat, xpath = "//FullStudyList/FullStudy//NCTId")

which obviously doesn't work. Or is there a better way to get the data from nested XML into a data frame?

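【Comments】:

  • What output do you want, more precisely? What should the columns of the data frame be? NCTId and Rank are called attributes; see ?xml_attr for how to get the value of an attribute.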
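For readers who still want to stay with XML, here is a minimal sketch of how the two values could be pulled out with xml2. In the structure shown in the question, Rank is an attribute of each FullStudy node, while NCTId is the text of a Field element whose Name attribute equals "NCTId", so the XPath has to match on that attribute rather than on an element name. The inline document below is a cut-down stand-in for the real response: the second study and its ID are invented for illustration, and the real file may contain more levels.

```r
library(xml2)

# Cut-down stand-in for the API response; the second study is made up
xml_txt <- '<FullStudiesResponse>
  <FullStudyList>
    <FullStudy Rank="1">
      <Struct Name="Study">
        <Struct Name="ProtocolSection">
          <Struct Name="IdentificationModule">
            <Field Name="NCTId">NCT01843582</Field>
          </Struct>
        </Struct>
      </Struct>
    </FullStudy>
    <FullStudy Rank="2">
      <Struct Name="Study">
        <Struct Name="ProtocolSection">
          <Struct Name="IdentificationModule">
            <Field Name="NCTId">NCT00000000</Field>
          </Struct>
        </Struct>
      </Struct>
    </FullStudy>
  </FullStudyList>
</FullStudiesResponse>'

doc <- read_xml(xml_txt)

# One node per study
studies <- xml_find_all(doc, "//FullStudyList/FullStudy")

out <- data.frame(
  # Rank is an attribute of <FullStudy>
  Rank = as.integer(xml_attr(studies, "Rank")),
  # NCTId is the text of the <Field> whose Name attribute is "NCTId";
  # xml_find_first() returns one result per study, NA if none is found
  NCTId = xml_text(xml_find_first(studies, ".//Field[@Name='NCTId']")),
  stringsAsFactors = FALSE
)

print(out)
```

Searching relative to each study (the leading `.` in the XPath) keeps Rank and NCTId aligned row by row even if a study has no matching field.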

Tags: r xml web-scraping nested


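【Solution 1】:

The short answer: don't use XML. The site's documentation says you can specify the fmt you want; it does not have to be XML. JSON is much easier to work with in R.

Try this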

library(httr)
library(jsonlite)
library(tibble)

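url <- "https://clinicaltrials.gov/api/query/full_studies?expr=neuro&min_rnk=1&max_rnk=20&fmt=json"
# Ask for the body as text so that fromJSON() always receives a JSON string
res <- fromJSON(content(GET(url), as = "text", encoding = "UTF-8"))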

The result is a nested list, but I guess you are interested in the data stored in FullStudies

df <- as_tibble(res$FullStudiesResponse$FullStudies)

This gives us

# A tibble: 20 x 2
    Rank Study$ProtocolS~ $$$OrgStudyIdIn~ $$$$OrgStudyIdT~ $$$$OrgStudyIdL~ $$$Organization~ $$$$OrgClass $$$BriefTitle $$$OfficialTitle $$$Acronym $$StatusModule$~
   <int> <chr>            <chr>            <chr>            <chr>            <chr>            <chr>        <chr>         <chr>            <chr>      <chr>           
 1     1 NCT02642055      NEURO+001        NA               NA               Neuro+           INDUSTRY     Efficacy of ~ Efficacy of NEU~ NA         May 2016        
 2     2 NCT01801813      RC12_0416        NA               NA               Nantes Universi~ OTHER        Risk Factors~ Observational S~ Craniosco~ March 2016      
 3     3 NCT03813290      DSRB A/2018/006~ NA               NA               National Health~ OTHER_GOV    A Neuro-Tech~ A Neuro-Technol~ NA         February 2020   
 4     4 NCT03773926      2018-A00604-51   NA               NA               Zeta Technologi~ INDUSTRY     Neuro-feedba~ Neuro-feedback ~ TNTA       December 2018   
 5     5 NCT04189172      AAG-O-H-1630     NA               NA               Aesculap AG      INDUSTRY     MiDura-Study~ Multicenter, In~ MiDura     May 2020        
 6     6 NCT03756337      PIC-20           NA               NA               Oticon Medical   INDUSTRY     Neuro 1 vs. ~ Comparison of A~ NA         November 2018   
 7     7 NCT03484143      P17.03           NA               NA               Vielight Inc.    INDUSTRY     Neuro RX Gam~ Vielight Neuro ~ NA         June 2020       
 8     8 NCT02138110      InVivo-100-101   NA               NA               InVivo Therapeu~ INDUSTRY     The INSPIRE ~ The INSPIRE Stu~ NA         December 2019   
 9     9 NCT03935724      A2017SCI03       NA               NA               Neuroplast       INDUSTRY     Clinical Stu~ A Multi-center,~ SCI2       September 2020  
10    10 NCT03798002      RiphahI Maryam ~ NA               NA               Riphah Internat~ OTHER        Neuro-muscul~ Effects of Neur~ NA         August 2019     
11    11 NCT03655262      R61MH113772      U.S. NIH Grant/~ https://project~ University of C~ OTHER        Treating Pho~ Treating Phobia~ NA         April 2019      
12    12 NCT04418609      Neuro-COVID-19   NA               NA               University of Z~ OTHER        Neuro-COVID-~ Neuro-COVID-19:~ Neuro-COV~ June 2020       
13    13 NCT01174329      1234             NA               NA               Universidad Aut~ OTHER        Treatment of~ Difference in S~ SALELECTR~ July 2010       
14    14 NCT04205019      A2019SCI04       NA               NA               Neuroplast       INDUSTRY     Safety Stem ~ A 3 Months Open~ SSCiSCI    September 2020  
15    15 NCT02941627      PIC_07           NA               NA               Oticon Medical   INDUSTRY     The Neuro Zt~ The Neuro Zti C~ NA         February 2017   
16    16 NCT03328195      P17.02           NA               NA               Vielight Inc.    INDUSTRY     Vielight Neu~ A Pilot Study E~ NA         September 2020  
17    17 NCT02401841      Policlinico 12   NA               NA               Policlinico Hos~ OTHER        Resolution o~ Resolution of N~ NA         October 2015    
18    18 NCT03882567      03/2015          NA               NA               Universidad Rey~ OTHER        Effectivenes~ Effectiveness o~ SCENAR     October 2019    
19    19 NCT04583163      2019-0945        NA               NA               Hackensack Meri~ OTHER        Variability ~ Inter- and Intr~ NA         October 2020    
20    20 NCT01845155      CMTR-TC-02       NA               NA               German Center f~ OTHER        Neuro-Music-~ Neuro-Music-The~ NA         February 2014   
# ... with 103 more variables: $$$OverallStatus <chr>, $$$ExpandedAccessInfo$HasExpandedAccess <chr>, $$$StartDateStruct$StartDate <chr>, $$$$StartDateType <chr>,
#   $$$PrimaryCompletionDateStruct$PrimaryCompletionDate <chr>, $$$$PrimaryCompletionDateType <chr>, $$$CompletionDateStruct$CompletionDate <chr>,
#   $$$$CompletionDateType <chr>, $$$StudyFirstSubmitDate <chr>, $$$StudyFirstSubmitQCDate <chr>, $$$StudyFirstPostDateStruct$StudyFirstPostDate <chr>,
#   $$$$StudyFirstPostDateType <chr>, $$$LastUpdateSubmitDate <chr>, $$$LastUpdatePostDateStruct$LastUpdatePostDate <chr>, $$$$LastUpdatePostDateType <chr>,
#   $$$ResultsFirstSubmitDate <chr>, $$$ResultsFirstSubmitQCDate <chr>, $$$ResultsFirstPostDateStruct$ResultsFirstPostDate <chr>, $$$$ResultsFirstPostDateType <chr>,
#   $$$LastKnownStatus <chr>, $$SponsorCollaboratorsModule$ResponsibleParty$ResponsiblePartyType <chr>, $$$$ResponsiblePartyInvestigatorFullName <chr>,
#   $$$$ResponsiblePartyInvestigatorTitle <chr>, $$$$ResponsiblePartyInvestigatorAffiliation <chr>, $$$$ResponsiblePartyOldNameTitle <chr>,
#   $$$$ResponsiblePartyOldOrganization <chr>, $$$LeadSponsor$LeadSponsorName <chr>, $$$$LeadSponsorClass <chr>, $$$CollaboratorList$Collaborator <list>,
#   $$OversightModule$OversightHasDMC <chr>, $$$IsFDARegulatedDrug <chr>, $$$IsFDARegulatedDevice <chr>, $$$IsUnapprovedDevice <chr>, $$$IsUSExport <chr>,
#   $$DescriptionModule$BriefSummary <chr>, $$$DetailedDescription <chr>, $$ConditionsModule$ConditionList$Condition <list>, $$$KeywordList$Keyword <list>,
#   $$DesignModule$StudyType <chr>, $$$PhaseList$Phase <list>, $$$DesignInfo$DesignAllocation <chr>, $$$$DesignInterventionModel <chr>,
#   $$$$DesignPrimaryPurpose <chr>, $$$$DesignMaskingInfo$DesignMasking <chr>, $$$$$DesignWhoMaskedList$DesignWhoMasked <list>, $$$$$DesignMaskingDescription <chr>,
#   $$$$DesignObservationalModelList$DesignObservationalModel <list>, $$$$DesignTimePerspectiveList$DesignTimePerspective <list>,
#   $$$$DesignInterventionModelDescription <chr>, $$$EnrollmentInfo$EnrollmentCount <chr>, $$$$EnrollmentType <chr>, $$$PatientRegistry <chr>,
#   $$$TargetDuration <chr>, $$ArmsInterventionsModule$ArmGroupList$ArmGroup <list>, $$$InterventionList$Intervention <list>,
#   $$OutcomesModule$PrimaryOutcomeList$PrimaryOutcome <list>, $$$SecondaryOutcomeList$SecondaryOutcome <list>, $$$OtherOutcomeList$OtherOutcome <list>,
#   $$EligibilityModule$EligibilityCriteria <chr>, $$$HealthyVolunteers <chr>, $$$Gender <chr>, $$$MinimumAge <chr>, $$$MaximumAge <chr>, $$$StdAgeList$StdAge <list>,
#   $$$StudyPopulation <chr>, $$$SamplingMethod <chr>, $$ContactsLocationsModule$OverallOfficialList$OverallOfficial <list>, $$$LocationList$Location <list>,
#   $$$CentralContactList$CentralContact <list>, $$IPDSharingStatementModule$IPDSharing <chr>, $$ReferencesModule$ReferenceList$Reference <list>,
#   $$$SeeAlsoLinkList$SeeAlsoLink <list>, $DerivedSection$MiscInfoModule$VersionHolder <chr>, $$$RemovedCountryList$RemovedCountry <list>,
#   $$ConditionBrowseModule$ConditionMeshList$ConditionMesh <list>, $$$ConditionAncestorList$ConditionAncestor <list>,
#   $$$ConditionBrowseLeafList$ConditionBrowseLeaf <list>, $$$ConditionBrowseBranchList$ConditionBrowseBranch <list>,
#   $$InterventionBrowseModule$InterventionBrowseLeafList$InterventionBrowseLeaf <list>, $$$InterventionBrowseBranchList$InterventionBrowseBranch <list>,
#   $ResultsSection$ParticipantFlowModule$FlowGroupList$FlowGroup <list>, $$$FlowPeriodList$FlowPeriod <list>, $$$FlowPreAssignmentDetails <chr>,
#   $$$FlowRecruitmentDetails <chr>, $$BaselineCharacteristicsModule$BaselinePopulationDescription <chr>, $$$BaselineGroupList$BaselineGroup <list>,
#   $$$BaselineDenomList$BaselineDenom <list>, $$$BaselineMeasureList$BaselineMeasure <list>, $$OutcomeMeasuresModule$OutcomeMeasureList$OutcomeMeasure <list>,
#   $$AdverseEventsModule$EventsFrequencyThreshold <chr>, $$$EventsTimeFrame <chr>, $$$EventGroupList$EventGroup <list>, $$$SeriousEventList$SeriousEvent <list>,
#   $$$OtherEventList$OtherEvent <list>, $$MoreInfoModule$CertainAgreement$AgreementPISponsorEmployee <chr>, $$$$AgreementRestrictiveAgreement <chr>,
#   $$$PointOfContact$PointOfContactTitle <chr>, $$$$PointOfContactOrganization <chr>, $$$$PointOfContactEMail <chr>, $$$$PointOfContactPhone <chr>, ...
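The `$$$`-style column names in the printout are how tibble displays data-frame columns nested inside other data-frame columns. As a rough sketch of how such a result can be handled, the example below parses a small inline JSON string that imitates the shape of the response (the two NCTId values are taken from the output above, the titles are placeholders, and the real response has many more fields), then reaches a nested column with `$` and flattens everything with `jsonlite::flatten()`.

```r
library(jsonlite)

# Small stand-in for the API response; BriefTitle values are placeholders
json_txt <- '{
  "FullStudiesResponse": {
    "FullStudies": [
      {"Rank": 1,
       "Study": {"ProtocolSection": {"IdentificationModule":
         {"NCTId": "NCT02642055", "BriefTitle": "Placeholder title A"}}}},
      {"Rank": 2,
       "Study": {"ProtocolSection": {"IdentificationModule":
         {"NCTId": "NCT01801813", "BriefTitle": "Placeholder title B"}}}}
    ]
  }
}'

res <- fromJSON(json_txt)
studies <- res$FullStudiesResponse$FullStudies

# Nested data-frame columns can be walked with `$`, one level at a time
ids <- studies$Study$ProtocolSection$IdentificationModule$NCTId

# flatten() turns the nested data frames into ordinary columns,
# joining the levels of each name with "."
flat <- jsonlite::flatten(studies)

print(ids)
print(names(flat))
```

Calling `jsonlite::flatten()` with its package prefix avoids picking up a `flatten()` from another attached package such as purrr.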

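【Comments】:

  • Yes! The JSON format together with df <- as_tibble(res$FullStudiesResponse$FullStudies) does the job. I was also struggling to get the JSON into shape, so I tried the XML format as a shortcut. But this is great. Thanks @ekoam
  • Thanks for this very useful solution. I just found a way to work with the resulting nested list from the clinicaltrial.gov JSON files directly (not via the API), in case anyone wants to use them directly.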