【Question title】: Python / BeautifulSoup - Scraping XML data from Clinicaltrials.gov API - parse data within XML parent/child tags
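【Posted】: 2021-12-28 20:14:52
【Question】:

I am new to working with XML and BeautifulSoup, and I am trying to fetch a dataset of clinical trials using Clinicaltrials.gov's new API, which returns a list of trials as an XML dataset. I tried using find_all() the way I usually do with HTML, but I am not having the same luck. I tried a few other approaches, such as converting to a string and splitting (very messy), but I don't want to clutter my code with failed attempts.

Bottom line: I want to extract all the NCTIds (I know I could convert the whole thing to a string and use a regex, but I want to learn how to parse XML properly) and the official title of each clinical trial listed in the XML file. Any help is appreciated!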

import requests
from bs4 import BeautifulSoup
from lxml import etree
import lxml.html

url = 'https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
m1_nctid = soup.find_all('Field Name="NCTId"') #This comes back with 0 results
m1_officialtitle = soup.find_all('Field Name="OfficialTitle"') #This comes back with 0 results

    Tags: python xml web-scraping beautifulsoup


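    【Solution 1】:

    You can search for the lowercase field tag and pass name as an attribute via attrs. The "lxml" parser is an HTML parser, so it lowercases tag and attribute names: Field Name="NCTId" in the response becomes field name="NCTId" in the soup. This works with BeautifulSoup alone; there is no need for etree.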

    import requests
    from bs4 import BeautifulSoup
    
    
    url = "https://clinicaltrials.gov/api/query/full_studies?expr=diabetes+telehealth+peer+support&+AREA%5BStartDate%5D+EXPAND%5BTerm%5D+RANGE%5B01%2F01%2F2020%2C+09%2F01%2F2020%5D&min_rnk=1&max_rnk=10&fmt=xml"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "lxml")
    
    m1_nctid = soup.find_all("field", attrs={"name": "NCTId"})
    m1_officialtitle = soup.find_all("field", attrs={"name": "OfficialTitle"})
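
The lowercasing behaviour can be checked without hitting the API. The sketch below uses a small inline sample whose element layout and NCT numbers are made up for illustration (the real response is more deeply nested), and compares the "lxml" HTML parser with BeautifulSoup's "xml" parser, which should keep the original case of tag and attribute names.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the API response (not real trial data).
SAMPLE_XML = """<FullStudiesResponse>
  <FullStudy Rank="1">
    <Field Name="NCTId">NCT00000001</Field>
    <Field Name="OfficialTitle">A Hypothetical Telehealth Study</Field>
  </FullStudy>
  <FullStudy Rank="2">
    <Field Name="NCTId">NCT00000002</Field>
    <Field Name="OfficialTitle">Another Hypothetical Study</Field>
  </FullStudy>
</FullStudiesResponse>"""

# HTML parser: tag and attribute names are lowercased, values are left alone.
html_soup = BeautifulSoup(SAMPLE_XML, "lxml")
lower_hits = html_soup.find_all("field", attrs={"name": "NCTId"})
upper_hits = html_soup.find_all("Field", attrs={"Name": "NCTId"})

# XML parser: names keep their original case, so search with the original spelling.
xml_soup = BeautifulSoup(SAMPLE_XML, "xml")
xml_hits = xml_soup.find_all("Field", attrs={"Name": "NCTId"})

print(len(lower_hits), len(upper_hits), [hit.text for hit in xml_hits])
```

Either parser works as long as the search uses the case that parser produces; the "xml" parser also requires lxml to be installed.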
    


      【Solution 2】:

      You can filter on attributes like this:

      # find_all is the current name; findAll is a legacy alias for the same method
      m1_nctid = soup.find_all("field", {"name": "NCTId"})
      m1_officialtitle = soup.find_all("field", {"name": "OfficialTitle"})
      

      Then iterate over the results to get the text of each one, for example:

      official_titles = [result.text for result in m1_officialtitle]
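
Two separate lists only line up if every trial has both fields. A possible alternative, sketched below, is to walk each study element and look up both fields inside it. This assumes each trial is wrapped in a FullStudy element (lowercased to fullstudy by the "lxml" parser); the sample markup, the NCT numbers, and the extract_trials helper are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second study deliberately has no OfficialTitle.
SAMPLE_XML = """<FullStudiesResponse>
  <FullStudy Rank="1">
    <Field Name="NCTId">NCT00000001</Field>
    <Field Name="OfficialTitle">A Hypothetical Telehealth Study</Field>
  </FullStudy>
  <FullStudy Rank="2">
    <Field Name="NCTId">NCT00000002</Field>
  </FullStudy>
</FullStudiesResponse>"""


def extract_trials(markup):
    """Return one dict per study, with None for any field that is missing."""
    soup = BeautifulSoup(markup, "lxml")
    trials = []
    for study in soup.find_all("fullstudy"):
        # Search only inside this study so IDs and titles stay paired.
        nct = study.find("field", attrs={"name": "NCTId"})
        title = study.find("field", attrs={"name": "OfficialTitle"})
        trials.append({
            "nct_id": nct.get_text(strip=True) if nct else None,
            "official_title": title.get_text(strip=True) if title else None,
        })
    return trials


trials = extract_trials(SAMPLE_XML)
for trial in trials:
    print(trial["nct_id"], trial["official_title"])
```

With the live API, response.content would be passed to extract_trials in place of SAMPLE_XML.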
      

      For more information, see the BeautifulSoup documentation.
