【Posted】: 2018-02-04 02:02:58
【Question】:
I'm currently writing a script to scrape data from ClinicalTrials.gov. To do this, I wrote the following function:
import requests
from bs4 import BeautifulSoup

def clinicalTrialsGov(id):
    # Fetch the XML view of a single study record and parse it
    url = "https://clinicaltrials.gov/ct2/show/" + id + "?displayxml=true"
    data = BeautifulSoup(requests.get(url).text, "lxml")
    studyType = data.study_type.text
    if studyType == 'Interventional':
        allocation = data.allocation.text
        interventionModel = data.intervention_model.text
        primaryPurpose = data.primary_purpose.text
        masking = data.masking.text
    enrollment = data.enrollment.text
    officialTitle = data.official_title.text
    condition = data.condition.text
    minAge = data.eligibility.minimum_age.text
    maxAge = data.eligibility.maximum_age.text
    gender = data.eligibility.gender.text
    healthyVolunteers = data.eligibility.healthy_volunteers.text
    armType = []
    intType = []
    # Collect the type of every intervention and every arm group
    for each in data.findAll('intervention'):
        intType.append(each.intervention_type.text)
    for each in data.findAll('arm_group'):
        armType.append(each.arm_group_type.text)
    # tryExceptCT is a helper that is not shown in the question
    citedPMID = tryExceptCT(data, '.results_reference.PMID')
    citedPMID = data.results_reference.PMID
    print(citedPMID)
    return officialTitle, studyType, allocation, interventionModel, primaryPurpose, masking, enrollment, condition, minAge, maxAge, gender, healthyVolunteers, armType, intType
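However, the script above does not always work, because not every study contains every item (i.e., a KeyError is raised). To get around this, I could simply wrap each statement in a try-except, like this: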
try:
    studyType = data.study_type.text
except:
    studyType = ""
But this seems like a poor way to implement it. What would be a better/cleaner solution?
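For illustration, one way the repeated try-except could be folded into a single lookup is sketched below. It swaps BeautifulSoup for the standard library's `xml.etree.ElementTree`, whose `findtext` accepts a default for missing elements; that swap is an assumption for the sake of a self-contained example, not something the question uses. `SAMPLE_XML`, `FIELDS`, and `parse_study` are hypothetical names, and the sample record is a made-up, trimmed-down stand-in for a real response. (With BeautifulSoup, a missing tag typically comes back as `None`, so the failure would usually surface as an `AttributeError` on `.text`.)

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down record used only for this sketch;
# a real response contains many more elements.
SAMPLE_XML = """
<clinical_study>
  <study_type>Interventional</study_type>
  <official_title>Example trial</official_title>
  <eligibility>
    <gender>All</gender>
    <minimum_age>18 Years</minimum_age>
  </eligibility>
  <intervention><intervention_type>Drug</intervention_type></intervention>
  <intervention><intervention_type>Device</intervention_type></intervention>
</clinical_study>
"""

# Output name -> path relative to the root element (illustrative subset).
FIELDS = {
    "studyType": "study_type",
    "officialTitle": "official_title",
    "enrollment": "enrollment",
    "minAge": "eligibility/minimum_age",
    "maxAge": "eligibility/maximum_age",
    "gender": "eligibility/gender",
}


def parse_study(xml_text, default=""):
    """Return a dict of study fields, using `default` for anything missing."""
    root = ET.fromstring(xml_text)
    # findtext returns `default` when no element matches the path,
    # so no per-field try-except is needed.
    record = {name: root.findtext(path, default) for name, path in FIELDS.items()}
    # Repeated elements: collect one value per <intervention> element.
    record["intType"] = [
        node.findtext("intervention_type", default)
        for node in root.iter("intervention")
    ]
    return record


record = parse_study(SAMPLE_XML)
print(record["studyType"], record["enrollment"] == "", record["intType"])
```

Keeping the field paths in one dictionary means a new field is a one-line change, and returning a dict avoids the long positional tuple. The same idea could be applied with BeautifulSoup by writing a small helper that checks for `None` before reading `.text`.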
【Comments】:
-
The site gives you a way to download the data without scraping: clinicaltrials.gov/ct2/resources/download
-
@Blender See the section on displaying a single record in XML.
Tags: python error-handling web-scraping data-extraction