【Posted】:2021-07-02 19:32:55
【Question】:
I have a poorly designed HTML page that I am trying to extract data from using Scrapy. The following snippet is the part I am interested in:
<html>
<h2 class="schoolName">Graduate School of Business</h2>
<ul title="Graduate School of Business departments - part 1"></ul>
<ul title="Graduate School of Business departments - part 2"></ul>
<ul title="Graduate School of Business departments - part 3"></ul>
<h2 class="schoolName">School of Law</h2>
<ul title="School of Law departments - part 1"></ul>
<ul title="School of Law departments - part 2"></ul>
<h2 class="schoolName">School of Medicine</h2>
<ul title="School of Medicine departments - part 1"></ul>
</html>
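Specifically, I want to know the number of schools and the number of departments under each school. So I got the list of all schools as follows: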
>>> schools = response.xpath('//h2[@class="schoolName"]/text()').getall()
>>> schools
['Graduate School of Business', 'School of Law', 'School of Medicine']
Then, for each school, I looked up the departments under it as follows:
>>> for school in schools:
...     print(school)
...     print(response.xpath(f'//h2[@class="schoolName"][text()[contains(.,"{school}")]]/following-sibling::ul/@title').extract())
...     print("-----------------------------")
...
Graduate School of Business
['Graduate School of Business departments - part 1', 'Graduate School of Business departments - part 2', 'Graduate School of Business departments - part 3', 'School of Law departments - part 1', 'School of Law departments - part 2', 'School of Medicine departments - part 1']
-----------------------------
School of Law
['School of Law departments - part 1', 'School of Law departments - part 2', 'School of Medicine departments - part 1']
-----------------------------
School of Medicine
['School of Medicine departments - part 1']
-----------------------------
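This clearly does not work as intended, because following-sibling selects every ul tag that comes after the h2, not just the ones between two consecutive h2 tags. How can I achieve this?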
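To make the intended grouping concrete, here is a minimal sketch of the result I am after. It is not Scrapy code: it assumes the snippet above is well-formed enough for the standard-library `xml.etree.ElementTree` parser (real pages often are not), and the names `PAGE` and `group_departments` are made up for illustration. It walks the siblings in document order and attaches each ul to the most recent h2.

```python
import xml.etree.ElementTree as ET

# The snippet from the question. It happens to be well-formed XML,
# which is an assumption that will not hold for most real HTML pages.
PAGE = """<html>
<h2 class="schoolName">Graduate School of Business</h2>
<ul title="Graduate School of Business departments - part 1"></ul>
<ul title="Graduate School of Business departments - part 2"></ul>
<ul title="Graduate School of Business departments - part 3"></ul>
<h2 class="schoolName">School of Law</h2>
<ul title="School of Law departments - part 1"></ul>
<ul title="School of Law departments - part 2"></ul>
<h2 class="schoolName">School of Medicine</h2>
<ul title="School of Medicine departments - part 1"></ul>
</html>"""


def group_departments(markup):
    """Map each school name to the titles of the ul siblings that follow it."""
    groups = {}
    current = None  # name of the most recently seen school heading
    # Iterating an Element yields its direct children in document order.
    for element in ET.fromstring(markup):
        if element.tag == "h2" and element.get("class") == "schoolName":
            current = element.text.strip()
            groups[current] = []
        elif element.tag == "ul" and current is not None:
            # This ul belongs to the nearest preceding h2.
            groups[current].append(element.get("title"))
    return groups


departments = group_departments(PAGE)
print(len(departments), "schools")
for school, titles in departments.items():
    print(school, len(titles), titles)
```

In XPath terms, the same "nearest preceding h2" idea could presumably be written as a predicate on the ul, for example `following-sibling::ul[preceding-sibling::h2[1] = "School of Law"]`, but I have not confirmed that against the real page.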
【Comments】:
Tags: python web-scraping xpath scrapy