【Question title】: Scrapy xpath with following sibling between two h2 tags
【Posted】: 2021-07-02 19:32:55
【Question】:

I have a poorly designed HTML page from which I am trying to extract data using Scrapy. The following snippet is the part I am interested in:

<html>
    <h2 class="schoolName">Graduate School of Business</h2>
        <ul title="Graduate School of Business departments - part 1"></ul>
        <ul title="Graduate School of Business departments - part 2"></ul>
        <ul title="Graduate School of Business departments - part 3"></ul>
   <h2 class="schoolName">School of Law</h2>
       <ul title="School of Law departments - part 1"></ul>
       <ul title="School of Law departments - part 2"></ul>
  <h2 class="schoolName">School of Medicine</h2>
      <ul title="School of Medicine departments - part 1"></ul>
</html>

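Specifically, I want to know the number of schools and the number of departments under each school. So I get the list of all schools as follows: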

>>> schools = response.xpath('//h2[@class="schoolName"]/text()').getall()
>>> schools
['Graduate School of Business', 'School of Law', 'School of Medicine']

Then, for each school, I look up the departments under it as follows:

>>> for school in schools:
...     print(school)
...     print(response.xpath(f'//h2[@class="schoolName"][text()[contains(.,"{school}")]]/following-sibling::ul/@title').extract())
...     print ("-----------------------------")
...
Graduate School of Business
['Graduate School of Business departments - part 1', 'Graduate School of Business departments - part 2', 'Graduate School of Business departments - part 3', 'School of Law departments - part 1', 'School of Law departments - part 2', 'School of Medicine departments - part 1']
-----------------------------
School of Law
['School of Law departments - part 1', 'School of Law departments - part 2', 'School of Medicine departments - part 1']
-----------------------------
School of Medicine
['School of Medicine departments - part 1']
-----------------------------

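This obviously does not work as intended, because following-sibling selects every ul that comes after the h2, not just the ones between two h2 tags. How can I restrict the selection to the ul elements between one h2 and the next?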

【Comments】:

    Tags: python web-scraping xpath scrapy


    【Solution 1】:

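    One technique is to pick the common delimiter element that marks the start of each new block of information (here, h2), measure its position with count(preceding-sibling::h2) + 1, and then select every data element (here, ul) that has exactly that many h2 elements among its preceding siblings.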
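
To see why the count identifies a block, it can help to look at the number each ul gets. The following is a minimal sketch on a made-up toy document (the titles a1, a2, b1 are hypothetical and not from the page in the question); it only assumes lxml is installed.

```python
from lxml import etree

# Toy document: two headings, each followed by its own <ul> siblings.
doc = etree.fromstring(
    "<html>"
    "<h2>A</h2><ul title='a1'/><ul title='a2'/>"
    "<h2>B</h2><ul title='b1'/>"
    "</html>"
)

# For every <ul>, count the <h2> siblings that come before it.
# count() returns a float in lxml, so it is converted to int here.
counts = {
    ul.get("title"): int(ul.xpath("count(preceding-sibling::h2)"))
    for ul in doc.xpath("//ul")
}
print(counts)  # {'a1': 1, 'a2': 1, 'b1': 2}
```

Every ul in the first block has one h2 before it and every ul in the second block has two, so filtering on that number separates the blocks.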

    In an IPython shell:

    In [1]: from lxml import etree
    
    In [2]: string = '''<html>
       ...:     <h2 class="schoolName">Graduate School of Business</h2>
       ...:         <ul title="Graduate School of Business departments - part 1"></ul>
       ...:         <ul title="Graduate School of Business departments - part 2"></ul>
       ...:         <ul title="Graduate School of Business departments - part 3"></ul>
       ...:    <h2 class="schoolName">School of Law</h2>
       ...:        <ul title="School of Law departments - part 1"></ul>
       ...:        <ul title="School of Law departments - part 2"></ul>
       ...:   <h2 class="schoolName">School of Medicine</h2>
       ...:       <ul title="School of Medicine departments - part 1"></ul>
       ...: </html>'''
    
    In [3]: root = etree.fromstring(string)
    
    In [4]: schools = root.xpath('//h2[@class="schoolName"]/text()')
    
    In [5]: schools
    Out[5]: ['Graduate School of Business', 'School of Law', 'School of Medicine']
    
    In [6]: for school in schools:
       ...:     print (school)
       ...:     position = int(root.xpath(f'count(//h2[text()="{school}"]/preceding-sibling::h2) + 1'))
       ...:     print (f"Position: {position}")
       ...:     print (root.xpath(f'//ul[count(preceding-sibling::h2) = {position}]/@title'))
       ...: 
    Graduate School of Business
    Position: 1
    ['Graduate School of Business departments - part 1', 'Graduate School of Business departments - part 2', 'Graduate School of Business departments - part 3']
    School of Law
    Position: 2
    ['School of Law departments - part 1', 'School of Law departments - part 2']
    School of Medicine
    Position: 3
    ['School of Medicine departments - part 1']
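
The session above looks each h2 up again by its text, which could misbehave if a school name contains a double quote or if two schools share a name. As a hedged variation on the same counting technique, the sketch below iterates over the h2 elements themselves and passes the position as an XPath variable. The function name departments_by_school and the variable name pos are made up for this example, and lxml is assumed to be installed.

```python
from lxml import etree

HTML = '''<html>
    <h2 class="schoolName">Graduate School of Business</h2>
        <ul title="Graduate School of Business departments - part 1"></ul>
        <ul title="Graduate School of Business departments - part 2"></ul>
        <ul title="Graduate School of Business departments - part 3"></ul>
    <h2 class="schoolName">School of Law</h2>
        <ul title="School of Law departments - part 1"></ul>
        <ul title="School of Law departments - part 2"></ul>
    <h2 class="schoolName">School of Medicine</h2>
        <ul title="School of Medicine departments - part 1"></ul>
</html>'''


def departments_by_school(root):
    """Map each school name to the <ul> titles between it and the next <h2>."""
    result = {}
    for h2 in root.xpath('//h2[@class="schoolName"]'):
        # Position of this heading among its <h2> siblings (1-based).
        pos = int(h2.xpath('count(preceding-sibling::h2)')) + 1
        # Keep only the following <ul> siblings that have exactly `pos`
        # <h2> elements before them, i.e. those in this heading's block.
        titles = h2.xpath(
            'following-sibling::ul[count(preceding-sibling::h2) = $pos]/@title',
            pos=pos,
        )
        result[h2.text] = [str(title) for title in titles]
    return result


root = etree.fromstring(HTML)
result = departments_by_school(root)
print(f"Number of schools: {len(result)}")
for school, titles in result.items():
    print(f"{school}: {len(titles)}")
```

In Scrapy, the same relative expressions should work on the selectors returned by response.xpath('//h2[@class="schoolName"]'), whose xpath() method also accepts XPath variables as keyword arguments. Note that there the result of count() comes back from .get() as a string such as '0.0', so it needs float() before int().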
    
    

    【Comments】:
