Finding the right selector for pagination with Scrapy: answers

【Question title】: Finding the right selector for pagination with Scrapy
【Posted】: 2019-06-12 11:54:59
【Question description】:

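I am trying to extract data from this forum:

https://schwangerschaft.gofeminin.de/forum/all

I get the data from the first page. I use the CSS selector 'li.selected > a::attr(href)', but unfortunately I cannot get the data from any of the other pages.

What is the correct XPath or CSS selector for the pagination?

Python: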

import scrapy

class ForumSpider(scrapy.Spider):
    name = "pregnancy"

    def start_requests(self):
        url = 'https://schwangerschaft.gofeminin.de/forum/all'
        yield scrapy.Request(url, self.parse)


    def parse(self, response):
        for thread in response.css('div.af-thread-item'):
            yield{
                'threadTitle': thread.css('span.thread-title::text').extract_first(),
                'username': thread.css('div.user-name::text').extract_first()
            }
        next_page = response.css('li.selected > a::attr(href)').extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page))

HTML:

<nav class="af-pagination " role="navigation"><ul><li class="selected">
<a href="https://schwangerschaft.gofeminin.de/forum/all">1</a></li><li>
<a href="https://schwangerschaft.gofeminin.de/forum/all/p2">2</a></li><li>
<a href="https://schwangerschaft.gofeminin.de/forum/all/p3">3</a></li><li>
<a href="https://schwangerschaft.gofeminin.de/forum/all/p4">4</a></li><li>
<a href="https://schwangerschaft.gofeminin.de/forum/all/p5">5</a></li><li>
<a href="https://schwangerschaft.gofeminin.de/forum/all/p6">6</a></li><li>
<a href="https://schwangerschaft.gofeminin.de/forum/all/p7">7</a></li><li>
<a href="https://schwangerschaft.gofeminin.de/forum/all/p8">8</a></li><li>
...

Next page link: https://schwangerschaft.gofeminin.de/forum/all/p2

【Question discussion】:

Tags: python xpath scrapy css-selectors web-crawler

【Solution 1】:

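Given how this site's navigation bar is built, I prefer XPath for cases like this. The list item for the current page carries the class "selected", so I select that element and then use the following-sibling axis with index 1 to get the list item right after it.

In your case: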

response.xpath("//li[@class='selected']/following-sibling::li[1]/a/@href").extract_first()

This way, whichever page you are on, you dynamically select the "next" page.

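【Discussion】:

vezunchik's answer does the same thing and is even shorter. The reason I use this approach is that I have run into cases where the pagination/navigation bar has no "next" tag.

【Solution 2】: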

Try response.css('link[rel=next]::attr(href)').get(); it reads the href of the page's <link rel="next"> element, and it should work.

【Discussion】: