Problems loading web page lazily with Scrapy

【Question title】: Problems loading web page lazily with Scrapy
【Posted】: 2016-03-20 22:27:05
【Question description】:

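I want to scrape the articles on this page. However, the page items are loaded via Ajax as I scroll down the page. So far I have been trying to simulate the POST request that does this, but without success. Here is a snippet of code that illustrates my problem.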

import scrapy
class eroskiSpider(scrapy.Spider):
    name = "eroski"
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ['compraonline.grupoeroski.com']
    start_urls = [
        'https://www.compraonline.grupoeroski.com/es/'
    ]
    counter = 0
    def parse(self, response):
        # Walk the category links in the navigation menu
        for sel in response.xpath('//nav[@class="navmenu"]/ul/li/div/ul/li'):
            cat_title = sel.xpath('a/@title')[0].extract()
            href = sel.xpath('a/@href')[0].extract()
            url = response.urljoin(href)

            print('Parsing category ' + cat_title)
            yield scrapy.Request(url, callback=self.parse_cat, dont_filter=True)
            break  # only follow the first category

    def parse_cat(self, response):
        category = response.xpath('//head/title/text()').extract_first()
        counter = 0
        for sel in response.xpath('//article'):
            counter = counter + 1
            print('counter is ' + str(counter))

            description = sel.xpath('.//h2[contains(@class, "description_title")]/a/@title').extract_first()
            print(description)

        payload = {'pageNumber': '2', 't:zoneid': 'zoneScroll'}
        yield scrapy.FormRequest(url = response.url, formdata = payload, dont_filter=True)

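If you run the code, you can see that it loops forever over the same 20 items that appear when the page is first loaded. So my attempt to load more articles with FormRequest is not correct. Any ideas?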

【Comments】:

Which Scrapy version are you using?

Tags: python web-scraping scrapy

【Solution 1】:

Sorry. Silly question. I obviously forgot to use the callback.

yield scrapy.FormRequest(url = response.url, formdata = payload, dont_filter=True, callback = self.parse_cat)

Now we actually get the second page after the first one. Argh, how silly of me.
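
For anyone wondering why the missing callback produced an endless loop over the same 20 items: Scrapy hands the response of a request that has no callback to the spider's parse() method. In the original spider that sends the FormRequest response back to parse(), which requests the first category again, which lands in parse_cat() again, and so on. The sketch below is only a toy model of that routing in plain Python, not Scrapy's real code; the names Request, ToySpider, crawl and pass_callback are made up for illustration.

```python
class Request:
    """Toy stand-in for a request: a URL plus an optional callback."""

    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback


class ToySpider:
    """Mimics the shape of the spider in the question (hypothetical names)."""

    def __init__(self, pass_callback):
        self.pass_callback = pass_callback
        self.trace = []  # names of the callbacks that ran, in order

    def parse(self, url):
        self.trace.append('parse')
        # Like the original parse(): follow the first category page.
        yield Request('/category', callback=self.parse_cat)

    def parse_cat(self, url):
        self.trace.append('parse_cat')
        # Ask for more articles, with or without an explicit callback.
        callback = self.parse_cat if self.pass_callback else None
        yield Request('/category/more', callback=callback)


def crawl(spider, max_responses):
    """Handle at most max_responses responses and return the callback trace."""
    queue = [Request('/')]
    handled = 0
    while queue and handled < max_responses:
        request = queue.pop(0)
        # A request without a callback falls back to spider.parse.
        callback = request.callback or spider.parse
        queue.extend(callback(request.url))
        handled += 1
    return spider.trace


if __name__ == '__main__':
    print(crawl(ToySpider(pass_callback=False), 6))
    print(crawl(ToySpider(pass_callback=True), 6))
```

Without the callback the trace alternates between parse and parse_cat, which matches the first page being parsed over and over. With the callback, every later response stays in parse_cat. Note that the payload in the question always asks for pageNumber 2, so going beyond the second page would presumably require increasing that number on each request.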

【Discussion】:

Thanks for sharing the solution; that is the first important sign of intelligence ;)