【Question Title】: Can't get Scrapy to parse and follow 301, 302 redirects
【Posted】: 2025-11-27 04:35:02
【Question】:

I'm trying to write a very simple website crawler that lists URLs together with their referrer and status code, for the 200, 301, 302 and 404 HTTP status codes.

Scrapy turns out to work very well: my script crawls the site with it correctly and has no trouble listing URLs with 200 and 404 status codes.

The problem is that I can't figure out how to get Scrapy to both follow redirects and parse/output them. I can get one or the other to work, but not both at the same time.

What I have tried so far:

  • Setting meta={'dont_redirect': True} and setting REDIRECT_ENABLED = False (see the sketch after this list)

  • Adding 301 and 302 to handle_httpstatus_list

  • Changing the settings specified in the redirect middleware documentation

  • Reading the redirect middleware code for insight

  • Various combinations of all of the above

  • Other random stuff
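
For reference, the first two attempts from that list typically look something like the sketch below (a hypothetical spider, not a working solution; it only illustrates where those settings go):

import scrapy


class StatusSpider(scrapy.Spider):
    # hypothetical spider illustrating the attempts described above
    name = "status"
    # attempt: disable redirect handling globally
    custom_settings = {'REDIRECT_ENABLED': False}
    # attempt: let 301/302 responses reach the callback instead of being dropped
    handle_httpstatus_list = [301, 302]

    def start_requests(self):
        # per-request variant of disabling redirects
        yield scrapy.Request(
            'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
            meta={'dont_redirect': True},
            callback=self.parse,
        )

    def parse(self, response):
        # the 30x response arrives here, but nothing follows its Location header
        self.logger.info("%d %s", response.status, response.url)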

Here is the public repo, if you want to look at the code.

【Comments】:

  • I'm not sure Scrapy can handle 301 or other redirects on its own. I think the Scrapy workflow is: navigate to the URL, parse the HTML and let the user do the rest... Check this answer, it might help: *.com/questions/36124429/…
  • All the answers I've read say Scrapy supports 301, but every solution I've tried has failed. I tried the answer you found, but it doesn't work. I'm also running into a new problem with yield: I can't return the item to the parser, so it doesn't output anything. Other than that, Scrapy's behaviour looks the same. Branch with suggestion code

Tags: python scrapy


【Solution 1】:

If you want to parse 301 and 302 responses and follow them at the same time, ask your callback to handle 301 and 302 and mimic what RedirectMiddleware does.

Test 1 (not working)

Let's start with a simple spider for illustration (it does not yet work the way you expect):

import scrapy


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )
    def parse(self, response):
        self.logger.info("got response for %r" % response.url)

Now, the spider requests two pages, and the second one should redirect to http://example.com/

$ scrapy runspider test.py
2016-09-30 11:28:17 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:28:18 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:28:18 [scrapy] DEBUG: Redirecting (302) to <GET http://example.com/> from <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F>
2016-09-30 11:28:18 [handle] INFO: got response for 'https://httpbin.org/get'
2016-09-30 11:28:18 [scrapy] DEBUG: Crawled (200) <GET http://example.com/> (referer: None)
2016-09-30 11:28:18 [handle] INFO: got response for 'http://example.com/'
2016-09-30 11:28:18 [scrapy] INFO: Spider closed (finished)

The 302 is handled automatically by RedirectMiddleware and is not passed to your callback.
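
As a side note: if you only need to know that a redirect happened, and do not need to parse the 30x response itself, RedirectMiddleware records the intermediate URLs under the redirect_urls key of the request meta while it follows the redirect for you (at least in recent Scrapy versions). A minimal sketch of reading that key (the spider itself is hypothetical):

import scrapy


class RedirectChainSpider(scrapy.Spider):
    # hypothetical spider: let RedirectMiddleware follow redirects,
    # then inspect the chain it recorded in the request meta
    name = "redirect_chain"
    start_urls = (
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )

    def parse(self, response):
        # 'redirect_urls' lists the URLs that redirected to this response, if any
        for url in response.meta.get('redirect_urls', []):
            self.logger.info("redirected from %s", url)
        self.logger.info("final response %d for %s", response.status, response.url)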

Test 2 (still not quite right)

Let's configure the spider to handle 301 and 302 in its callback, using handle_httpstatus_list:

import scrapy


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )
    handle_httpstatus_list = [301, 302]
    def parse(self, response):
        self.logger.info("got response %d for %r" % (response.status, response.url))

Let's run it:

$ scrapy runspider test.py
2016-09-30 11:33:32 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:33:32 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:33:32 [scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F> (referer: None)
2016-09-30 11:33:33 [handle] INFO: got response 200 for 'https://httpbin.org/get'
2016-09-30 11:33:33 [handle] INFO: got response 302 for 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F'
2016-09-30 11:33:33 [scrapy] INFO: Spider closed (finished)

Here, the redirect is lost: the 302 response reaches the callback, but nothing schedules a request for the target URL.
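
If, for the report described in the question, you only need to record where a redirect points rather than follow it, the Location header of the 302 response is already enough at this stage. A minimal sketch of such a callback (Python 3; the yielded field names are only illustrative):

from urllib.parse import urljoin

import scrapy


class ReportSpider(scrapy.Spider):
    # hypothetical spider: report redirect targets without following them
    name = "report"
    handle_httpstatus_list = [301, 302]
    start_urls = (
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )

    def parse(self, response):
        location = response.headers.get('Location', b'').decode('latin1')
        yield {
            'url': response.url,
            'status': response.status,
            # absolute URL the server redirects to; None for non-redirect responses
            'redirect_to': urljoin(response.url, location) if location else None,
        }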

Test 3 (working)

Do the same as RedirectMiddleware, but inside the spider callback:

from six.moves.urllib.parse import urljoin

import scrapy
from scrapy.utils.python import to_native_str


class HandleSpider(scrapy.Spider):
    name = "handle"
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )
    handle_httpstatus_list = [301, 302]
    def parse(self, response):
        self.logger.info("got response %d for %r" % (response.status, response.url))

        # do something with the response here...

        # handle redirection
        # this is copied/adapted from RedirectMiddleware
        if response.status >= 300 and response.status < 400:

            # HTTP header is ascii or latin1, redirected url will be percent-encoded utf-8
            location = to_native_str(response.headers['location'].decode('latin1'))

            # get the original request
            request = response.request
            # and the URL we got redirected to
            redirected_url = urljoin(request.url, location)

            if response.status in (301, 307) or request.method == 'HEAD':
                redirected = request.replace(url=redirected_url)
                yield redirected
            else:
                redirected = request.replace(url=redirected_url, method='GET', body='')
                redirected.headers.pop('Content-Type', None)
                redirected.headers.pop('Content-Length', None)
                yield redirected

Then run the spider again:

$ scrapy runspider test.py
2016-09-30 11:45:20 [scrapy] INFO: Scrapy 1.1.3 started (bot: scrapybot)
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F> (referer: None)
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (200) <GET https://httpbin.org/get> (referer: None)
2016-09-30 11:45:21 [handle] INFO: got response 302 for 'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F'
2016-09-30 11:45:21 [handle] INFO: got response 200 for 'https://httpbin.org/get'
2016-09-30 11:45:21 [scrapy] DEBUG: Crawled (200) <GET http://example.com/> (referer: https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F)
2016-09-30 11:45:21 [handle] INFO: got response 200 for 'http://example.com/'
2016-09-30 11:45:21 [scrapy] INFO: Spider closed (finished)

We got redirected to http://example.com/, and we also got that response through our callback.
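
Since the original goal was to list each URL with its referrer and status code, the Test 3 callback can also yield an item for every response in addition to the redirected request. A minimal sketch of that combination (Python 3; the item keys and spider name are only illustrative, and the redirect handling is simplified compared to Test 3):

from urllib.parse import urljoin

import scrapy


class HandleAndReportSpider(scrapy.Spider):
    # hypothetical spider: Test 3's manual redirect handling plus item output
    name = "handle_and_report"
    handle_httpstatus_list = [301, 302]
    start_urls = (
        'https://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fexample.com%2F',
    )

    def parse(self, response):
        # one item per response: URL, referrer and status, as asked in the question
        referer = response.request.headers.get('Referer') or b''
        yield {
            'url': response.url,
            'referer': referer.decode('latin1') or None,
            'status': response.status,
        }

        # follow redirects by hand, as in Test 3
        # (simplified: no 302 POST-to-GET conversion)
        if 300 <= response.status < 400:
            location = response.headers['Location'].decode('latin1')
            yield response.request.replace(url=urljoin(response.url, location))

On Scrapy versions that provide it, response.follow(location) can build the follow-up request instead of request.replace; either way, keep 301 and 302 in handle_httpstatus_list so the intermediate responses still reach the callback and get reported.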

【Discussion】: