Scrapy, how to still get the content with status 302 (redirecting): answers

【Question title】: Scrapy, How to still get the content with status 302 (redirecting)
【Posted】: 2017-08-08 12:50:06
【Question description】:

Here is my simple spider code (I am just getting started):

def start_requests(self):
    urls = [
        'http://www.liputan6.com/search?q=bubarkan+hti&type=all',
    ]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)

def parse(self, response):
    page = response.url.split("/")[-2]
    filename = 'quotes-%s.html' % page
    with open(filename, 'wb') as f:
        f.write(response.body)
    self.log('Saved file %s' % filename)

In a browser I can open the url 'http://www.liputan6.com/search?q=bubarkan+hti&type=all' normally. But with this Scrapy spider I get a 302 response and cannot scrape the page. Why?

Could anyone tell me how to solve this?

【Question discussion】:

Tags: python search scrapy web-crawler

【Solution 1】:

The page seems to require some cookies; if it does not find them, it redirects to the index page.

I got it working by adding these cookies: js_enabled=true; is_cookie_active=true;:

$ scrapy shell "http://www.liputan6.com/search?q=bubarkan+hti&type=all"
# redirect happens
>[1]: response.url
<[1]: 'http://www.liputan6.com'
# add cookie to request:
>[2]: request.headers['Cookie'] = 'js_enabled=true; is_cookie_active=true;'
>[3]: fetch(request)
# redirect no longer happens
>[4]: response.url
<[4]: 'http://www.liputan6.com/search?q=bubarkan+hti&type=all'

Edit: for your code, try:

def start_requests(self):
    urls = [
        'http://www.liputan6.com/search?q=bubarkan+hti&type=all',
    ]
    for url in urls:
        req = scrapy.Request(url=url, callback=self.parse)
        # send the cookies the site checks for before it redirects
        req.headers['Cookie'] = 'js_enabled=true; is_cookie_active=true;'
        yield req

def parse(self, response):
    # 200 response here
    pass
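
A possible variation on the snippet above, offered as a sketch and not as a tested fix for this site: scrapy.Request also accepts a cookies argument (a dict), which hands the values to Scrapy's own cookie handling instead of writing the Cookie header by hand. The names parse_cookie_header and build_requests are made up for illustration; the cookie names and values come from the answer above.

```python
def parse_cookie_header(header):
    """Turn 'a=1; b=2;' into {'a': '1', 'b': '2'} (hypothetical helper)."""
    cookies = {}
    for part in header.split(';'):
        name, sep, value = part.strip().partition('=')
        if sep and name:
            cookies[name] = value
    return cookies


# Cookie values taken from the answer above.
COOKIES = parse_cookie_header('js_enabled=true; is_cookie_active=true;')


def build_requests(urls, callback):
    """Yield one request per URL with the cookies attached (hypothetical name)."""
    # Imported here so the helper above can be used without Scrapy installed.
    import scrapy

    for url in urls:
        yield scrapy.Request(url=url, cookies=COOKIES, callback=callback)
```

Inside the spider, start_requests could then simply return build_requests(urls, self.parse).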

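【Discussion】:

Hi @Granitosaurus, thanks for your suggestion. I run scrapy from a script, scrapy.readthedocs.io/en/latest/topics/… Where in it do I add the "cookie" code?
Hi, I have tried your suggestion, but I got 2017-08-08 07:15:58 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to liputan6.com> from liputan6.com/search?q=bubarkan+hti&type=all> 2017-08-08 07:15:59 [scrapy.core.engine] DEBUG: Crawled (200) liputan6.com> (referer: None). I still get the 302 redirect.
But for me, even the console does not allow passing any cache input. How can I do the same thing in the settings.py file?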
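
On the settings.py question, one approach that may work, stated as an assumption and not verified against this site: put the header in DEFAULT_REQUEST_HEADERS and turn off Scrapy's cookie middleware with COOKIES_ENABLED = False, so that the hand-written Cookie header is left alone. Both are standard Scrapy settings; the trade-off is that cookies set by the server are then no longer tracked between requests.

```python
# settings.py (sketch, not tested against liputan6.com)

# Disable the built-in cookie middleware so it does not manage the Cookie header.
COOKIES_ENABLED = False

# Headers added to every request that does not set them itself.
# Note: this replaces Scrapy's default Accept / Accept-Language entries
# unless they are repeated here.
DEFAULT_REQUEST_HEADERS = {
    'Cookie': 'js_enabled=true; is_cookie_active=true;',
}
```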