Setting a sticky cookie in Scrapy: answers

【Question title】: Setting sticky cookie in scrapy
【Posted】: 2012-08-24 19:18:02
【Question description】:

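The site I am scraping uses JavaScript that sets a cookie and then checks it on the backend to make sure JS is enabled. Extracting the cookie from the HTML is simple, but setting it in Scrapy seems to be a problem. My code is: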

import re

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request

class TestSpider(InitSpider):
    ...
    rules = (Rule(SgmlLinkExtractor(allow=(r'products/./index\.html', )), callback='parse_page'),)

    def init_request(self):
        return Request(url=self.init_url, callback=self.parse_js)

    def parse_js(self, response):
        # Pull the cookie name and value out of the page's setCookie('name', 'value', ...) call.
        match = re.search(r"setCookie\('(.+?)',\s*?'(.+?)',", response.body, re.M)
        if match:
            cookie = match.group(1)
            value = match.group(2)
        else:
            raise BaseException("Did not find the cookie", response.body)
        # Send the extracted cookie along with the request for the test page.
        return Request(url=self.test_page, callback=self.check_test_page, cookies={cookie: value})

    def check_test_page(self, response):
        if 'Welcome' in response.body:
            self.initialized()

    def parse_page(self, response):
        pass  # scraping....
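
The cookie-extraction step in parse_js can be tried in isolation, outside Scrapy. The following is a minimal sketch that uses only the standard library; the sample page fragment and the cookie name and value in it are made up for illustration, and the function expects text (str) rather than bytes.

```python
import re

# Same pattern as in parse_js, written as a raw string.
SET_COOKIE_RE = re.compile(r"setCookie\('(.+?)',\s*?'(.+?)',")


def extract_cookie(body):
    """Return {name: value} for the first setCookie(...) call, or None."""
    match = SET_COOKIE_RE.search(body)
    if match is None:
        return None
    return {match.group(1): match.group(2)}


# Hypothetical page fragment, invented for this example.
sample_body = "<script>setCookie('js_enabled', 'abc123', 30);</script>"
cookies = extract_cookie(sample_body)
```

The returned dict has the same shape as the one passed to Request(cookies=...) in the code above.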

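I can see that the content is available in check_test_page, so the cookie works fine. But it never even reaches parse_page, because a CrawlSpider without the right cookie does not see any links. Is there a way to set a cookie for the duration of the scraping session? Or do I have to use BaseSpider and add the cookie to every request manually?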

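A less desirable alternative would be to set the cookie somehow through the Scrapy configuration files (the value never seems to change). Is that possible?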
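
One possible shape for the configuration-based idea is a small downloader middleware that adds a fixed cookie to every outgoing request. This is only a sketch and not something confirmed in this thread: the class name, the module path in the settings comment, and the cookie name and value are all hypothetical, and it assumes the middleware is ordered so that it runs before Scrapy's built-in cookies middleware. A stand-in request object is used so that the logic can be run without Scrapy installed.

```python
class StickyCookieMiddleware:
    """Hypothetical downloader middleware: adds fixed cookies to every request."""

    # Assumed cookie name and value; in a real project they would come from
    # settings or from the value extracted during initialization.
    sticky_cookies = {"js_enabled": "abc123"}

    def process_request(self, request, spider):
        # Request.cookies may be a dict or a list of dicts; this sketch only
        # handles the dict form and leaves other forms untouched.
        if isinstance(request.cookies, dict):
            for name, value in self.sticky_cookies.items():
                # Do not overwrite a cookie the spider set explicitly.
                request.cookies.setdefault(name, value)
        return None  # let Scrapy continue processing the request


class FakeRequest:
    """Stand-in for scrapy.http.Request, exposing only the cookies attribute."""

    def __init__(self, cookies=None):
        self.cookies = cookies or {}


# settings.py fragment (module path is hypothetical). The number is chosen
# below 700 on the assumption that the built-in CookiesMiddleware sits at 700:
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.StickyCookieMiddleware": 650,
# }

plain_request = FakeRequest()
StickyCookieMiddleware().process_request(plain_request, spider=None)

custom_request = FakeRequest({"js_enabled": "other"})
StickyCookieMiddleware().process_request(custom_request, spider=None)
```

Using setdefault keeps any cookie that a spider callback has already set on a request, so the middleware only fills in the value when it is missing.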

【Comments】:

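Scrapy passes all cookies by default: doc.scrapy.org/en/latest/…
Those are cookies set by the server. As far as I can tell, there is no way to add a permanent cookie from the client side (Scrapy). It has to be done for each request separately.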
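
The per-request approach mentioned in the last comment could look roughly like the sketch below. It uses plain dicts in place of Scrapy Request objects so that it runs on its own; the helper name, the sample HTML, and the cookie name and value are assumptions made for illustration. The link pattern is the one from the rules in the question.

```python
import re

# Link pattern taken from the question's rule, wrapped in an href match.
PRODUCT_LINK_RE = re.compile(r'href="([^"]*products/./index\.html)"')


def build_request_kwargs(html, cookies):
    """Yield keyword arguments for one request per matching product link."""
    for url in PRODUCT_LINK_RE.findall(html):
        # Copy the dict so each request carries its own cookie mapping.
        yield {"url": url, "cookies": dict(cookies)}


# Hypothetical page content and cookie, invented for this example.
sample_html = (
    '<a href="/products/a/index.html">A</a> '
    '<a href="/about.html">About</a>'
)
request_kwargs = list(build_request_kwargs(sample_html, {"js_enabled": "abc123"}))
```

In a spider callback, each dict would be turned into a request, for example Request(callback=self.parse_page, **kwargs), after the relative URL has been made absolute.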

Tags: python cookies scrapy

【Solution 1】:

I have not used InitSpider before.

Looking at the code in scrapy.contrib.spiders.init.InitSpider, I see:

def initialized(self, response=None):
    """This method must be set as the callback of your last initialization
    request. See self.init_request() docstring for more info.
    """
    self._init_complete = True
    reqs = self._postinit_reqs[:]
    del self._postinit_reqs
    return reqs

def init_request(self):
    """This function should return one initialization request, with the
    self.initialized method as callback. When the self.initialized method
    is called this spider is considered initialized. If you need to perform
    several requests for initializing your spider, you can do so by using
    different callbacks. The only requirement is that the final callback
    (of the last initialization request) must be self.initialized. 

    The default implementation calls self.initialized immediately, and
    means that no initialization is needed. This method should be
    overridden only when you need to perform requests to initialize your
    spider
    """
    return self.initialized()

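You wrote:

I can see that the content is available in check_test_page, so the cookie works fine. But it never even reaches parse_page, because a CrawlSpider without the right cookie does not see any links.

I think parse_page is never called because you did not make any request with self.initialized as the callback.

I think this should work: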

def check_test_page(self, response):
    if 'Welcome' in response.body:
        return self.initialized()
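
The effect of the missing return can be reproduced without Scrapy. The sketch below uses a toy class (MiniInitSpider, a hypothetical name) that copies only the bookkeeping from the initialized() source quoted above; plain strings stand in for responses and requests. It is an illustration of the reasoning, not Scrapy's actual scheduling code.

```python
class MiniInitSpider:
    """Toy stand-in for InitSpider's post-init bookkeeping; not Scrapy code."""

    def __init__(self, start_urls):
        # Requests held back until initialization has finished.
        self._postinit_reqs = list(start_urls)
        self._init_complete = False

    def initialized(self, response=None):
        # Mirrors the quoted source: hand back the deferred requests.
        self._init_complete = True
        reqs = self._postinit_reqs[:]
        del self._postinit_reqs
        return reqs

    def check_dropping(self, response):
        # Version from the question: the returned requests are discarded.
        if "Welcome" in response:
            self.initialized()

    def check_returning(self, response):
        # Version from this answer: the requests reach the caller.
        if "Welcome" in response:
            return self.initialized()


# Hypothetical start URL, used only to have something to schedule.
broken = MiniInitSpider(["http://example.com/products/a/index.html"])
scheduled_broken = broken.check_dropping("Welcome back")

fixed = MiniInitSpider(["http://example.com/products/a/index.html"])
scheduled_fixed = fixed.check_returning("Welcome back")
```

In both cases the spider is marked as initialized, but only the returning version gives the caller anything to schedule, which matches the symptom of parse_page never being reached.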

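【Comments】:

You are right, I looked at the source code myself. But it turns out that InitSpider is a BaseSpider anyway. So it looks like 1) there is no way to use CrawlSpider in this situation, and 2) there is no way to set a sticky cookie.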

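【Solution 2】:

It turns out that InitSpider is a BaseSpider. So it looks like 1) there is no way to use CrawlSpider in this situation, and 2) there is no way to set a sticky cookie.

【Comments】: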