Python scrapy SitemapSpider callbacks not being called: answers

【Question title】: Python scrapy SitemapSpider callbacks not being called
【Posted】: 2016-11-09 01:59:50
【Question】:

I have read the documentation on the SitemapSpider class here: https://scrapy.readthedocs.io/en/latest/topics/spiders.html#sitemapspider

Here is my code:

import scrapy
from scrapy import Request


class CurrentHarvestSpider(scrapy.spiders.SitemapSpider):
    name = "newegg"
    allowed_domains = ["newegg.com"]
    sitemap_urls = ['http://www.newegg.com/Siteindex_USA.xml']
    # if I comment this out, then the parse function should be called by default for every link, but it doesn't
    sitemap_rules = [('/Product', 'parse_product_url'), ('product','parse_product_url')]
    sitemap_follow = ['/newegg_sitemap_product', '/Product']

    def parse(self, response):
        with open("/home/dan/debug/newegg_crawler.log", "a") as log:
            log.write("logging from parse " + response.url)
        self.this_function_does_not_exist()
        yield Request(response.url, callback=self.some_callback)

    def some_callback(self, response):
        with open("/home/dan/debug/newegg_crawler.log", "a") as log:
            log.write("logging from some_callback " + response.url)
        self.this_function_does_not_exist()

    def parse_product_url(self, response):
        with open("/home/dan/debug/newegg_crawler.log", "a") as log:
            log.write("logging from parse_product_url" + response.url)
        self.this_function_does_not_exist()

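This runs successfully with scrapy installed.
Run pip install scrapy to get scrapy, then execute scrapy crawl newegg from the working directory.

My question is: why aren't any of these callbacks being called? The documentation claims that the callbacks defined in sitemap_rules should be called. If I comment that out, then parse() should be called by default, but it still isn't. Is the documentation just 100% wrong? I'm checking the log file that I set up, and nothing is being written to it. I've even set the file's permissions to 777. I'm also calling a function that does not exist, which should raise an error if any of these callbacks ran, but no error occurs. What am I doing wrong?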

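【Comments】:

The loc entries in the newegg sitemap look like they point to gzip-compressed .gz files. Can you update your question with the command-line logs?
FYI, I opened an issue on scrapy: github.com/scrapy/scrapy/issues/2389

Tags: python xml web-scraping scrapy web-crawler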

【Answer 1】:

When I run your spider, this is what I get on the console:

$ scrapy runspider op.py 
2016-11-09 21:34:51 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrapybot)
(...)
2016-11-09 21:34:51 [scrapy] INFO: Spider opened
2016-11-09 21:34:51 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-11-09 21:34:51 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-11-09 21:34:51 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Siteindex_USA.xml> (referer: None)
2016-11-09 21:34:53 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_product01.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
2016-11-09 21:34:53 [scrapy] ERROR: Spider error processing <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_product01.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
Traceback (most recent call last):
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spiders/sitemap.py", line 44, in _parse_sitemap
    s = Sitemap(body)
  File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/utils/sitemap.py", line 17, in __init__
    rt = self._root.tag
AttributeError: 'NoneType' object has no attribute 'tag'

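You may have noticed the AttributeError exception. Scrapy is saying that it had trouble parsing the sitemap response body.

If scrapy cannot make sense of the sitemap content, it cannot parse it as XML, so it cannot follow any <loc> URLs, and therefore it never calls any callback, because it found nothing to request.

So you have actually found a bug in scrapy (thanks for reporting it): https://github.com/scrapy/scrapy/issues/2389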
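
To illustrate the chain of reasoning above, here is a minimal sketch that uses only the Python 3 standard library, not Scrapy's own Sitemap class. The helper name extract_locs is hypothetical, and the gzip bytes are copied from the debug output further down. It shows that a body which is still gzip data yields no <loc> URLs at all, so there is nothing to schedule and no callback to call. Note that the failure mode differs from Scrapy's: xml.etree raises a ParseError here, whereas the traceback above ends in an AttributeError.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (it appears in the logs below).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_locs(body):
    """Return the <loc> URLs in a sitemap body, or [] if it is not XML.

    Hypothetical helper for illustration only; not part of Scrapy.
    """
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # The body is not well-formed XML (for example, it is still
        # gzip-compressed), so there are no URLs to follow.
        return []
    return [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]


xml_body = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<url><loc>http://www.newegg.com/Product/Product.aspx?Item=9SIA04Y0766512</loc></url>"
    b"</urlset>"
)
# First bytes of a body that is still gzip data after one decompression.
still_gzipped = b"\x1f\x8b\x08\x08\xda\xef\x1eX\x00\x0bnewegg_sitemap_store01"

print(extract_locs(xml_body))       # one product URL
print(extract_locs(still_gzipped))  # empty list: nothing to crawl
```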

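As for the bug itself:

The various sub-sitemaps, e.g. http://www.newegg.com/Sitemap/USA/newegg_sitemap_store01.xml.gz, are .gz files that are additionally gzip-compressed "on the wire" (they are gzipped twice, so the HTTP response needs to be gunzipped twice) before they can be parsed correctly as XML.

Scrapy does not handle this case, hence the exception that is printed.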
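
The double compression can be reproduced in isolation. The following is a small sketch under stated assumptions: it uses the standard library's gzip module on Python 3 rather than scrapy.utils.gz, and the names gunzip_fully and max_layers are made up for this example. It relies on the fact that gzip data starts with the two magic bytes \x1f\x8b, which is also what the body[:32] debug lines below show after the first decompression.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def gunzip_fully(body, max_layers=5):
    """Strip gzip layers until the body no longer looks like gzip data.

    Hypothetical helper: returns the decompressed bytes and the number
    of layers removed. max_layers guards against pathological input.
    """
    layers = 0
    while body[:2] == GZIP_MAGIC and layers < max_layers:
        body = gzip.decompress(body)
        layers += 1
    return body, layers


xml = b'<?xml version="1.0" encoding="utf-8"?><urlset></urlset>'

# Simulate a .gz file that is gzip-compressed again for transfer.
on_the_wire = gzip.compress(gzip.compress(xml))

# One decompression is not enough: the result is still gzip data.
once = gzip.decompress(on_the_wire)
print(once[:2] == GZIP_MAGIC)

plain, layers = gunzip_fully(on_the_wire)
print(layers, plain == xml)
```

Checking the magic bytes is an alternative to attempting a decompression and catching the failure; either approach distinguishes a body that is already XML from one that needs another pass.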

Here is a basic sitemap spider that tries to gunzip the response a second time:

from scrapy.utils.gz import gunzip
import scrapy


class CurrentHarvestSpider(scrapy.spiders.SitemapSpider):
    name = "newegg"
    allowed_domains = ["newegg.com"]
    sitemap_urls = ['http://www.newegg.com/Siteindex_USA.xml']

    def parse(self, response):
        self.logger.info('parsing %r' % response.url)

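    def _get_sitemap_body(self, response):
        # Let SitemapSpider extract the body first (this may already gunzip once).
        body = super(CurrentHarvestSpider, self)._get_sitemap_body(response)
        self.logger.debug("body[:32]: %r" % body[:32])
        try:
            # If the body is still gzip data, decompress it a second time.
            body_unzipped_again = gunzip(body)
            self.logger.debug("body_unzipped_again[:32]: %r" % body_unzipped_again[:100])
            return body_unzipped_again
        except Exception:
            # Not gzip data (e.g. already plain XML): use the body as it is.
            pass
        return body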

These logs show that newegg's .xml.gz sitemaps do indeed need to be gunzipped twice:

$ scrapy runspider spider.py 
2016-11-09 13:10:56 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrapybot)
(...)
2016-11-09 13:10:56 [scrapy] INFO: Spider opened
2016-11-09 13:10:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-11-09 13:10:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-11-09 13:10:57 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Siteindex_USA.xml> (referer: None)
2016-11-09 13:10:57 [newegg] DEBUG: body[:32]: '\xef\xbb\xbf<?xml version="1.0" encoding='
2016-11-09 13:10:57 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_store01.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
2016-11-09 13:10:57 [newegg] DEBUG: body[:32]: '\x1f\x8b\x08\x08\xda\xef\x1eX\x00\x0bnewegg_sitemap_store01'
2016-11-09 13:10:57 [newegg] DEBUG: body_unzipped_again[:32]: '\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
2016-11-09 13:10:57 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.newegg.com/Hubs/SubCategory/ID-26> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-11-09 13:10:59 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_product15.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
2016-11-09 13:10:59 [newegg] DEBUG: body[:32]: '\x1f\x8b\x08\x08\xe3\xfa\x1eX\x00\x0bnewegg_sitemap_product'
2016-11-09 13:10:59 [newegg] DEBUG: body_unzipped_again[:32]: '\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
(...)
2016-11-09 13:11:02 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Product/Product.aspx?Item=9SIA04Y0766512> (referer: http://www.newegg.com/Sitemap/USA/newegg_sitemap_product15.xml.gz)
(...)
2016-11-09 13:11:02 [newegg] INFO: parsing 'http://www.newegg.com/Product/Product.aspx?Item=9SIA04Y0766512'
(...)

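【Discussion】:

Interesting. But what does gunzipping twice have to do with the functions not being called? Does it fail somewhere? Where exactly? Shouldn't it still call the callbacks, or do something? I don't get it; is it failing silently? That's ironic, because I find it too loud, which is why I was trying to log debug output manually. I can't deal with sifting and filtering through huge dumps of unnecessary crap printed to stdout. It makes my eyes bleed. I'm also not going to bother with that poorly documented and convoluted logger module. I tried, and it took too long. Scrapy needs some work.
See my updated answer. For me, when I run your spider, scrapy is quite vocal about there being an error. I don't know what you mean by "I can't deal with sifting and filtering through huge dumps of unnecessary crap printed to stdout." If you mean the cluttered logs, that "crap" actually tells you there is a problem. If you mean something else, you don't need to be derogatory to make your point. "Scrapy needs some work": that's for sure, and it needs the user community to throw their different use cases at it so that the code can improve. Thanks for the report. You can also help fix the bug if you want.
This is really more of a problem with me than a problem with scrapy. You have an excerpt there, and I can read that. But it is small compared to the rest of the cluttered output. For me it's like finding a needle in a haystack: I frequently reread lines, lose my place, and get overwhelmed by all of them. I'd be lucky to find in my console output what you found. My complaint is that usually, when your program hits a runtime error, it stops executing and returns the error, right? Well, scrapy doesn't. It throws a needle into a haystack and moves on.
I take it all back. I wasn't being very thorough. I had a slightly modified version of the code which, yesterday, was neither throwing errors nor calling the callbacks (possibly for the reasons you outlined in your answer). I tweaked the code for posting here but failed to test whether I could reproduce the issue, and the tweaked code now throws errors that I can read just fine. Sorry, scrapy, I didn't mean to bash you.
Oh, and thanks for the help and for expanding the answer. It turned out to be very helpful, and things make a lot more sense now.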