当我运行你的蜘蛛时,这是我在控制台上得到的:
$ scrapy runspider op.py
2016-11-09 21:34:51 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrapybot)
(...)
2016-11-09 21:34:51 [scrapy] INFO: Spider opened
2016-11-09 21:34:51 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-11-09 21:34:51 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-11-09 21:34:51 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Siteindex_USA.xml> (referer: None)
2016-11-09 21:34:53 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_product01.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
2016-11-09 21:34:53 [scrapy] ERROR: Spider error processing <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_product01.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
Traceback (most recent call last):
File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/spiders/sitemap.py", line 44, in _parse_sitemap
s = Sitemap(body)
File "/home/paul/.virtualenvs/scrapy12/local/lib/python2.7/site-packages/scrapy/utils/sitemap.py", line 17, in __init__
rt = self._root.tag
AttributeError: 'NoneType' object has no attribute 'tag'
您可能已经注意到AttributeError 异常。
所以 scrapy 说它在解析站点地图响应正文时遇到了问题。
如果scrapy无法理解站点地图的内容,它就无法将内容解析为XML,因此无法跟踪任何<loc> URL,因此不会调用任何回调,因为它什么也没找到。
所以您实际上在 scrapy 中发现了一个错误(感谢报告):https://github.com/scrapy/scrapy/issues/2389
至于bug本身,
不同的子站点地图,例如http://www.newegg.com/Sitemap/USA/newegg_sitemap_store01.xml.gz,作为 gzip 压缩的 .gz 文件“在线”发送(gzip 压缩两次 - 因此 HTTP 响应需要压缩两次)以正确解析为 XML。
Scrapy 不处理这种情况,因此会打印出异常。
这是一个基本的站点地图蜘蛛,它尝试对响应进行双重压缩:
from scrapy.utils.gz import gunzip
import scrapy
class CurrentHarvestSpider(scrapy.spiders.SitemapSpider):
name = "newegg"
allowed_domains = ["newegg.com"]
sitemap_urls = ['http://www.newegg.com/Siteindex_USA.xml']
def parse(self, response):
self.logger.info('parsing %r' % response.url)
def _get_sitemap_body(self, response):
body = super(CurrentHarvestSpider, self)._get_sitemap_body(response)
self.logger.debug("body[:32]: %r" % body[:32])
try:
body_unzipped_again = gunzip(body)
self.logger.debug("body_unzipped_again[:32]: %r" % body_unzipped_again[:100])
return body_unzipped_again
except:
pass
return body
这些日志显示 newegg 的 .xml.gz 站点地图确实需要两次压缩:
$ scrapy runspider spider.py
2016-11-09 13:10:56 [scrapy] INFO: Scrapy 1.2.1 started (bot: scrapybot)
(...)
2016-11-09 13:10:56 [scrapy] INFO: Spider opened
2016-11-09 13:10:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-11-09 13:10:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-11-09 13:10:57 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Siteindex_USA.xml> (referer: None)
2016-11-09 13:10:57 [newegg] DEBUG: body[:32]: '\xef\xbb\xbf<?xml version="1.0" encoding='
2016-11-09 13:10:57 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_store01.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
2016-11-09 13:10:57 [newegg] DEBUG: body[:32]: '\x1f\x8b\x08\x08\xda\xef\x1eX\x00\x0bnewegg_sitemap_store01'
2016-11-09 13:10:57 [newegg] DEBUG: body_unzipped_again[:32]: '\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
2016-11-09 13:10:57 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.newegg.com/Hubs/SubCategory/ID-26> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-11-09 13:10:59 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Sitemap/USA/newegg_sitemap_product15.xml.gz> (referer: http://www.newegg.com/Siteindex_USA.xml)
2016-11-09 13:10:59 [newegg] DEBUG: body[:32]: '\x1f\x8b\x08\x08\xe3\xfa\x1eX\x00\x0bnewegg_sitemap_product'
2016-11-09 13:10:59 [newegg] DEBUG: body_unzipped_again[:32]: '\xef\xbb\xbf<?xml version="1.0" encoding="utf-8"?><urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
(...)
2016-11-09 13:11:02 [scrapy] DEBUG: Crawled (200) <GET http://www.newegg.com/Product/Product.aspx?Item=9SIA04Y0766512> (referer: http://www.newegg.com/Sitemap/USA/newegg_sitemap_product15.xml.gz)
(...)
2016-11-09 13:11:02 [newegg] INFO: parsing 'http://www.newegg.com/Product/Product.aspx?Item=9SIA04Y0766512'
(...)