【Posted】: 2013-12-30 06:43:59
【Question】:
I use some proxies to crawl a few websites. This is what I have in settings.py:
# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]
DOWNLOAD_DELAY = 3  # 3 seconds (3,000 ms) of delay
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'myspider.comm.rotate_useragent.RotateUserAgentMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 200,
    'myspider.comm.random_proxy.RandomProxyMiddleware': 300,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 400,
}
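For reference, Scrapy calls process_request in increasing order of the numbers in DOWNLOADER_MIDDLEWARES, and process_response and process_exception in decreasing order. The sketch below only sorts the entries listed above; it ignores the built-in middlewares that Scrapy merges in by default, and call_order is a hypothetical helper name used for illustration.

```python
# Entries copied from the settings above (Scrapy's built-in defaults are omitted).
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'myspider.comm.rotate_useragent.RotateUserAgentMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 200,
    'myspider.comm.random_proxy.RandomProxyMiddleware': 300,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 400,
}


def call_order(middlewares, reverse=False):
    """Return short class names of the enabled middlewares, sorted by priority."""
    # A value of None disables the middleware, so it is skipped.
    enabled = [(prio, path) for path, prio in middlewares.items() if prio is not None]
    enabled.sort(reverse=reverse)
    return [path.rsplit('.', 1)[-1] for _, path in enabled]


# process_request runs from the lowest number to the highest.
request_order = call_order(DOWNLOADER_MIDDLEWARES)
# process_response and process_exception run from the highest number to the lowest.
exception_order = call_order(DOWNLOADER_MIDDLEWARES, reverse=True)

print(request_order)
print(exception_order)
```

With these numbers, RandomProxyMiddleware (300) sees an exception before RetryMiddleware (200) does, which may matter when the proxy middleware returns a request from process_exception.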
I also have a proxy downloader middleware with the following methods:
def process_request(self, request, spider):
    log('Requesting url %s with proxy %s...' % (request.url, proxy))

def process_response(self, request, response, spider):
    log('Response received from request url %s with proxy %s' % (request.url, proxy if proxy else 'nil'))

def process_exception(self, request, exception, spider):
    log_msg('Failed to request url %s with proxy %s with exception %s' % (request.url, proxy if proxy else 'nil', str(exception)))
    # Retry again.
    return request
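Because the proxies are sometimes unstable, process_exception often logs many request-failure messages. The problem is that the failed requests are never tried again.
As shown above, I set RETRY_TIMES and RETRY_HTTP_CODES, and I also return the request for a retry from the proxy middleware's process_exception method.
Why doesn't Scrapy retry the failed requests, and how can I make sure each request is attempted at least RETRY_TIMES times, as configured in settings.py?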
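For the second half of the question, here is a minimal sketch of what a bounded retry inside a custom process_exception could look like. It is loosely modeled on Scrapy's built-in RetryMiddleware, which counts attempts in request.meta['retry_times'] and marks the retried copy with dont_filter=True so that the scheduler's duplicate filter does not drop it. FakeRequest and retry_or_give_up are hypothetical names used only for illustration; this is not the middleware from the question, and whether the duplicate filter is the actual cause here is an assumption.

```python
from dataclasses import dataclass, field, replace

RETRY_TIMES = 10  # same value as in settings.py above


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request, just enough for this sketch."""
    url: str
    meta: dict = field(default_factory=dict)
    dont_filter: bool = False


def retry_or_give_up(request, max_retries=RETRY_TIMES):
    """Return a retry copy of `request`, or None once the retry budget is spent."""
    retries = request.meta.get('retry_times', 0) + 1
    if retries > max_retries:
        return None  # give up and let the failure propagate
    # Build a new meta dict so the original request is not mutated.
    new_meta = dict(request.meta, retry_times=retries)
    # dont_filter=True marks the copy as exempt from duplicate filtering.
    return replace(request, meta=new_meta, dont_filter=True)


if __name__ == '__main__':
    req = FakeRequest('http://example.com/')
    first_retry = retry_or_give_up(req)
    print(first_retry.meta, first_retry.dont_filter)
```

In a real middleware, process_exception would return the result of such a helper rather than the unchanged request, and the copy would presumably be built with the Request object's own copy() or replace() method instead of dataclasses.replace.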
【Discussion】:
Tags: python web-scraping screen-scraping scrapy