【Posted】: 2020-03-26 23:24:47
【Question】:
I have written my own Scrapy downloader middleware that simply checks whether request.url already exists in the db and, if it does, raises IgnoreRequest.
def process_request(self, request, spider):
    # Called for each request that goes through the downloader
    # middleware.

    # Must either:
    # - return None: continue processing this request
    # - or return a Response object
    # - or return a Request object
    # - or raise IgnoreRequest: process_exception() methods of
    #   installed downloader middleware will be called
    sql = """SELECT url FROM domain_sold WHERE url = %s;"""
    try:
        cursor = spider.db_connection.cursor()
        cursor.execute(sql, (request.url,))
        is_seen = cursor.fetchone()
        cursor.close()
        if is_seen:
            raise IgnoreRequest('duplicate url {}'.format(request.url))
    except (Exception, psycopg2.DatabaseError) as error:
        self.logger.error(error)
    return None
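When IgnoreRequest is raised, I expect the spider to drop that request and move on to the next one. In my case, however, the spider still crawls the request and the resulting item goes through my custom item pipeline.
My current downloader middleware setting is as follows: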
'DOWNLOADER_MIDDLEWARES': {'realestate.middlewares.RealestateDownloaderMiddleware': 99}
Can anyone suggest why this is happening? Thanks.
【Discussion】:
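One likely cause, judging only from the code posted: `IgnoreRequest` is raised inside the `try` block, and the `except (Exception, ...)` clause catches every `Exception` subclass, `IgnoreRequest` included. The exception would then be logged and the method would fall through to `return None`, which tells Scrapy to keep processing the request. The sketch below reproduces that control flow without Scrapy or a database. The two exception classes are stand-ins for `scrapy.exceptions.IgnoreRequest` and `psycopg2.DatabaseError`, and `lookup`, `process_request_original` and `process_request_adjusted` are hypothetical names used only for illustration.

```python
# Stand-ins so the sketch runs without Scrapy or psycopg2 installed.
# Assumption: like scrapy.exceptions.IgnoreRequest and
# psycopg2.DatabaseError, both derive from Exception.
class IgnoreRequest(Exception):
    pass


class DatabaseError(Exception):
    pass


def lookup(url, seen_urls):
    """Hypothetical stand-in for the SELECT query: returns a row or None."""
    return (url,) if url in seen_urls else None


def process_request_original(url, seen_urls):
    """Same control flow as the middleware in the question."""
    try:
        is_seen = lookup(url, seen_urls)
        if is_seen:
            raise IgnoreRequest('duplicate url {}'.format(url))
    except (Exception, DatabaseError) as error:
        # IgnoreRequest is an Exception, so it is caught and only logged.
        print('logged:', error)
    return None  # the request continues to the downloader


def process_request_adjusted(url, seen_urls):
    """Catch only database errors; raise IgnoreRequest outside the try."""
    try:
        is_seen = lookup(url, seen_urls)
    except DatabaseError as error:
        print('logged:', error)
        return None
    if is_seen:
        raise IgnoreRequest('duplicate url {}'.format(url))
    return None
```

Applied to the real middleware, the same adjustment would mean narrowing the `except` clause to `psycopg2.DatabaseError` and moving the `if is_seen: raise IgnoreRequest(...)` check after the `try` block, so the exception can propagate to Scrapy. Whether this fully explains the behaviour depends on the rest of the project, which is not shown here.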