[Posted at]: 2014-12-25 19:25:36
[Problem description]:
Using Scrapy, I want to first collect URLs from some pages, then parse each URL found and yield Items.
For example, the code looks like this:
def parse(self, response):
    # collect urls first
    urls = self.collect_urls(response)
    # parse urls found
    for url in urls:
        self.parse_url(url)  # will yield Item inside

def collect_urls(self, response):
    urls = response.meta.get('urls')
    if urls is None:
        urls = set()
    # do some logic of collecting urls from response into the urls set
    # ...
    if is_still_has_data(response):
        # continue collecting urls on another page
        yield scrapy.FormRequest(response.url, formdata={'dummy': 'dummy1'},
                                 meta={'urls': urls}, callback=self.collect_urls)
    else:
        return urls  # error here
The problem is that I cannot return an object inside a function that contains yield.
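For reference, the error is a Python 2 restriction (the Python that Scrapy ran on at the time): a return with a value is not allowed inside a generator. A minimal sketch, independent of Scrapy:

def gen():
    yield 1
    return 2  # Python 2: SyntaxError: 'return' with argument inside generator

# In Python 3.3+ this compiles, but the returned value only becomes
# StopIteration.value; a caller iterating the generator never sees it,
# so it would not help here either.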
Then I made urls a class attribute/member, like this:
urls = set()

def parse(self, response):
    # collect urls first
    yield self.collect_urls(response)
    # parse urls found
    for url in self.urls:
        self.parse_url(url)  # will yield Item inside

def collect_urls(self, response):
    # do some logic of collecting urls from response into self.urls
    # ...
    if is_still_has_data(response):
        # continue collecting urls on another page
        return scrapy.FormRequest(response.url, formdata={'dummy': 'dummy1'},
                                  callback=self.collect_urls)
The problem with this code: after yield self.collect_urls(response), execution continues straight to the for url in self.urls: part instead of waiting for collect_urls to finish. If I remove the yield, collect_urls is called only once and the callback in the FormRequest never fires; it seems the callback only works when the FormRequest is actually yielded.
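This behavior follows from how Python generators work: calling collect_urls(response) only creates a generator object, and none of its body runs until something iterates it; Scrapy's engine, in turn, only acts on Request and Item objects that the spider itself yields. A minimal sketch of the generator part, independent of Scrapy:

def collect_urls():
    print('collecting...')  # does not run at call time
    yield 'http://example.com/page1'

g = collect_urls()  # returns a generator object immediately; body has not run
print(g)            # <generator object collect_urls at 0x...>
for url in g:       # only now does the body execute
    print(url)      # prints 'collecting...' and then the url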
I know I could move the loop over the collected urls into the collect_urls function (one possible shape is sketched below), but I want to know whether the code pattern I want is achievable in Scrapy.
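For comparison, a sketch of that workaround: move the loop into collect_urls, keep passing the growing set from page to page via response.meta, and only iterate it in the callback that handles the last page. The helpers is_still_has_data and parse_url are taken from the question and assumed to exist; this is one possible shape, not the only one:

def parse(self, response):
    # Delegate to collect_urls and re-yield whatever it produces,
    # so Scrapy sees every Request and Item.
    for request_or_item in self.collect_urls(response):
        yield request_or_item

def collect_urls(self, response):
    urls = response.meta.get('urls', set())
    # ... collect urls from this response into the set ...
    if is_still_has_data(response):
        # more pages: carry the set along in meta
        yield scrapy.FormRequest(response.url, formdata={'dummy': 'dummy1'},
                                 meta={'urls': urls}, callback=self.collect_urls)
    else:
        # last page: now parse every collected url
        for url in urls:
            for item in self.parse_url(url):  # parse_url yields Items
                yield item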
[Comments]:
Tags: python web-scraping scrapy