【Posted】: 2018-02-15 11:21:03
【Problem Description】:
The Scrapy documentation actually explains how to chain two spiders like this:
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run()  # the script will block here until the last crawl call is finished
But in my use case, MySpider2 needs the information retrieved by MySpider1 after it has been transformed by transformFunction().
So I want something like this:
def transformFunction():
    # ... transform the data retrieved by MySpider1 ...
    return newdata

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    newdata = transformFunction()
    yield runner.crawl(MySpider2, data=newdata)
    reactor.stop()
What I want to arrange:

- MySpider1 starts, writes data to disk, then exits
- transformFunction() transforms data into newdata
- MySpider2 starts using newdata
So how can I manage this behavior with the Twisted reactor and Scrapy?
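For the middle step, the handoff via disk could look like the sketch below. The file name spider1_output.json, the JSON layout, and the "collect all url fields" transformation are all assumptions for illustration; they depend on what MySpider1's item pipeline or feed export actually writes:

```python
import json

def transformFunction(path="spider1_output.json"):
    """Read the items MySpider1 dumped to disk and build newdata.

    Hypothetical: assumes the spider wrote a JSON list of item dicts.
    """
    with open(path) as f:
        items = json.load(f)
    # Hypothetical transformation: keep only the 'url' fields.
    return [item["url"] for item in items if "url" in item]

# Simulate MySpider1's on-disk output, then transform it.
sample = [{"url": "http://example.com/a"}, {"title": "no url here"}]
with open("spider1_output.json", "w") as f:
    json.dump(sample, f)

newdata = transformFunction()
print(newdata)  # ['http://example.com/a']
```

Because transformFunction() is a plain blocking call, it can run between the two yielded crawls inside the inlineCallbacks-decorated crawl() without any extra reactor machinery.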
Tags: python asynchronous scrapy twisted yield