【Posted】: 2018-02-11 21:44:47
【Question】:
I am using Scrapy to create a sample web crawler as a Nameko dependency provider, but it does not crawl any pages. Here is the code:
import scrapy
from scrapy import crawler
from nameko import extensions
from twisted.internet import reactor


class TestSpider(scrapy.Spider):
    name = 'test_spider'
    result = None

    def parse(self, response):
        TestSpider.result = {
            'heading': response.css('h1::text').extract_first()
        }


class ScrapyDependency(extensions.DependencyProvider):

    def get_dependency(self, worker_ctx):
        return self

    def crawl(self, spider=None):
        spider = TestSpider()
        spider.name = 'test_spider'
        spider.start_urls = ['http://www.example.com']
        self.runner = crawler.CrawlerRunner()
        self.runner.crawl(spider)
        d = self.runner.join()
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
        return spider.result

    def run(self):
        if not reactor.running:
            reactor.run()
Here are the logs:
Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
Enabled item pipelines:
[]
Spider opened
Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Closing spider (finished)
Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 9, 3, 12, 41, 40, 126088),
'log_count/INFO': 7,
'memusage/max': 59650048,
'memusage/startup': 59650048,
'start_time': datetime.datetime(2017, 9, 3, 12, 41, 40, 97747)}
Spider closed (finished)
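The log shows that it did not crawl a single page, whereas it was expected to crawl one page.
However, if I create a regular CrawlerRunner and crawl the page, I get the expected result, i.e. {'heading': 'Example Domain'}. Here is the code: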
import scrapy
from scrapy import crawler
from twisted.internet import reactor


class TestSpider(scrapy.Spider):
    name = 'test_spider'
    start_urls = ['http://www.example.com']
    result = None

    def parse(self, response):
        TestSpider.result = {'heading': response.css('h1::text').extract_first()}


def crawl():
    runner = crawler.CrawlerRunner()
    runner.crawl(TestSpider)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    reactor.run()


if __name__ == '__main__':
    crawl()
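I have been working on this for several days and cannot figure out why the Scrapy crawler fails to crawl the page when it is used as a Nameko dependency provider. Please point out where I am going wrong.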
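One difference between the two listings stands out: the first passes a spider instance (with start_urls set on that instance) to runner.crawl(), while the second passes the spider class. The sketch below is a Scrapy-free illustration of why that could matter. It assumes the runner builds its own spider through a classmethod factory, in the way Scrapy's Spider.from_crawler does. FakeSpider and fake_runner_crawl are hypothetical stand-ins invented for this illustration, not Scrapy APIs, and the real from_crawler also takes a crawler argument.

```python
class FakeSpider:
    """Hypothetical stand-in for scrapy.Spider (not the real class)."""

    # Class-level default, as for a spider subclass that defines no start_urls.
    start_urls = []

    def __init__(self, **kwargs):
        # Copy keyword arguments onto the new instance.
        self.__dict__.update(kwargs)

    @classmethod
    def from_crawler(cls, **kwargs):
        # A classmethod always receives the class, even when called on an instance.
        return cls(**kwargs)


def fake_runner_crawl(spider_or_cls, **kwargs):
    """Hypothetical runner: build a fresh spider, return the URLs it would request."""
    fresh = spider_or_cls.from_crawler(**kwargs)
    return list(fresh.start_urls)


# Pattern of the first listing: configure an instance, then hand it to the runner.
configured = FakeSpider()
configured.start_urls = ['http://www.example.com']
urls_from_instance = fake_runner_crawl(configured)

# Alternative: hand over the class and supply the URL as a keyword argument.
urls_from_class = fake_runner_crawl(FakeSpider, start_urls=['http://www.example.com'])

print(urls_from_instance)
print(urls_from_class)
```

In this sketch the attribute set on the instance never reaches the freshly built spider, which would be consistent with the "Crawled 0 pages" line in the log. Whether the same thing happens in the real setup depends on the Scrapy version in use, so treat this as a hypothesis to check rather than a confirmed diagnosis.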
【Comments】:
-
What are you trying to achieve with this? Setting the implementation aside, what is your actual requirement?
-
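I want this to be a dependency of a Nameko service method, meaning the Nameko microservice framework will call ScrapyDependency().crawl() to handle a request (a web-scraping request) and return the result. The problem is that no page gets crawled when it is used this way.
-
You are mixing Nameko and a Twisted server; I am not sure how well they gel together.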
-
Here you can find a similar Redis implementation for Nameko: github.com/etataurov/nameko-redis/blob/master/nameko_redis.py. I tried to follow a similar approach in my implementation.
Tags: python scrapy twisted nameko