【Title】: Running more than one spider in a for loop
【Posted】: 2015-11-21 23:10:15
【Question】:

I am trying to instantiate multiple spiders. The first one works fine, but the second one gives me the error: ReactorNotRestartable.

feeds = {
    'nasa': {
        'name': 'nasa',
        'url': 'https://www.nasa.gov/rss/dyn/breaking_news.rss',
        'start_urls': ['https://www.nasa.gov/rss/dyn/breaking_news.rss']
    },
    'xkcd': {
        'name': 'xkcd',
        'url': 'http://xkcd.com/rss.xml',
        'start_urls': ['http://xkcd.com/rss.xml']
    }    
}

With the entries above, I try to run the two spiders in a loop, like this:

from scrapy.crawler import CrawlerProcess
from scrapy.spiders import XMLFeedSpider

class MySpider(XMLFeedSpider):

    name = None

    def __init__(self, **kwargs):

        this_feed = feeds[self.name]
        self.start_urls = this_feed.get('start_urls')
        self.iterator = 'iternodes'
        self.itertag = 'items'
        super(MySpider, self).__init__(**kwargs)

    def parse_node(self, response, node):
        pass


def start_crawler():
    process = CrawlerProcess({
        'USER_AGENT': CONFIG['USER_AGENT'],
        'DOWNLOAD_HANDLERS': {'s3': None} # boto issues
    })

    for feed_name in feeds.keys():
        MySpider.name = feed_name
        process.crawl(MySpider)
        process.start() 

The exception on the second iteration looks like this; the spider is opened, but then:

...
2015-11-22 00:00:00 [scrapy] INFO: Spider opened
2015-11-22 00:00:00 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-22 00:00:00 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-11-21 23:54:05 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
Traceback (most recent call last):
  File "env/bin/start_crawler", line 9, in <module>
    load_entry_point('feed-crawler==0.0.1', 'console_scripts', 'start_crawler')()
  File "/Users/bling/py-feeds-crawler/feed_crawler/crawl.py", line 51, in start_crawler
    process.start() # the script will block here until the crawling is finished
  File "/Users/bling/py-feeds-crawler/env/lib/python2.7/site-packages/scrapy/crawler.py", line 251, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

Do I have to invalidate the first MySpider somehow, or am I doing something wrong and need to change how this works? Thanks in advance.

【Discussion】:

    Tags: python scrapy twisted scrapy-spider


    【Solution 1】:

    It looks like you have to instantiate one process per spider; try:

    def start_crawler():      
    
        for feed_name in feeds.keys():
            process = CrawlerProcess({
                'USER_AGENT': CONFIG['USER_AGENT'],
                'DOWNLOAD_HANDLERS': {'s3': None} # boto issues
            })
            MySpider.name = feed_name
            process.crawl(MySpider)
            process.start() 
    

    【Discussion】:

    • That does make more sense, but it is still the same exception.
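    Why a fresh CrawlerProcess per spider still fails: Twisted's reactor is a process-global singleton, every `process.start()` call runs that same reactor, and once it has stopped it refuses to start again. A toy sketch of that restart guard (ToyReactor is illustrative only, not Twisted's actual implementation):

```python
class ReactorNotRestartable(Exception):
    pass

class ToyReactor:
    """Mimics Twisted's rule: a reactor runs at most once per process."""

    def __init__(self):
        self._stopped = False

    def run(self, work):
        if self._stopped:
            # this is the check that fires on the second process.start()
            raise ReactorNotRestartable()
        work()                # stand-in for the event loop driving a crawl
        self._stopped = True  # the reactor stops when the crawl finishes

# one process-global reactor, like `from twisted.internet import reactor`
reactor = ToyReactor()

reactor.run(lambda: None)      # first spider: fine
try:
    reactor.run(lambda: None)  # second spider: raises
except ReactorNotRestartable:
    print("ReactorNotRestartable")
```

    Creating a second CrawlerProcess does not help because both processes talk to the same module-level reactor underneath.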
    【Solution 2】:

    The solution was to collect the spiders in the loop and start the process only once, at the end. My guess is that it has to do with reactor allocation/deallocation.

    def start_crawler():
    
        process = CrawlerProcess({
            'USER_AGENT': CONFIG['USER_AGENT'],
            'DOWNLOAD_HANDLERS': {'s3': None} # disable for issues with boto
        })
    
        for feed_name in CONFIG['Feeds'].keys():
            MySpider.name = feed_name
            process.crawl(MySpider)
    
        process.start()
    

    Thanks @eLRuLL for your answer; it inspired me to find this solution.
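    The same collect-then-start-once pattern, sketched in plain modern-Python asyncio terms (the `crawl_feed` coroutine is a hypothetical stand-in for a spider, not Scrapy API): schedule every job first, then drive the event loop exactly once for all of them.

```python
import asyncio

async def crawl_feed(name):
    # stand-in for a spider crawl; real code would fetch and parse the feed
    await asyncio.sleep(0)
    return name

async def main():
    feeds = ['nasa', 'xkcd']
    # queue every crawl first ...
    tasks = [crawl_feed(name) for name in feeds]
    # ... then run them all under a single event-loop invocation
    return await asyncio.gather(*tasks)

results = asyncio.run(main())
print(results)  # ['nasa', 'xkcd']
```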

    【Discussion】:

      【Solution 3】:

      You can pass arguments in the crawl call and use them in the spider.

      class MySpider(XMLFeedSpider):
          def __init__(self, name, **kwargs):
              # set the name before calling super(), since Spider.__init__
              # raises ValueError when no name is available
              self.name = name
              super(MySpider, self).__init__(**kwargs)
      
      
      def start_crawler():      
          process = CrawlerProcess({
              'USER_AGENT': CONFIG['USER_AGENT'],
              'DOWNLOAD_HANDLERS': {'s3': None} # boto issues
          })
      
          for feed_name in feeds.keys():
              process.crawl(MySpider, feed_name)
      
          process.start()
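      What `process.crawl` does with the extra arguments, roughly: it stores the spider class together with its constructor arguments and only instantiates the spider when the crawl is set up. A toy sketch of that forwarding (ToyProcess and ToySpider are illustrative only, not Scrapy's implementation):

```python
class ToySpider:
    def __init__(self, name):
        self.name = name

class ToyProcess:
    def __init__(self):
        self._pending = []

    def crawl(self, spidercls, *args, **kwargs):
        # store the class plus its constructor arguments; nothing runs yet
        self._pending.append((spidercls, args, kwargs))

    def start(self):
        # instantiate every queued spider with its own arguments
        return [cls(*args, **kwargs) for cls, args, kwargs in self._pending]

process = ToyProcess()
for feed_name in ('nasa', 'xkcd'):
    process.crawl(ToySpider, feed_name)

spiders = process.start()
print([s.name for s in spiders])  # ['nasa', 'xkcd']
```

      This is why each queued spider keeps its own `feed_name` even though the process is started only once.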
      

      【Discussion】:
