【Title】: Running Scrapy from a script - Hangs
【Posted】: 2011-09-23 13:22:30
【Question】:

I'm trying to run Scrapy from a script, as discussed here. It suggests using this snippet, but when I do, it hangs indefinitely. This was written back in version 0.10; is it still compatible with the current stable release?

【Comments】:

  • This question and its answers may be due for an update. Here is a recent snippet from Scrapy. It works, but for me the question becomes: how do I stop the Twisted reactor and move on once the crawl is finished?

Tags: python scrapy


【Solution 1】:
from scrapy import signals, log
from scrapy.spider import BaseSpider # needed below; missing from the original snippet
from scrapy.xlib.pydispatch import dispatcher
from scrapy.crawler import CrawlerProcess
from scrapy.conf import settings
from scrapy.http import Request

def handleSpiderIdle(spider):
    '''Handle spider idle event.''' # http://doc.scrapy.org/topics/signals.html#spider-idle
    print '\nSpider idle: %s. Restarting it... ' % spider.name
    for url in spider.start_urls: # reschedule start urls
        spider.crawler.engine.crawl(Request(url, dont_filter=True), spider)

mySettings = {'LOG_ENABLED': True, 'ITEM_PIPELINES': ['mybot.pipeline.validate.ValidateMyItem']} # global settings http://doc.scrapy.org/topics/settings.html

settings.overrides.update(mySettings)

crawlerProcess = CrawlerProcess(settings)
crawlerProcess.install()
crawlerProcess.configure()

class MySpider(BaseSpider):
    name = 'my_spider' # every spider needs a unique name
    start_urls = ['http://site_to_scrape']
    def parse(self, response):
        yield item # placeholder: build and yield your Item instances here

spider = MySpider() # create a spider ourselves
crawlerProcess.queue.append_spider(spider) # add it to spiders pool

dispatcher.connect(handleSpiderIdle, signals.spider_idle) # use this if you need to handle idle event (restart spider?)

log.start() # depends on LOG_ENABLED
print "Starting crawler."
crawlerProcess.start()
print "Crawler stopped."

Update:

If you also need per-spider settings, see this example:

for spiderConfig in spiderConfigs:
    spiderConfig = spiderConfig.copy() # a dictionary similar to the one with global settings above
    spiderName = spiderConfig.pop('name') # name of the spider is in the configs - i can use the same spider in several instances - giving them different names
    spiderModuleName = spiderConfig.pop('spiderClass') # module with the spider is in the settings
    spiderModule = __import__(spiderModuleName, {}, {}, ['']) # import that module
    SpiderClass = spiderModule.Spider # spider class is named 'Spider'
    spider = SpiderClass(name = spiderName, **spiderConfig) # create the spider with given particular settings
    crawlerProcess.queue.append_spider(spider) # add the spider to spider pool
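The `__import__(name, {}, {}, [''])` call above is the old trick for getting the leaf module of a dotted path rather than the top-level package; `importlib.import_module` is the clearer stdlib equivalent (shown here with a stand-in stdlib module path, since the `scraper.spiders.*` modules are project-specific):

```python
import importlib

# equivalent of: spiderModule = __import__('pkg.sub.module', {}, {}, [''])
module = importlib.import_module('json.decoder')  # stand-in dotted path
SpiderClass = getattr(module, 'JSONDecoder')      # like spiderModule.Spider
print(module.__name__)  # -> json.decoder
```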

Example settings in a spider's config file:

name = plunderhere_com    
allowed_domains = plunderhere.com
spiderClass = scraper.spiders.plunderhere_com
start_urls = http://www.plunderhere.com/categories.php?
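Those settings files are plain `key = value` lines, so a few lines of stdlib code can turn each one into the `spiderConfig` dictionary used in the loop above — a sketch, assuming one file per spider (`load_spider_config` is a hypothetical helper, not part of Scrapy):

```python
def load_spider_config(text):
    """Parse 'key = value' lines into a dict; ignores blanks and comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        key, _, value = line.partition('=')
        config[key.strip()] = value.strip()
    return config

sample = """
name = plunderhere_com
allowed_domains = plunderhere.com
spiderClass = scraper.spiders.plunderhere_com
start_urls = http://www.plunderhere.com/categories.php?
"""
spiderConfig = load_spider_config(sample)
print(spiderConfig['name'])  # -> plunderhere_com
```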

【Discussion】:

  • I get this traceback. My Scrapy project is named scraper. Could that be the problem?
  • I think that's the problem. This is from a real project; you can remove the references to scraper. You only need some settings for the spider.
  • So after I remove the references to scraper, how do I import my project's settings?
  • I added some comments. You need to make a few changes for this to work: have a working pipeline, fully implement the MySpider class, and set all the necessary settings.