【Posted】: 2012-11-06 09:45:58
【Question】:
I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found two sources that explain how to do this:
http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/
http://snipplr.com/view/67006/using-scrapy-from-a-script/
I can't figure out where to put my spider code or how to call it from the main function. Please help. Here is the example code:
# This snippet can be used to run scrapy spiders independent of scrapyd or the scrapy command line tool and use it from a script.
#
# The multiprocessing library is used in order to work around a bug in Twisted, in which you cannot restart an already running reactor or in this case a scrapy instance.
#
# [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet.
#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings')  # Must be set before other Scrapy imports

from scrapy import log, signals, project
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process, Queue


class CrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def _crawl(self, queue, spider_name):
        spider = self.crawler.spiders.create(spider_name)
        if spider:
            self.crawler.queue.append_spider(spider)
        self.crawler.start()
        self.crawler.stop()
        queue.put(self.items)

    def crawl(self, spider):
        queue = Queue()
        p = Process(target=self._crawl, args=(queue, spider,))
        p.start()
        p.join()
        return queue.get(True)


# Usage
if __name__ == "__main__":
    log.start()

    """
    This example runs spider1 and then spider2 three times.
    """
    items = list()
    crawler = CrawlerScript()
    items.append(crawler.crawl('spider1'))
    for i in range(3):
        items.append(crawler.crawl('spider2'))
    print items

# Snippet imported from snippets.scrapy.org (which no longer works)
# author: joehillen
# date  : Oct 24, 2010
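The heart of the snippet above is running each crawl in a throwaway child process and shipping the results back over a `Queue`, because Twisted's reactor cannot be restarted inside one process. Stripped of Scrapy, the same pattern looks like this (a minimal sketch; `run_once` is a hypothetical stand-in for starting a crawler):

```python
from multiprocessing import Process, Queue


def run_once(queue, name):
    # Stand-in for a one-shot task (e.g. starting a crawler) that cannot
    # be restarted: running it in a child process gives every call a
    # fresh interpreter state.
    results = [f"{name}-item-{i}" for i in range(3)]
    queue.put(results)


def crawl(name):
    queue = Queue()
    p = Process(target=run_once, args=(queue, name))
    p.start()
    items = queue.get()  # read before join(): a child blocked on a full queue pipe never exits
    p.join()
    return items


if __name__ == "__main__":
    print(crawl("spider1"))  # each call runs in a fresh process
    print(crawl("spider2"))
```

Note that the original snippet calls `p.join()` before `queue.get(True)`; reading from the queue first is the safer ordering, since joining a child that is still blocked on a full queue pipe can deadlock for large result sets.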
Thanks.
【Discussion】:
-
I replaced the inappropriate tag data-mining (= advanced data analysis) with web-scraping. To improve your question, make sure it includes: what did you try, and what happened when you tried it!
-
Those examples are outdated; they no longer work with current Scrapy.
-
Thanks for the comment. What do you suggest I do to call a spider from a script? I'm using the latest Scrapy.
-
Cross-referencing this answer - it should give you a detailed overview of how to run Scrapy from a script.
-
AttributeError: module 'scrapy.log' has no attribute 'start'
标签: python web-scraping web-crawler scrapy