How Do You Pass New URLs to a Scrapy Crawler: Answers

【Question Title】: How Do You Pass New URLs to a Scrapy Crawler
【Posted】: 2013-05-18 06:30:48
【Question】:

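I want to keep a Scrapy crawler running continuously inside a Celery task worker, probably using something like this, or as suggested in the docs. The idea is to use the crawler to query an external API that returns XML responses. I would like to pass the crawler the URL I want to query (or the query parameters, and let the crawler build the URL); the crawler would make the request and hand the extracted items back to me. Once the crawler is running, how do I pass it each new URL I want fetched? I don't want to restart the crawler every time I give it a new URL. Instead, I want the crawler to sit idle, waiting for URLs to crawl.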

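The two methods I have found for running Scrapy from inside another Python process both start a new process to run the crawler. I don't want to fork and tear down a new process every time I want to crawl a URL, since that is quite expensive and unnecessary.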

【Comments】:

Tags: python django multithreading celery scrapy

【Solution 1】:

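Just have a spider that polls a database (or a file?) and, whenever a new URL shows up, creates and yields a new Request() object for it.

You can build this by hand easily enough. There is probably a better way to do it, but this is basically what I did for an open-proxy scraper. The spider gets a list of all the "potential" proxies from the database and generates a Request() object for each one. When the responses come back, they are dispatched down the chain and verified by downstream middleware, and their records are handled by the item pipeline.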
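The polling idea above can be sketched without Scrapy at all. The following is a minimal illustration, assuming a SQLite table as the shared store; the table name `pending_urls`, its columns, and the helper names `init_queue`, `add_url`, and `claim_new_urls` are all hypothetical and do not come from the answer.

```python
import sqlite3
from typing import Iterator


def init_queue(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) pending_urls table if it does not exist."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pending_urls ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "url TEXT NOT NULL, "
        "claimed INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()


def add_url(conn: sqlite3.Connection, url: str) -> None:
    """Producer side: enqueue a URL for the spider to pick up later."""
    conn.execute("INSERT INTO pending_urls (url) VALUES (?)", (url,))
    conn.commit()


def claim_new_urls(conn: sqlite3.Connection) -> Iterator[str]:
    """Consumer side: mark each unclaimed row as claimed, then yield its URL."""
    rows = conn.execute(
        "SELECT id, url FROM pending_urls WHERE claimed = 0 ORDER BY id"
    ).fetchall()
    for row_id, url in rows:
        conn.execute("UPDATE pending_urls SET claimed = 1 WHERE id = ?", (row_id,))
        conn.commit()
        yield url
```

Inside a spider, each URL yielded by `claim_new_urls` would be wrapped in a `scrapy.Request(url, callback=...)`. Scrapy's `spider_idle` signal together with the `DontCloseSpider` exception is one commonly used way to keep a spider alive between polls; that wiring is left out here.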

【Discussion】:

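Yes, I considered something like that, even using github.com/darkrho/scrapy-redis, but I plan to run the crawler itself as a Celery task, which I think will be easier to manage. I may need to think a bit more about whether having it poll Redis while running inside Celery has too much potential to turn into a mess. The main reason I want to keep Celery is that it has many tools for managing workers and creating workflows (such as canvas). So, any thoughts on the original question?
An alternative to external polling is to augment scrapyd. Note that it has a JSON (and other) API that you can connect to in order to start and stop jobs, and so on. Rather than trying to modify a running spider, maybe you could just set up some server pooling and start new instances of a generic spider? That way you avoid any third-party arbitration and keep everything under one roof. At some point I got halfway through integrating github.com/jrydberg/txgossip into scrapyd; my idea was to create a peer-to-peer "clown computer" for scraping that you could manage by injecting new "jobs".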
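As a rough illustration of the scrapyd route mentioned in the comment above, the sketch below builds (but does not send) a POST request for scrapyd's `schedule.json` endpoint, which takes `project` and `spider` and passes any extra parameters on as spider arguments. The helper name `build_schedule_request` and the example project, spider, and argument names are assumptions made for this example. Note that this starts a new job per URL, which is the commenter's alternative rather than what the question asked for.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_schedule_request(
    base_url: str, project: str, spider: str, **spider_args: str
) -> Request:
    """Build, without sending, a POST to scrapyd's schedule.json endpoint."""
    # Extra keyword arguments are forwarded by scrapyd as spider arguments.
    payload = {"project": project, "spider": spider, **spider_args}
    return Request(
        url=f"{base_url.rstrip('/')}/schedule.json",
        data=urlencode(payload).encode("utf-8"),
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit the job to a running scrapyd instance, for example one listening on `http://localhost:6800`.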

【Solution 2】:

You can use a message queue (such as IronMQ; full disclosure, I work as a developer evangelist for the company that makes IronMQ) to pass the URLs in.

Then, in your crawler, poll the queue for URLs and crawl based on the messages you retrieve.

The example you linked could be updated like this (it is untested pseudocode, but you should get the basic idea):

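import time  # needed for time.sleep() in the polling loop

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider
from iron_mq import IronMQ  # a hyphen is not valid in a module name; the package imports as iron_mq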

mq = IronMQ()
q = mq.queue("scrape_queue")
crawler = Crawler(Settings())
crawler.configure()
while True: # poll forever
    # timeout is the number of seconds the message stays reserved, so that no
    # other crawler gets it. Set it to a safe value (the maximum amount of time
    # it will take you to crawl a page).
    msg = q.get(timeout=120) # get messages from queue
    if len(msg["messages"]) < 1: # if there are no messages waiting to be crawled
        time.sleep(1) # wait one second
        continue # try again
    spider = FollowAllSpider(domain=msg["messages"][0]["body"]) # crawl the domain in the message
    crawler.crawl(spider)
    crawler.start()
    log.start()
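    # NOTE: a Twisted reactor cannot be restarted once it has stopped, so a real
    # implementation would have to restructure this loop around a single run.
    reactor.run() # the script will block here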
    q.delete(msg["messages"][0]["id"]) # when you're done with the message, delete it
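The poll, process, repeat cycle in the pseudocode above can be illustrated independently of IronMQ and Scrapy. The sketch below uses the standard library's `queue.Queue` as a stand-in for the message queue; the function name `drain_urls`, the `None` stop sentinel, and the `max_idle_polls` parameter are assumptions made for this example and are not part of either library.

```python
import queue
from typing import Callable, Optional


def drain_urls(
    url_queue: "queue.Queue[Optional[str]]",
    handle_url: Callable[[str], None],
    poll_seconds: float = 1.0,
    max_idle_polls: Optional[int] = None,
) -> int:
    """Poll url_queue, pass each URL to handle_url, and return the count handled.

    A None item acts as a stop sentinel. If max_idle_polls is set, the loop
    also stops after that many consecutive empty polls.
    """
    handled = 0
    idle_polls = 0
    while True:
        try:
            # Block for up to poll_seconds waiting for the next URL.
            url = url_queue.get(timeout=poll_seconds)
        except queue.Empty:
            idle_polls += 1
            if max_idle_polls is not None and idle_polls >= max_idle_polls:
                return handled
            continue
        idle_polls = 0
        if url is None:
            return handled
        handle_url(url)
        handled += 1
```

In a real worker, `handle_url` would be whatever schedules the crawl for that URL, and the queue would be replaced by the message-queue client, with the message deleted only after the crawl succeeds.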

【Discussion】:

Which file would you call the spider from?