ReactorNotRestartable - Twisted and scrapy

【Question title】: ReactorNotRestartable - Twisted and scrapy
【Posted】: 2017-11-30 23:11:08
【Question description】:

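Before you link me to other answers related to this, please note that I have read them and am still a bit confused. OK, here we go.

I am creating a webapp in Django. I am importing the latest scrapy library to crawl a website. I am not using Celery (I know very little about it, but I have seen it mentioned in other threads on this topic).

One of our site's URLs, /crawl/, is meant to start the crawler running. It is the only URL on our site that needs scrapy. Here is the function that is called when the URL is visited: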

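from django.shortcuts import render
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor
# ReviewSpider is the project's own spider class; its import is not shown in the question.

def crawl(request):
  configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
  runner = CrawlerRunner()

  d = runner.crawl(ReviewSpider)
  d.addBoth(lambda _: reactor.stop())
  reactor.run() # the script will block here until the crawling is finished

  return render(request, 'index.html')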

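You will notice that this is an adaptation of the scrapy tutorial on their website. The first time this URL is visited after the server starts, everything works as expected. The second time and onward, a ReactorNotRestartable exception is thrown. As I understand it, this exception occurs when a reactor that has already been stopped is told to start again, which is not possible.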

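Looking at the sample code, I assumed that the line "runner = CrawlerRunner()" would return a ~new~ reactor each time this URL is visited. But I suspect my understanding of Twisted reactors is not entirely clear.

How would I get and run a new reactor each time this URL is visited?

Thanks very much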

【Question comments】:

Tags: django web-applications scrapy twisted reactor

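【Solution 1】:

In general, you can't have a new reactor. There is one global one. This is clearly a mistake and maybe it will be corrected in the future, but that is the current state of affairs.

You may be able to use Crochet to manage a single reactor running in a separate thread (for the lifetime of the whole process, not repeatedly started and stopped).

Consider the example from the Crochet docs: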

#!/usr/bin/python
"""
Do a DNS lookup using Twisted's APIs.
"""
from __future__ import print_function

# The Twisted code we'll be using:
from twisted.names import client

from crochet import setup, wait_for
setup()


# Crochet layer, wrapping Twisted's DNS library in a blocking call.
@wait_for(timeout=5.0)
def gethostbyname(name):
    """Lookup the IP of a given hostname.

    Unlike socket.gethostbyname() which can take an arbitrary amount of time
    to finish, this function will raise crochet.TimeoutError if more than 5
    seconds elapse without an answer being received.
    """
    d = client.lookupAddress(name)
    d.addCallback(lambda result: result[0][0].payload.dottedQuad())
    return d


if __name__ == '__main__':
    # Application code using the public API - notice it works in a normal
    # blocking manner, with no event loop visible:
    import sys
    name = sys.argv[1]
    ip = gethostbyname(name)
    print(name, "->", ip)

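This gives you a blocking gethostbyname function implemented using Twisted APIs. The implementation uses twisted.names.client, which relies only on being able to import the global reactor.

Note that there are no reactor.run or reactor.stop calls, only the Crochet setup call.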

【Comments】:

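But how would I do this for a Django project? How do I make the reactor start when the site starts and end when the site shuts down? And how do I reference it each time the crawler needs to run afterwards?
To answer your first question: that is what Crochet does. :) The answer to the second part could take many forms: perhaps creating an object that holds a reference to the reactor, or perhaps just relying on the global reactor import, which always gives you the same reactor.
It looks like this is deceptively simple. Hopefully it really is that simple. I just commented out these lines: d.addBoth(lambda _: reactor.stop()) reactor.run() and added the import and the setup call to the top of the file. It seems to work smoothly. I don't have a great understanding of reactors, so hopefully there is nothing I'm missing. Thanks though!
Can you show how you did it? Perhaps in an answer; that would be very helpful, thanks.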