How to get the number of requests in queue in scrapy? Answers

【Question title】: How to get the number of requests in queue in scrapy?
【Posted】: 2025-12-06 12:00:02
【Question description】:

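I am using scrapy to crawl some websites. How do I get the number of requests in the queue?

I looked at the scrapy source code and found that scrapy.core.scheduler.Scheduler might lead to my answer. See: https://github.com/scrapy/scrapy/blob/0.24/scrapy/core/scheduler.py

Two questions:

How do I access the scheduler from my spider class?
What do self.dqs and self.mqs mean in the scheduler class?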

【Question discussion】:

Did the suggested approach work for you?
@aberna I still don't know how to get the crawler's scheduler instance.

Tags: python scrapy

【Solution 1】:

This took me a while to figure out, but here is what I used:

self.crawler.engine.slot.scheduler

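That is the scheduler instance. You can then call its __len__() method (that is, len() on the instance), or, if you only need a true/false answer about pending requests, do something like this: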

self.crawler.engine.scheduler_cls.has_pending_requests(self.crawler.engine.slot.scheduler)

Note that there can still be requests running even when the queue is empty. To check how many requests are currently running, use:

len(self.crawler.engine.slot.inprogress)
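
The two lookups above can be wrapped in one small helper. This is only a sketch: queue_stats is a hypothetical name, and engine.slot, slot.scheduler and slot.inprogress are the internal attributes used in this answer, not documented public API, so they may differ between Scrapy versions. The stand-in objects exist only so the snippet runs without a live crawl.

```python
from types import SimpleNamespace


def queue_stats(crawler):
    """Return (queued, in_progress) request counts for a crawler.

    Assumes the internal engine.slot attributes used in the answer above;
    they are not public API and may differ between Scrapy versions.
    """
    slot = crawler.engine.slot
    return len(slot.scheduler), len(slot.inprogress)


# Stand-in objects so the sketch runs without a live crawl; inside a spider
# callback you would call queue_stats(self.crawler) instead.
fake_slot = SimpleNamespace(scheduler=["req1", "req2", "req3"], inprogress={"req4"})
fake_crawler = SimpleNamespace(engine=SimpleNamespace(slot=fake_slot))
queued, in_progress = queue_stats(fake_crawler)
```

A crawl is only idle when both numbers are zero, which is why the helper returns them together.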

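【Discussion】:

len(self.crawler.engine.slot.scheduler) works perfectly

【Solution 2】:

An approach to answering your questions:

From the documentation: http://readthedocs.org/docs/scrapy/en/0.14/faq.html#does-scrapy-crawl-in-breath-first-or-depth-first-order

By default, Scrapy uses a LIFO queue for storing pending requests, which basically means that it crawls in DFO order. This order is more convenient in most cases. If you do want to crawl in true BFO order, you can do it by setting the following settings: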

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

So self.dqs and self.mqs are, as the names suggest, the scheduler's disk queues and memory queues.
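
To make the LIFO versus FIFO point from the quoted FAQ concrete, here is a minimal standard-library sketch. It does not use Scrapy at all: LINKS is a made-up link graph and crawl_order is a hypothetical helper, meant only to show how the queue discipline changes the visit order.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": [],
    "F": [],
}


def crawl_order(start, lifo):
    """Return the visit order when pending pages sit in a LIFO or FIFO queue."""
    pending = deque([start])
    seen = {start}
    order = []
    while pending:
        # LIFO takes the newest pending page, FIFO takes the oldest one.
        page = pending.pop() if lifo else pending.popleft()
        order.append(page)
        for link in LINKS[page]:
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return order


depth_first = crawl_order("A", lifo=True)     # A, C, F, B, E, D
breadth_first = crawl_order("A", lifo=False)  # A, B, C, D, E, F
```

With the LIFO queue the crawl follows one branch to the bottom before backing up; with the FIFO queue it finishes each level before going deeper.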

Another SO answer (Storing scrapy queue in a database) suggests a way to access Scrapy's internal queue representation, queuelib: https://github.com/scrapy/queuelib

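Once you have it, you just need to count the length of the queue.

【Discussion】:

While this somewhat answers the second part of the question (what self.dqs and self.mqs are), it does nothing to help with accessing the scheduler from the spider, which is the main question asked.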

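【Solution 3】:

I found this question because I was trying to implement a progress bar for a scrapy spider, and thought I would share my findings. For current versions of scrapy (I am using 2.5), I recommend using signals with a custom extension (though this may depend on what you want to do with the total).

Basically, you want to bind to the request_scheduled signal and increase your total every time that signal fires, and bind to the request_dropped signal and decrease your total when that signal fires.

If you want to know how many requests are scheduled but not yet processed, you can do the same thing but also bind to the item_scraped signal and decrease the total as scheduled requests are processed (and possibly item_dropped as well, depending on the spider).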
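
The "scheduled but not yet processed" variant described above could look roughly like the following. It is a sketch under assumptions: PendingCounter is a hypothetical name, and decrementing on item_scraped only gives the right number if every request yields exactly one item. The handlers are called by hand here so the snippet runs without Scrapy; in a real extension they would be connected with crawler.signals.connect inside from_crawler.

```python
from collections import defaultdict
from types import SimpleNamespace


class PendingCounter:
    """Hypothetical helper: counts requests scheduled but not yet processed."""

    def __init__(self):
        self.pending = defaultdict(int)

    def request_scheduled(self, request, spider):
        # A request was handed to the scheduler.
        self.pending[spider.name] += 1

    def request_dropped(self, request, spider):
        # The scheduler rejected the request, so it will never be processed.
        self.pending[spider.name] -= 1

    def item_scraped(self, item, response, spider):
        # Assumption: one item per request, so one item means one request done.
        self.pending[spider.name] -= 1


# Simulate the signals by hand with a stand-in spider object.
spider = SimpleNamespace(name="demo")
counter = PendingCounter()
for _ in range(3):
    counter.request_scheduled(None, spider)
counter.request_dropped(None, spider)
counter.item_scraped({}, None, spider)
remaining = counter.pending["demo"]
```

If a spider yields several items per page, or none, a signal that fires once per response would be a better place to decrement.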

Here is an example extension that tracks the total number of requests queued per named spider:

from collections import defaultdict
from scrapy import signals
from scrapy.exceptions import NotConfigured

class QueueTotal:
    """scrapy extension to track the number of requests that have been queued."""

    def __init__(self):
        self.totals = defaultdict(int)
        self.items_scraped = defaultdict(int)

    @classmethod
    def from_crawler(cls, crawler):
        # first check if the extension should be enabled and raise
        # NotConfigured otherwise
        if not crawler.settings.getbool("QUEUETOTAL_ENABLED"):
            raise NotConfigured

        # instantiate the extension object
        ext = cls()
        # connect the extension object to signals
        crawler.signals.connect(ext.request_scheduled, signal=signals.request_scheduled)
        crawler.signals.connect(ext.request_dropped, signal=signals.request_dropped)

        # return the extension object
        return ext

    def request_scheduled(self, request, spider):
        # increase total when new requests are scheduled
        self.totals[spider.name] += 1

    def request_dropped(self, request, spider):
        # decrease total when requests are dropped
        self.totals[spider.name] -= 1
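
For the extension to load, it has to be registered and switched on in the project settings. A minimal sketch, assuming the class lives in a module called myproject.extensions (a hypothetical path; adjust it to wherever QueueTotal is defined):

```python
# settings.py (sketch)

# "myproject.extensions" is a hypothetical module path; point it at the
# module that actually contains QueueTotal. The number is the load order.
EXTENSIONS = {
    "myproject.extensions.QueueTotal": 500,
}

# from_crawler() raises NotConfigured unless this flag is true.
QUEUETOTAL_ENABLED = True
```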

【Discussion】: