Is it possible to do this code pattern in scrapy? Answers

【Question title】: Is it possible to do this code pattern in scrapy?
【Posted】: 2014-12-25 19:25:36
【Question description】:

Using scrapy, I want to first collect URLs from some pages, then parse each URL found and yield an item.

For example, the code looks like this:

def parse(self, response):
    # collect urls first
    urls = self.collect_urls(response)

    # parse urls found
    for url in urls:
        self.parse_url(url) # will yield Item inside


def collect_urls(self, response):
    urls = response.meta.get('urls')
    if urls is None:
        urls = set()

    # do some logic of collecting urls from response into urls set
    # ...

    if is_still_has_data(response):
        # continue collecting urls in other page
        yield scrapy.FormRequest(response.url, formdata={'dummy':'dummy1'}, 
            meta={'urls': urls}, callback=self.collect_urls)
    else:
        return urls     # error here

The problem is that I can't return an object from inside a function that contains yield.

So I made urls a class attribute/member, like this:

urls = set()

def parse(self, response):
    # collect urls first
    yield self.collect_urls(response)

    # parse urls found
    for url in urls:
        self.parse_url(url) # will yield Item inside


def collect_urls(self, response):
    # do some logic of collecting urls from response into urls set
    # ...

    if is_still_has_data(response):
        # continue collecting urls in other page
        return scrapy.FormRequest(response.url, formdata={'dummy':'dummy1'}, 
            callback=self.collect_urls)

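The problem with this code is that after yield self.collect_urls(response) is called, execution continues straight on to the for url in urls: part instead of waiting for the collect_urls function to finish. If I remove the yield, collect_urls is only called once and the callback in the FormRequest never fires. It seems the callback only works when the FormRequest is yielded.

I know I could move the for url in urls: part into the collect_urls function, but I would like to know whether the code pattern I want is possible in scrapy.

【Question discussion】:

Tags: python web-scraping scrapy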

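【Solution 1】:

When you have a function that yields something, you have basically turned it into a Python generator, and you can't really return anything from it anymore.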
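A small illustration of this point, independent of Scrapy. Under Python 2, which was current when this was asked, a return with a value inside a generator is a syntax error. Under Python 3 it is allowed, but the value is attached to StopIteration and never reaches code that simply iterates the generator, which is how a callback's output is consumed. The function name collect and the string values below are hypothetical stand-ins, not Scrapy objects.

```python
def collect(has_more):
    """Hypothetical stand-in for a callback that mixes yield and return."""
    if has_more:
        yield "request-for-next-page"
    else:
        return {"url-a", "url-b"}  # legal in Python 3, but see below


# Calling the function never hands back the set; it hands back a generator.
gen = collect(has_more=False)
items = list(gen)  # plain iteration silently drops the returned value

# The returned value is only reachable through StopIteration.value.
gen2 = collect(has_more=False)
try:
    next(gen2)
    returned = None
except StopIteration as stop:
    returned = stop.value
```

So even on Python 3, a consumer that only loops over the generator sees nothing from the else branch.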

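However, even though you can't return a list of items after yielding a request or an item, if you have a sequence you would like to return, you can iterate over it and yield each element instead: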

def some_callback(self, response):
    # ... yield something here

    requests = get_next_requests_list(response)

    # can't return requests list, so we iterate and yield:
    for req in requests:
        yield req

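Also, Scrapy will only follow requests and collect items that are yielded by a callback. So if you want to trigger one callback from inside another, you also have to yield the results of calling it: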

def some_callback(self, response):
    # ... do stuff here, yields a few items or requests

    for rr in another_callback(response):
        yield rr
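On Python 3, the loop-and-yield used in both snippets above can also be written with yield from. A minimal sketch, assuming plain functions and using strings in place of real Scrapy items and requests (all names and values here are hypothetical):

```python
def another_callback(response):
    # Hypothetical callback: yields two "items" derived from the response.
    yield "item-1 from " + response
    yield "item-2 from " + response


def some_callback(response):
    yield "item-0 from " + response
    # Delegate: re-yield everything another_callback produces, in order.
    yield from another_callback(response)


results = list(some_callback("page"))
```

The effect is the same as the explicit for loop; merely calling another_callback(response) without yielding its results would create a generator that nobody consumes.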

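I hope this helps solve your problem.

【Discussion】:

Thanks, so I guess it isn't possible.

【Solution 2】:

After some experimenting, I think this code pattern isn't possible, because a request's callback cannot return control to the original caller/yielder of the request.

One workaround is to chain the callbacks until no more URLs are found, and then parse each URL that was found: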

def parse(self, response):
    urls = response.meta.get('urls')
    if urls is None:
        urls = set()

    # do some logic of collecting urls from response into urls set
    # ...

    if is_still_has_data(response):
        # continue collecting urls in other page
        return scrapy.FormRequest(response.url, formdata={'dummy':'dummy1'}, 
            meta={'urls': urls}, callback=self.parse)
    else:
        return self.do_loop_urls(urls)


def do_loop_urls(self, urls):
    # parse urls found
    for url in urls:
        # assumes parse_url is a regular callback that takes a response and yields the Items
        yield scrapy.Request(url, callback=self.parse_url)

Assuming there are 3 pages, the flow looks like this:

parse -> parse -> parse -> do_loop_urls
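The flow above can be traced with a toy model. This is not Scrapy: the loop in run only imitates an engine that calls a callback and feeds any follow-up requests back into a queue. The page contents, the tuple-shaped "requests", and the run helper are all made up for illustration.

```python
from collections import deque

# Toy model, NOT Scrapy: a "request" is a (page_number, callback, meta) tuple.
PAGES = {1: ["a", "b"], 2: ["c"], 3: ["d"]}  # made-up page contents
LAST_PAGE = 3
trace = []


def parse(page, meta):
    trace.append("parse")
    urls = meta.get("urls", set())
    urls.update(PAGES[page])
    if page < LAST_PAGE:
        # More data: chain another request back into parse, carrying urls along.
        return [(page + 1, parse, {"urls": urls})]
    return do_loop_urls(urls)


def do_loop_urls(urls):
    trace.append("do_loop_urls")
    for url in sorted(urls):
        yield "item:" + url


def run(start):
    queue, items = deque([start]), []
    while queue:
        page, callback, meta = queue.popleft()
        for result in callback(page, meta):
            if isinstance(result, tuple):
                queue.append(result)  # follow-up request
            else:
                items.append(result)  # finished item
    return items


items = run((1, parse, {}))
```

Control never returns to an earlier parse call; each call only hands the accumulated set forward through meta, and the last one starts the loop over the URLs.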

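【Discussion】:

You know, it's better to learn to use a tool than to try to bend it to your current way of thinking. Scrapy handles requests the way it does in order to better support asynchronous requests. If you really want pseudo-synchronous style code, you can use scrapy-inline-requests.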