How to return data from a Scrapy spider? Answers

【Question title】: How to return data from scrapy spider?
【Posted】: 2021-04-10 00:39:46
【Question】:

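I'm a bit confused about how exactly to return the output of a Scrapy spider so that I can use it in another function or in the global scope. In the code below I tried to return the res variable as you normally would from a function, but that doesn't seem to work with Scrapy. Instead, I get the following error for every url in my list: Spider must return request, item, or None, got 'str'

Thanks for taking the time to look into this!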

import scrapy
from scrapy.crawler import CrawlerProcess
import logging

#disable logging for scrapy - by default verbose as hell
logging.getLogger('scrapy').propagate = False

#create the spider
class feedSpider(scrapy.Spider):

    #the spider needs a name
    name="scraper"

    # define the sources we're about to crawl
    def start_requests(self):
        urls = [feed for feed in feeds]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # parse the response
    def parse(self, response):

        # Select the first headline from each RSS feed
        res = response.xpath('//item/title/text()').get()
        return res

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})


process.crawl(feedSpider)
 # the script will block here until the crawling is finished
process.start()

【Comments】:

Can you just return response.xpath('//item/title/text()') and call .get() outside this function?
Just did, but the error persists: Spider must return request, item, or None, got 'Selector'

Tags: scrapy web-crawler

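【Solution 1】:

"... and instead returns the following error for every url in my list: Spider must return request, item, or None, got 'str'"

Well, that's it, isn't it :-)

What you really want is an item. Trust me! Pick whichever way of defining items you like. Let the crawler crawl and produce items. Don't make it waste time on anything else.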

Put the post-processing of item data somewhere else. Item loaders have input and output processors, and the item pipelines concept is really cool.
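
To give a rough idea of post-processing in a pipeline, here is a small sketch. `CleanTitlePipeline` is a made-up name, and it assumes items carry a `title` key. A pipeline is a plain class with a `process_item` method, so it can be called by hand to see what it does.

```python
class CleanTitlePipeline:
    """Hypothetical pipeline: tidy up the title after the spider yields the item."""

    def process_item(self, item, spider):
        title = item.get("title")
        if title is not None:
            # Collapse runs of whitespace and trim both ends
            item["title"] = " ".join(title.split())
        return item


# Called by hand here; during a crawl Scrapy calls process_item for each item
cleaned = CleanTitlePipeline().process_item({"title": "  Breaking \n news  "}, spider=None)
print(cleaned["title"])
```

In a project this would be enabled through the `ITEM_PIPELINES` setting, with the dotted path to wherever the class lives (for instance `myproject.pipelines.CleanTitlePipeline`, a made-up path) mapped to an order number such as 300.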

But again, what you want to start with is an item!

Good luck and have fun!

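【Discussion】:

Thanks, this is my first time using Scrapy, so I'm a bit lost on how it works! Could you provide an example of how I would define and use items in this case?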

【Solution 2】:

A simpler way to solve this is to return a dictionary instead of a bare value:

data = { } 
data['title'] = response.xpath('//item/title/text()').get()
yield data

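If you want to return multiple things from the parse function in a non-blocking way, yield is the ideal choice. If you only have one thing to return, it makes no difference whether you use yield or return.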

【Discussion】: