【Posted】: 2021-04-10 00:39:46
【Problem description】:
I'm a bit confused about how to actually return the output of a Scrapy spider so that I can use it in another function or in the global scope. In the code below I tried returning the res variable the way I would from an ordinary function, but Scrapy doesn't work that way; instead, for every URL in my list it raises the following error: Spider must return request, item, or None, got 'str'
Thank you for taking the time to look into this!
import scrapy
from scrapy.crawler import CrawlerProcess
import logging

# disable logging for scrapy - verbose as hell by default
logging.getLogger('scrapy').propagate = False

# create the spider
class feedSpider(scrapy.Spider):
    # the spider needs a name
    name = "scraper"

    # define the sources we're about to crawl
    def start_requests(self):
        urls = [feed for feed in feeds]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # parse the response
    def parse(self, response):
        # Select the first headline from each RSS feed
        res = response.xpath('//item/title/text()').get()
        return res

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(feedSpider)
# the script will block here until the crawling is finished
process.start()
【Comments】:
- Could you just return response.xpath('//item/title/text()') and call .get() on it outside that function?
- Just tried that, but the error persists: Spider must return request, item, or None, got 'Selector'
标签: scrapy web-crawler