Python Scrapy: Return list of URLs scraped

[Question title]: Python Scrapy: Return list of URLs scraped
[Posted]: 2020-04-29 20:43:39
[Question]:

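I am using Scrapy to scrape all the links from a single domain. I follow every link that stays on the domain and save every link that points off it. The scraper below works fine, but I can't get at the scraper's member variables afterwards, because I run it with CrawlerProcess.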

import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    # Class-level sets, shared by every instance of the spider.
    on_domain_urls = set()
    off_domain_urls = set()

    def parse(self, response):
        links = response.xpath('//a/@href')
        for link in links:
            url = link.get()
            if 'example.com' in url:
                # Follow each on-domain link once; skip ones already seen.
                if url not in self.on_domain_urls:
                    print('On domain links found: {}'.format(
                        len(self.on_domain_urls)))
                    self.on_domain_urls.add(url)
                    yield scrapy.Request(url, callback=self.parse)
            elif url not in self.off_domain_urls:
                # Record off-domain links without following them.
                print('Off domain links found: {}'.format(
                    len(self.off_domain_urls)))
                self.off_domain_urls.add(url)


process = CrawlerProcess()
process.crawl(MySpider)
process.start()  # blocks until the crawl has finished
# Need access to off_domain_urls

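How can I access off_domain_urls? I could probably move it to global scope, but that seems like a hack. I could also append to a file, but I would like to avoid file I/O if possible. Is there a better way to return aggregated data like this?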
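
A side note on the domain test in the spider: the substring check 'example.com' in url also matches hosts such as notexample.com and off-domain URLs that merely mention the domain in a query string, and relative links like /about never match at all. The following is a minimal sketch of a stricter split using only the standard library. The helper names is_on_domain and classify_links are hypothetical, and the rule "the host equals the domain or is a subdomain of it" is an assumption about what "on domain" should mean here.

```python
from urllib.parse import urljoin, urlparse


def is_on_domain(url, domain="example.com"):
    """Return True if the URL's host is the domain itself or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def classify_links(base_url, hrefs, domain="example.com"):
    """Split raw href values into (on_domain, off_domain) sets of absolute URLs."""
    on_domain, off_domain = set(), set()
    for href in hrefs:
        # Resolve relative links against the page they were found on.
        absolute = urljoin(base_url, href)
        # Ignore non-web schemes such as mailto: or javascript:.
        if urlparse(absolute).scheme not in ("http", "https"):
            continue
        if is_on_domain(absolute, domain):
            on_domain.add(absolute)
        else:
            off_domain.add(absolute)
    return on_domain, off_domain
```

Inside parse, the hrefs could come from response.xpath('//a/@href').getall() and base_url from response.url.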

[Comments]:

Tags: python python-3.x web-scraping scrapy

[Solution 1]:

Have you looked at item pipelines? I think you need to use one in this case and decide there what should happen to the values you collect.

See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
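
To make the suggestion concrete, here is a minimal sketch of such a pipeline, not a definitive implementation. An item pipeline is a plain Python class with a process_item(item, spider) method, so the sketch needs no Scrapy import of its own. The class name OffDomainUrlPipeline and the item key 'off_domain_url' are hypothetical choices made for this illustration.

```python
class OffDomainUrlPipeline:
    """Collects every off-domain URL that the spider yields as an item."""

    # Class-level set, so the calling script can still read it
    # after the crawl has finished.
    urls = set()

    def process_item(self, item, spider):
        # Scrapy calls this once per yielded item. The item key
        # 'off_domain_url' is an assumption made for this sketch.
        url = item.get("off_domain_url")
        if url is not None:
            type(self).urls.add(url)
        # Return the item so that any later pipeline still receives it.
        return item
```

For this to work, the spider would yield {'off_domain_url': url} in its off-domain branch instead of only adding to its own set, and the pipeline would be enabled through the ITEM_PIPELINES setting, for example custom_settings = {'ITEM_PIPELINES': {'__main__.OffDomainUrlPipeline': 300}} on the spider, assuming the class lives in the script being run. Once process.start() returns, OffDomainUrlPipeline.urls should hold the aggregated links without any file I/O.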

[Discussion]: