Scrapy/Python: wait for a yield request to finish before executing code (image downloads): answers

【Question title】: Scrapy/Python: wait for a yield request to finish before executing code (image downloads)
【Posted】: 2022-01-05 03:55:17
【Question description】:

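I have a project whose goal is to scrape all the chapters of a book (parse), then download the images for each chapter (parse_chapter) and create a PDF of that chapter (create_pdf).

Here is my code (minimal working example):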

import logging
import os

import scrapy

# The methods below belong to the spider class (a scrapy.Spider subclass).

def parse(self, response):
    chapters = response.xpath('/html/body/div[1]/div/div[1]/div/div[4]/div/ul/li[1]/h5/a/@href')

    # One request per chapter link; each response is handled by parse_chapter.
    for chapter in chapters:
        yield scrapy.Request(chapter.get(), callback=self.parse_chapter)

def parse_chapter(self, response):
    logging.debug("parse_chapter")

    image_urls = response.xpath('/html/body/div[1]/div[3]/div/div[2]/div[2]/a/img/@src').get()

    # Hand the image URL to the images pipeline.
    yield {
        'image_urls': image_urls
    }

    self.create_pdf()
    # once the pdf is created => delete all the pictures

def create_pdf(self):
    files = os.listdir(os.getcwd() + '/tmp/')
    if len(files) > 0:
        ...

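I have also modified settings.py and pipelines.py.

The problem is: the function create_pdf is called before all the images have been downloaded. Is there a way to wait for the yield request to finish before executing create_pdf?

【Question discussion】:

Tags: python scrapy yield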

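【Solution 1】:

My guess is that you can use the same function as the callback, so there is no need to create parse_chapter; you can simply repeat its code under parse. Instead of creating the requests with scrapy.Request, you can use response.follow to follow the links from within parse.

Something like: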

def parse(self, response):
    chapters = response.xpath('/html/body/div[1]/div/div[1]/div/div[4]/div/ul/li[1]/h5/a/@href')

    # response.follow builds the request (resolving relative URLs); parse handles the response again.
    for chapter in chapters:
        yield response.follow(chapter.get(), callback=self.parse)

    image_urls = response.xpath('/html/body/div[1]/div[3]/div/div[2]/div[2]/a/img/@src').get()
    yield {
        'image_urls': image_urls
    }
    self.create_pdf()
    # once the pdf is created => delete all the pictures

def create_pdf(self):
    files = os.listdir(os.getcwd() + '/tmp/')
    if len(files) > 0:
        ...
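
One caveat: as far as I know, yielding an item from a callback does not block until the images pipeline has finished with it, so a create_pdf call placed right after the yield can still run too early. A possible alternative, sketched below, is to move the PDF step into the pipeline: Scrapy's ImagesPipeline calls item_completed once every image request for an item has finished. The names ChapterPdfPipeline, downloaded_paths and the create_pdf(image_paths) method are hypothetical and not from the question, and the PDF logic itself is left elided as in the original.

```python
try:
    from scrapy.pipelines.images import ImagesPipeline
except ImportError:
    # Fallback so the helper below can be tried without Scrapy installed.
    ImagesPipeline = object


def downloaded_paths(results):
    """Return the stored paths of the images that downloaded successfully.

    `results` is the list of (success, info) tuples Scrapy passes to
    item_completed(); for a successful download `info` is a dict that
    includes a 'path' key (relative to IMAGES_STORE).
    """
    return [info["path"] for ok, info in results if ok]


class ChapterPdfPipeline(ImagesPipeline):  # hypothetical name
    def item_completed(self, results, item, info):
        # Called once all image requests for this item have finished
        # (successfully or not), so the files are on disk at this point.
        paths = downloaded_paths(results)
        if paths:
            self.create_pdf(paths)
            # once the pdf is created => delete all the pictures
        return item

    def create_pdf(self, image_paths):  # hypothetical signature
        # The PDF creation from the question would go here.
        ...
```

To try this, the pipeline would have to be enabled in ITEM_PIPELINES in settings.py (under whatever module path your project uses) with IMAGES_STORE pointing at the download folder. Note that the stock ImagesPipeline expects image_urls to be a list, so the spider would likely need .getall() instead of .get().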

【Discussion】: