Scrapy yielding only last url item

【Question title】: Scrapy yielding only last url item
【Posted】: 2020-09-25 13:26:36
【Question】:

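I am writing a simple web scraper that goes through Amazon pages and collects book details. I use Selenium to get the JS-generated content. The spider iterates over a list of ASINs, but it only picks up the title and book info of the last ASIN and repeats that item as many times as there are ASINs. I don't understand why yield does not work for each URL. Here is the source code: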

import re

import scrapy
from scrapy import Selector
from selenium import webdriver


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['amazon.com']
    # list of ASINs to append to the URL
    list_url = ['B075QL36RW', 'B01ISNIKES', 'B06XG27KV2', 'B00IWGRPRK', 'B00NS42GFW', 'B0178USZ88', 'B00KWGOBQQ', 'B07FXXM638']

    def start_requests(self):
        self.driver = webdriver.Chrome('/path/to/chromedriver')
        for url in self.list_url:
            link = f'https://www.amazon.com/dp/{url}'
            self.driver.get(link)
            yield scrapy.Request(link, self.parse_book)


    def parse_book(self, response):
        sel = Selector(text=self.driver.page_source)

        title_raw = sel.xpath('//*[@id="productTitle"]/text()').extract()
        info_raw = sel.xpath('//*[@id="bookDescription_feature_div"]/noscript').extract()

        title = ' '.join(''.join(title_raw).split())
        info = ' '.join(''.join(info_raw).split())
        cleanr = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
        cleantext = re.sub(cleanr, '', info)

        yield {
            'title': title,
            'info': cleantext
        }

【Comments】:

Tags: python selenium-webdriver scrapy

【Answer 1】:

Here is how refactoring the code helped solve the problem:


import scrapy
from scrapy import Request, Selector
from selenium import webdriver


class BooksSpider(scrapy.Spider):
    name = 'books'
    list_of_urls = ['B075QL36RW', 'B01ISNIKES', 'B06XG27KV2', 'B00IWGRPRK', 'B00NS42GFW']
    asin = iter(list_of_urls)

    def start_requests(self):
        self.driver = webdriver.Chrome('path/to/driver')
        self.driver.get('https://amazon.com/dp/B06xt7gkb1')

        sel = Selector(text=self.driver.page_source)
        url = 'https://amazon.com/dp/B06xt7gkb1'
        yield Request(url, callback=self.parse_book)

        # A for loop stops cleanly when the iterator runs out; calling next()
        # inside `while True` would raise StopIteration inside this generator.
        for next_asin in self.asin:
            next_url = f'https://amazon.com/dp/{next_asin}'
            self.driver.get(next_url)
            sel = Selector(text=self.driver.page_source)
            yield Request(next_url, callback=self.parse_book)

    def parse_book(self, response):
        ...

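【Comments】:

【Answer 2】:

parse_book() uses yield, so it is a generator, and start_requests() is already a generator as well. So when the results of start_requests() are iterated, what you actually get is an iterable (a list, for example) of generators from parse_book. Those generators are not called or evaluated until something iterates over them. By the time they are finally evaluated, start_requests has probably already gone through all the books, that is, after the last iteration of its loop. At that point self.driver.page_source in parse_book holds the last page, so 'title' and 'cleantext' come from the last iteration and carry only the last book's values, which is what you see every time.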
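The late-evaluation effect described above can be reproduced in plain Python, without Scrapy or Selenium. The sketch below is only an illustration: FakeDriver, crawl and the short ASIN-like strings are hypothetical stand-ins, and collecting the generators in a list is an assumption that mimics the timing, not how Scrapy actually schedules callbacks.

```python
class FakeDriver:
    """Hypothetical stand-in for a single shared Selenium driver."""

    def __init__(self):
        self.page_source = None

    def get(self, url):
        # Pretend the page source is just a label for the URL that was loaded.
        self.page_source = f"page for {url}"


def parse_book(driver):
    # Generator function: this body does not run until the generator is iterated.
    yield {"title": driver.page_source}


def crawl(asins):
    driver = FakeDriver()
    pending = []
    for asin in asins:
        driver.get(asin)
        # Calling parse_book() only creates a generator object here.
        pending.append(parse_book(driver))
    # The generators are consumed only now, after the last driver.get().
    return [item for gen in pending for item in gen]


items = crawl(["A1", "B2", "C3"])
print(items)  # every item carries the page loaded last, "page for C3"
```

All three items show the same title because they all read the shared driver after it has moved on to the final page.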

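If you replace yield with return in parse_book(), then self.driver.page_source, 'title', 'cleantext' and so on will be evaluated with the values of the current loop iteration, and you will get different results.

【Comments】:

I tried using return, and it did not help. However, I managed to get it working by restructuring the code a little. This time start_requests() parses the URL, calls parse_books(), writes the data down, and moves on to the next URL.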
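The restructuring mentioned in this comment was not posted, so the following is only a guess at its general shape: load one page, parse it immediately while the driver still shows that page, then move on. FakeDriver, parse_book and scrape_in_order are hypothetical names used for illustration, and the stand-in driver replaces Selenium so the sketch runs on its own.

```python
class FakeDriver:
    """Hypothetical stand-in for a single shared Selenium driver."""

    def __init__(self):
        self.page_source = None

    def get(self, url):
        # Pretend the page source is just a label for the URL that was loaded.
        self.page_source = f"page for {url}"


def parse_book(page_source):
    # Receives the page text as an argument instead of reading shared state later.
    return {"title": page_source}


def scrape_in_order(driver, asins):
    for asin in asins:
        driver.get(f"https://www.amazon.com/dp/{asin}")
        # Parse right away, before the driver is pointed at the next URL.
        yield parse_book(driver.page_source)


items = list(scrape_in_order(FakeDriver(), ["A1", "B2", "C3"]))
for item in items:
    print(item["title"])
```

Because loading and parsing happen in the same loop step, each item reflects its own page. In a real spider the same idea would mean not letting a shared driver advance before the current page has been read.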