Scrapy crawl not extracting data: answer

【Question title】: Scrapy crawl not extracting data
【Posted】: 2020-02-29 01:46:54
【Question description】:

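I am trying to scrape reviews from BestBuy. The extraction works fine when I run the code line by line in the shell, but not when I run it as a script. What is wrong?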

class BestbuybotSpider(scrapy.Spider):
    name = 'bestbuybot'
    allowed_domains = ['https://www.bestbuy.com/site/amazon-echo-dot-3rd-gen-smart-speaker-with-alexa-charcoal/6287974.p?skuId=6287974']
    start_urls = ['http://https://www.bestbuy.com/site/amazon-echo-dot-3rd-gen-smart-speaker-with-alexa-charcoal/6287974.p?skuId=6287974/']


    def parse(self, response):
        # Extract the content using CSS selectors
        rating = response.css("div.c-ratings-reviews-v2.v-small p::text").extract()
        title = response.css(".review-title.c-section-title.heading-5.v-fw-medium ::text").extract()

        # Pair the extracted content row by row
        for item in zip(rating, title):
            # Create a dictionary to store the scraped info
            scraped_info = {
                'rating': item[0],
                'title': item[1],
            }

            # Yield the scraped info to Scrapy
            yield scraped_info

Console Image

【Question discussion】:

Tags: web-scraping scrapy scrapy-pipeline

【Solution 1】:

Your code has a couple of problems, namely:

allowed_domains should contain domains, not URLs.
The scheme of your start URL is malformed: the URL begins with 'http://https:'.
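
Both points can be checked with Python's standard urllib.parse module. The snippet below is only an illustrative sketch: the helper name domain_of is hypothetical and is not part of Scrapy, and the broken URL is rebuilt here by prefixing a second scheme, as in the question.

```python
from urllib.parse import urlsplit

PRODUCT_URL = (
    "https://www.bestbuy.com/site/"
    "amazon-echo-dot-3rd-gen-smart-speaker-with-alexa-charcoal/"
    "6287974.p?skuId=6287974"
)


def domain_of(url):
    """Return the host part of a URL (hypothetical helper, not a Scrapy API)."""
    return urlsplit(url).hostname


# Same mistake as in the question: two schemes in a row.
# The host is parsed as "https", so the request never goes to bestbuy.com.
broken_host = domain_of("http://" + PRODUCT_URL)

# With a single scheme the host is the real site. allowed_domains expects
# a bare host like this one (or simply "bestbuy.com"), never a full URL.
good_host = domain_of(PRODUCT_URL)

print(broken_host)  # https
print(good_host)    # www.bestbuy.com
```

A host named "https" is what the doubled scheme in the start URL produces, so the request is sent somewhere other than Best Buy.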

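As the console output in your image shows, the Scrapy spider is redirected to finder.cox.net, so it never reaches the product page. What it gets instead is a country selection page, which is itself a redirect.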

First fix your start URL so that it is the exact page location. With that change the spider appears to work.

【Discussion】: