【Question title】: Scrapy can't crawl all pages
【Posted】: 2020-02-05 10:15:27
【Question】:

I am trying to scrape an e-commerce page with Scrapy. The code looks like this:

class HugobossSpider(scrapy.Spider):
    name = 'hugoboss'
    allowed_domains = ['hugoboss.com/de/herren-schuhe/?sz=60&start=0']
    start_urls = ['https://hugoboss.com/de/herren-schuhe/?sz=60&start=0']

    def parse(self, response):
    # The main method of the spider. It scrapes the URL(s) specified in the
    # 'start_url' argument above. The content of the scraped URL is passed on
    # as the 'response' object.

        nextpageurl = response.xpath("//a[@title='Weiter']/@href")

        for item in self.scrape(response):
            yield item

        if nextpageurl:
            path = nextpageurl.extract_first()
            nextpage = response.urljoin(path)
            print("Found url: {}".format(nextpage))
            yield Request(nextpage, callback=self.parse)

    def parse(self, response):
    #Extracting the content using css selectors
        url = response.xpath('//div/@data-mouseoverimage').extract()
        product_title = response.xpath('//*[@class="product-tile__productInfoWrapper product-tile__productInfoWrapper--is-small font__subline"]/text()').extract()
        price = response.css('.product-tile__offer .price-sales::text').getall()
    #Give the extracted content row wise
        for item in zip(url,product_title,price):
        #create a dictionary to store the scraped info
            item = {
              'URL' : item[0],
              'Product Name' : item[1].replace("\n", '').replace("von", ""),
              'Price' : item[2]
            }

        #yield or give the scraped info to scrapy
            yield item

The problem is that the code extracts the information from the current page but not from the remaining pages. Can anyone help?

【Comments】:

    Tags: python web-scraping scrapy


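    【Answer 1】:

    You have defined the function def parse() twice, so the second definition replaces the first and the pagination code never runs. Rename the second one (perhaps def extract(); the first method calls self.scrape(response), so the new name has to match that call) and try again.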
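The effect of the duplicate definition can be reproduced without Scrapy. The following is a minimal sketch under stated assumptions: BrokenSpider, FixedSpider, the dict-shaped page, and the ("follow", ...) tuple are hypothetical stand-ins for the real spider, the Scrapy response, and the yielded Request. It only shows that Python keeps the last method bound to a name inside a class body.

```python
class BrokenSpider:
    """Hypothetical stand-in for the spider in the question (no Scrapy needed)."""

    def parse(self, response):
        # Intended entry point: emit the items, then follow the next page.
        yield from self.scrape(response)
        if response.get("next"):
            yield ("follow", response["next"])

    def parse(self, response):  # same name: silently replaces the method above
        for product in response["products"]:
            yield {"Product Name": product}


class FixedSpider:
    """Same logic, with the second method renamed to match the call in parse()."""

    def parse(self, response):
        yield from self.scrape(response)
        if response.get("next"):
            yield ("follow", response["next"])

    def scrape(self, response):
        for product in response["products"]:
            yield {"Product Name": product}


# A plain dict plays the role of the response object (an assumption for the demo).
page = {"products": ["Sneaker", "Loafer"], "next": "?sz=60&start=60"}

broken_output = list(BrokenSpider().parse(page))
fixed_output = list(FixedSpider().parse(page))

print(broken_output)  # items only; the pagination step never runs
print(fixed_output)   # items followed by the ("follow", ...) marker
```

In the real spider the same rename applies: the second parse becomes scrape, and the first parse keeps yielding the Request for the next page. Two further points may be worth checking, although the answer does not mention them: the snippet uses Request without showing an import such as from scrapy import Request, and allowed_domains is meant to hold bare domains such as hugoboss.com, so an entry that includes a path and query string may cause follow-up requests to be filtered as off-site.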
