[Title]: Python Scrapy keep getting same page link from next page button
[Posted]: 2017-10-02 08:55:59
[Question]:

I'm trying to scrape links to products with more than 800 reviews on amazon.com, but I keep getting the same page link from the next-page button: it keeps returning page 2, when I should be getting pages 3, 4, and so on.

I've set up an IF condition that converts a review-count string like 1,020 into an integer, then compares it against 800 to decide whether to visit the product page.
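That conversion step can be isolated into a small helper for clarity — a minimal sketch; `parse_review_count` is a hypothetical name, not part of the spider:

```python
def parse_review_count(text):
    # Strip thousands separators ("1,020" -> "1020") and convert to int,
    # mirroring the spider's ''.join(text.split(',')) idiom.
    return int(text.replace(',', ''))

print(parse_review_count('1,020'))  # 1020
print(parse_review_count('1,020') >= 800)  # True
```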

Here is the code:

# -*- coding: utf-8 -*-
import scrapy
from amazon.items import AmazonItem
from urlparse import urljoin


class AmazonspiderSpider(scrapy.Spider):
    name = "amazonspider"
    DOWNLOAD_DELAY = 1
    start_urls = ['https://www.amazon.com/s/ref=lp_165993011_nr_n_0?fst=as%3Aoff&rh=n%3A165793011%2Cn%3A%21165795011%2Cn%3A165993011%2Cn%3A2514571011&bbn=165993011&ie=UTF8&qid=1493778423&rnid=165993011']


    def parse(self, response):


        SET_SELECTOR = '.a-carousel-card.acswidget-carousel__card'
        for attr in response.css(SET_SELECTOR):
            #print '\n\n', attr

            item = AmazonItem()

            review_selector = './/*[@class="acs_product-rating__review-count"]/text()'
            link_selector = './/*[@class="a-link-normal"]/@href'

            if attr.xpath(review_selector).extract_first():
                if int(''.join(attr.xpath(review_selector).extract_first().split(','))) >= 800:
                    url = urljoin(response.url, attr.xpath(link_selector).extract_first())
                    item['LINKS'] = url
                    if url:
                        yield scrapy.Request(url, callback=self.parse_link, meta={'item': item})  


            next_page = './/span[@class="pagnRA"]/a[@id="pagnNextLink"]/@href'
            next_page = response.xpath(next_page).extract_first()
            print '\n\n', urljoin(response.url, next_page)
            if next_page:
                yield scrapy.Request(
                    urljoin(response.url, next_page),
                    callback=self.parse
                )
    def parse_link(self, response):

        item = AmazonItem(response.meta['item'])

        catselector = '.cat-link ::text'
        defaultcatselector = '.nav-search-label ::text'
        cat = response.css(catselector).extract_first()
        if cat:
            item['CATAGORY'] = cat
        else:
            item['CATAGORY'] = response.css(defaultcatselector).extract_first()
        return item

Here is the output when I print the next-page link just before recursively calling the parse function.

And here is a screenshot of the page's next-page selector. Where am I going wrong?

[Question comments]:

    Tags: python html xpath css-selectors scrapy


    [Solution 1]:

    Move the next-page code block outside the for loop.

    class AmazonspiderSpider(scrapy.Spider):
        name = "amazonspider"
        DOWNLOAD_DELAY = 1
        start_urls = ['https://www.amazon.com/s/ref=lp_165993011_nr_n_0?fst=as%3Aoff&rh=n%3A165793011%2Cn%3A%21165795011%2Cn%3A165993011%2Cn%3A2514571011&bbn=165993011&ie=UTF8&qid=1493778423&rnid=165993011']

        def parse(self, response):
            SET_SELECTOR = '.a-carousel-card.acswidget-carousel__card'
            for attr in response.css(SET_SELECTOR):
                review_selector = './/*[@class="acs_product-rating__review-count"]/text()'
                link_selector = './/*[@class="a-link-normal"]/@href'

                if attr.xpath(review_selector).extract_first():
                    if int(''.join(attr.xpath(review_selector).extract_first().split(','))) >= 800:
                        url = urljoin(response.url, attr.xpath(link_selector).extract_first())
                        item = AmazonItem()
                        item['LINKS'] = url
                        if url:
                            yield scrapy.Request(url, callback=self.parse_link, meta={'item': item})

            # next-page block now sits OUTSIDE the for loop, at the same
            # indentation level, so it is yielded once per page
            next_page = './/span[@class="pagnRA"]/a[@id="pagnNextLink"]/@href'
            next_page = response.xpath(next_page).extract_first()
            if next_page:
                yield scrapy.Request(
                    urljoin(response.url, next_page),
                    callback=self.parse
                )
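    The `urljoin` call is what turns the relative `@href` from `pagnNextLink` into an absolute URL. A quick illustration with made-up hrefs (shown with Python 3's `urllib.parse`; the question's `from urlparse import urljoin` is the Python 2 equivalent):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = 'https://www.amazon.com/s/ref=lp_165993011_nr_n_0'
# A root-relative href replaces the base path entirely:
print(urljoin(base, '/s?page=2'))    # https://www.amazon.com/s?page=2
# A relative href resolves against the base URL's directory:
print(urljoin(base, 'ref=sr_pg_2'))  # https://www.amazon.com/s/ref=sr_pg_2
```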
    

    [Discussion]:
