【Title】: Scraping different product information using scrapy
【Posted】: 2016-11-15 23:03:22
【Question】:

Below is the code I use to scrape product information. There are many products on one page; I scrape them all and then move on to the next page. The problem is that scrapy only picks up the first product on the page instead of iterating over all the products on it. Where am I going wrong?

import re
import time
import sys
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
import parsedatetime
from datetime import datetime
from airline_sentiment.items import *
from airline_sentiment.spiders.crawlerhelper import *

class TripAdvisorRestaurantBaseSpider(BaseSpider):
    name = "shoebuy"

    allowed_domains = ["shoebuy.com"]
    base_uri = "http://www.shoebuy.com"
    start_urls = [
                 base_uri + "/womens-leather-boots/category_2493?cm_sp=cat-_-d_womensboots_tiles_b1_leather-_-092216"
                 ]


    def parse(self, response):

        sel = Selector(response)

        snode_airline = sel.xpath('//*[starts-with(@class, "pt_grid")]/div[starts-with(@class, "pt_product")]')

        for snode_restaurant in snode_airline:
            tripadvisor_item =  AirlineSentimentItem()

            tripadvisor_item['url'] = self.base_uri + clean_parsed_string(get_parsed_string(snode_restaurant, '//div[starts-with(@class, "pt_info")]/a/@href'))

            tripadvisor_item['name'] = clean_parsed_string(get_parsed_string(snode_restaurant, '//div[starts-with(@class, "pt_info")]/a/span[@class="pt_title"]/text()'))
            tripadvisor_item['price'] = clean_parsed_string(get_parsed_string(snode_restaurant, '//div[starts-with(@class, "pt_prices")]/span[@class="pt_price"]/text()'))
            tripadvisor_item['discount'] = clean_parsed_string(get_parsed_string(snode_restaurant, '//div[starts-with(@class, "pt_prices")]/div[@class="pt_discount"]/span[@class="pt_percent_off"]/text()'))
            tripadvisor_item['orig_price'] = clean_parsed_string(get_parsed_string(snode_restaurant, '//div[starts-with(@class, "pt_prices")]/div[@class="pt_discount"]/span[@class="pt_price_orig"]/text()'))
            tripadvisor_item['stars'] = clean_parsed_string(get_parsed_string(snode_restaurant, '//*[@class="bv-rating-ratio"]/span/span[3]/text()'))
            tripadvisor_item['reviews'] = clean_parsed_string(get_parsed_string(snode_restaurant, '//div[starts-with(@class, "bv-inline-rating-container")]/dl/dd[2]/span/text()'))

            yield Request(url=tripadvisor_item['url'], meta={'tripadvisor_item': tripadvisor_item}, callback=self.parse_fetch_review)


        next_page_url = clean_parsed_string(get_parsed_string(sel, '//div[@class="paging"]/a[@class="next"]/@href'))
        if next_page_url and len(next_page_url) > 0:
            yield Request(url=self.base_uri + next_page_url, meta={'tripadvisor_item': tripadvisor_item}, callback=self.parse_next_page)

    def parse_next_page(self, response):
        sel = Selector(response)

        snode_airline = sel.xpath('//*[starts-with(@class, "pt_grid")]/div[starts-with(@class, "pt_product")]')

        for snode_restaurant in snode_airline:

            tripadvisor_item =  AirlineSentimentItem()

            tripadvisor_item['url'] = self.base_uri + clean_parsed_string(get_parsed_string(snode_restaurant, '//div[starts-with(@class, "pt_info")]/a/@href'))
            tripadvisor_item['name'] = clean_parsed_string(get_parsed_string(snode_restaurant, '//div[starts-with(@class, "pt_info")]/a/span[@class="pt_title"]/text()'))
            tripadvisor_item['price'] = clean_parsed_string(get_parsed_string(snode_restaurant, '//div[starts-with(@class, "pt_prices")]/span[@class="pt_price"]/text()'))
            tripadvisor_item['discount'] = clean_parsed_string(get_parsed_string(snode_restaurant, '//div[starts-with(@class, "pt_prices")]/div[@class="pt_discount"]/span[@class="pt_percent_off"]/text()'))
            tripadvisor_item['orig_price'] = clean_parsed_string(get_parsed_string(snode_restaurant, '//div[starts-with(@class, "pt_prices")]/div[@class="pt_discount"]/span[@class="pt_price_orig"]/text()'))
            tripadvisor_item['stars'] = clean_parsed_string(get_parsed_string(snode_restaurant, '//*[@class="bv-rating-ratio"]/span/span[3]/text()'))
            tripadvisor_item['reviews'] = clean_parsed_string(get_parsed_string(snode_restaurant, '//div[starts-with(@class, "bv-inline-rating-container")]/dl/dd[2]/span/text()'))

            yield Request(url=tripadvisor_item['url'], meta={'tripadvisor_item': tripadvisor_item}, callback=self.parse_fetch_review)

        next_page_url = clean_parsed_string(get_parsed_string(sel, '//div[@class="paging"]/a[@class="next"]/@href'))
        if next_page_url and len(next_page_url) > 0:
            yield Request(url=self.base_uri + next_page_url, meta={'tripadvisor_item': tripadvisor_item}, callback=self.parse_next_page)

    def parse_fetch_review(self, response):

        tripadvisor_item = response.meta['tripadvisor_item']
        sel = Selector(response)

        snode_reviews = sel.xpath('//*[starts-with(@class, "product_info_wrapper")]')

        for snode_review in snode_reviews:

            tripadvisor_item['img'] = self.base_uri + clean_parsed_string(get_parsed_string(snode_review, '//div[starts-with(@class,"large_thumb")]/img/@src'))

            tripadvisor_item['desc'] = clean_parsed_string(get_parsed_string(snode_review, '//*[starts-with(@class,"product_information")]/div[1]/span/text()'))

            tripadvisor_item['brand'] = clean_parsed_string(get_parsed_string(snode_review, '//div[starts-with(@class,"seo_module")]/h3/text()'))

        yield tripadvisor_item

【Comments】:

    Tags: python web-scraping scrapy


    【Solution 1】:

    This is the offending line:

            tripadvisor_item['url'] = self.base_uri + clean_parsed_string(get_parsed_string(snode_restaurant, '//div[starts-with(@class, "pt_info")]/a/@href'))
    

    The xpath should start with a . (as in .//div) to make it relative to the current node:

    './/div[starts-with(@class, "pt_info")]/a/@href'
    

    Since you did not make the xpath relative to your node (using the '.' notation), you always get the first product link on the page as the url for every item. Scrapy also has an automatic duplicate-url filter, so all of your subsequent requests to fetch the reviews get filtered out, and you end up with only the first item.

    Tl;dr: just add a . before the // in your relative xpaths.
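The difference can be seen in a small standalone sketch (using lxml, the library Scrapy's selectors are built on, with a made-up two-product HTML snippet): an expression starting with // searches the whole document no matter which node you call it on, while .// stays inside the current node.

```python
from lxml import etree

# Minimal HTML mimicking the page structure from the question.
html = etree.HTML("""
<div class="pt_grid">
  <div class="pt_product"><div class="pt_info"><a href="/boot-a">A</a></div></div>
  <div class="pt_product"><div class="pt_info"><a href="/boot-b">B</a></div></div>
</div>
""")

products = html.xpath('//div[starts-with(@class, "pt_product")]')

# "//" ignores the context node and always matches from the document root,
# so every product yields the first product's link.
absolute = [p.xpath('//div[@class="pt_info"]/a/@href')[0] for p in products]

# ".//" searches only within each product node, giving one link per product.
relative = [p.xpath('.//div[@class="pt_info"]/a/@href')[0] for p in products]

print(absolute)  # ['/boot-a', '/boot-a']
print(relative)  # ['/boot-a', '/boot-b']
```

This is exactly why every item in the question's loop got the same url, which the duplicate filter then collapsed to a single request.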

    【Discussion】:

    • Works great, thanks. Also, I am not getting the stars and reviews values. I am not sure why the xpaths I wrote for them don't work. It would be great to get a solution for that too.
    • @NeelShah That happens because the stars and reviews are generated by javascript calls (ajax), and scrapy does not execute any javascript. You should probably open a new question for that, since it is unrelated to the current one.