【Question Title】:Scraping multiple pages with Scrapy
【Posted】:2014-05-27 19:46:03
【Question】:

I am trying to use Scrapy to scrape a website whose information spans multiple pages.

My code is:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from tcgplayer1.items import Tcgplayer1Item


class MySpider(BaseSpider):
    name = "tcg"
    allowed_domains = ["http://www.tcgplayer.com/"]
    start_urls = ["http://store.tcgplayer.com/magic/journey-into-nyx?PageNumber=1"]

    def parse(self, response):
        hxs = Selector(response)
        titles = hxs.xpath("//div[@class='magicCard']")
        for title in titles:
            item = Tcgplayer1Item()
            item["cardname"] = title.xpath(".//li[@class='cardName']/a/text()").extract()[0]

            vendor = title.xpath(".//tr[@class='vendor ']")
            item["price"] = vendor.xpath("normalize-space(.//td[@class='price']/text())").extract()
            item["quantity"] = vendor.xpath("normalize-space(.//td[@class='quantity']/text())").extract()
            item["shipping"] = vendor.xpath("normalize-space(.//span[@class='shippingAmount']/text())").extract()
            item["condition"] = vendor.xpath("normalize-space(.//td[@class='condition']/a/text())").extract()
            item["vendors"] = vendor.xpath("normalize-space(.//td[@class='seller']/a/text())").extract()
            yield item

I am trying to scrape every page until the last one is reached. The number of pages varies from set to set, so it is hard to know in advance where the page numbers end.

【Question Comments】:

    Tags: python web-scraping scrapy scrapy-spider


    【Solution 1】:

    The idea is to keep incrementing pageNumber until no titles are found. If a page has no titles, raise a CloseSpider exception to stop the spider:

    from scrapy.spider import BaseSpider
    from scrapy.selector import Selector
    from scrapy.exceptions import CloseSpider
    from scrapy.http import Request
    from tcgplayer1.items import Tcgplayer1Item


    URL = "http://store.tcgplayer.com/magic/journey-into-nyx?pageNumber=%d"

    class MySpider(BaseSpider):
        name = "tcg"
        allowed_domains = ["tcgplayer.com"]  # domain only, no scheme
        start_urls = [URL % 1]

        def __init__(self, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
            self.page_number = 1

        def parse(self, response):
            print(self.page_number)
            print("----------")

            sel = Selector(response)
            titles = sel.xpath("//div[@class='magicCard']")
            if not titles:
                # No cards on this page - we are past the last page
                raise CloseSpider('No more pages')

            for title in titles:
                item = Tcgplayer1Item()
                item["cardname"] = title.xpath(".//li[@class='cardName']/a/text()").extract()[0]

                vendor = title.xpath(".//tr[@class='vendor ']")
                item["price"] = vendor.xpath("normalize-space(.//td[@class='price']/text())").extract()
                item["quantity"] = vendor.xpath("normalize-space(.//td[@class='quantity']/text())").extract()
                item["shipping"] = vendor.xpath("normalize-space(.//span[@class='shippingAmount']/text())").extract()
                item["condition"] = vendor.xpath("normalize-space(.//td[@class='condition']/a/text())").extract()
                item["vendors"] = vendor.xpath("normalize-space(.//td[@class='seller']/a/text())").extract()
                yield item

            # Request the next page; the spider stops when a page has no cards
            self.page_number += 1
            yield Request(URL % self.page_number)
    

    This particular spider goes through all 8 pages of the data, then stops.

    Hope that helps.
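    The stop-on-empty-page pattern the spider uses can be sketched without Scrapy at all. Below is a minimal, framework-free sketch; `fetch_page` is a hypothetical stand-in for the real HTTP request plus the `//div[@class='magicCard']` extraction, and the sample card names are made up for illustration:

```python
def fetch_page(page_number, pages):
    """Return the list of card names on the given page (empty past the end)."""
    # `pages` simulates the site: one list of card names per page.
    if 1 <= page_number <= len(pages):
        return pages[page_number - 1]
    return []

def scrape_all(pages):
    """Collect card names page by page, stopping at the first empty page."""
    items = []
    page_number = 1
    while True:
        titles = fetch_page(page_number, pages)
        if not titles:  # mirrors `raise CloseSpider('No more pages')`
            break
        items.extend(titles)
        page_number += 1
    return items

# Example: three pages of cards; page 4 comes back empty, so scraping stops.
site = [["Ajani's Presence", "Dawnbringer Charioteers"],
        ["Eidolon of Blossoms"],
        ["Hall of Triumph"]]
print(scrape_all(site))  # all four card names, in page order
```

    The key property, same as in the spider, is that the loop never needs to know the page count up front: it simply requests the next page until one comes back with no cards.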

    【Discussion】:
