【Title】: Python Scrapy nested pages — only need items from the innermost page
【Posted】: 2017-05-03 07:59:37
【Description】:

I am practising Scrapy on a site with nested pages, and I only need to scrape the content of the innermost page. Is there a way to carry data from the outer parse functions down to the innermost one — that is, open the pages with several parse callbacks, take the item only from the last callback, and carry it back up to the main parse function?

Here is what I tried:

try:
    import scrapy
    from urlparse import urljoin

except ImportError:
    print "\nERROR IMPORTING THE NECESSARY LIBRARIES\n"



class CanadaSpider(scrapy.Spider):
    name = 'CananaSpider'
    start_urls = ['http://www.canada411.ca']


    #PAGE 1 OF THE NESTED WEBSITE GETTING LINK AND JOING WITH THE MAIN LINK AND VISITING THE PAGE
    def parse(self, response):
        SET_SELECTOR = '.c411AlphaLinks.c411NoPrint ul li'
        for PHONE in response.css(SET_SELECTOR):
            selector = 'a ::attr(href)'
            try:
                momo = urljoin('http://www.canada411.ca', PHONE.css(selector).extract_first())

                #PASSING A DICTIONARY AS THE ITEM
                pre  = {}
                post = scrapy.Request(momo, callback=self.parse_pre1, meta={'item': pre})
                yield pre
            except:
                pass   

#PAGE 2 OF THE NESTED WEBSITE


    def parse_pre1(self, response):

        #RETURNING THE SAME ITEM 
        item = response.meta["item"]
        SET_SELECTOR = '.clearfix.c411Column.c411Column3 ul li'

        for PHONE in response.css(SET_SELECTOR):
            selector = 'a ::attr(href)'
            momo = urljoin('http://www.canada411.ca', PHONE.css(selector).extract_first())
            pre = scrapy.Request(momo, callback=self.parse_pre1, meta={'page_2': item})
            yield pre

    def parse_info(self, response):

        #HERE I AM SCRAPING THE DATA
        item = response.meta["page_2"]
        name = '.vcard__name'
        address = '.c411Address.vcard__address'
        ph = '.vcard.label'

        item['name'] = response.css(name).extract_first()
        item['address'] = response.css(address).extract_first()
        item['phoneno'] = response.css(ph).extract_first()
        return item 

I am carrying the item over between callbacks — what am I doing wrong?
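As an aside, `from urlparse import urljoin` only works on Python 2 (as does the `print` statement in the `except` branch); on Python 3 the same function lives in `urllib.parse`. A small compatibility sketch — the example path is hypothetical, not taken from the site:

```python
# urljoin moved to urllib.parse in Python 3; this shim imports it on either version
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

# A relative href extracted from a page resolves against the site root
full = urljoin('http://www.canada411.ca', '/search/si/1/a/b')
print(full)  # http://www.canada411.ca/search/si/1/a/b
```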

【Comments】:

  • Can you elaborate? I am new to Python.
  • Shouldn't the callback in your parse_pre1 function be callback=self.parse_info rather than callback=self.parse_pre1?

【Tags】: python scrapy web-crawler scrapy-spider


【Solution 1】:

In `parse`, you yield `pre` instead of the Request stored in `post`. You should also use a `scrapy.Item` class rather than a plain dict.

  def parse(self, response):
        SET_SELECTOR = '.c411AlphaLinks.c411NoPrint ul li'
        for PHONE in response.css(SET_SELECTOR):
            selector = 'a ::attr(href)'
            try:
                momo = urljoin('http://www.canada411.ca', PHONE.css(selector).extract_first())

                #PASSING A DICTIONARY AS THE ITEM
                pre  = {}  # This should be an instance of scrapy.Item
                post = scrapy.Request(momo, callback=self.parse_pre1, meta={'item': pre})
                yield post
            except:
                pass   

And in `parse_pre1` you again set the callback to `parse_pre1`; I think you meant `parse_info`:

def parse_pre1(self, response):

    #RETURNING THE SAME ITEM 
    item = response.meta["item"]
    SET_SELECTOR = '.clearfix.c411Column.c411Column3 ul li'

    for PHONE in response.css(SET_SELECTOR):
        selector = 'a ::attr(href)'
        momo = urljoin('http://www.canada411.ca', PHONE.css(selector).extract_first())
        pre = scrapy.Request(momo, callback=self.parse_info, meta={'page_2': item})
        yield pre
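Putting the two fixes together, the control flow is: the outer callbacks only yield Requests that carry the (so far empty) item in `meta`, and only the innermost callback fills in fields and yields the item. A plain-Python simulation of that hand-off, with `FakeRequest`/`FakeResponse` as hypothetical stand-ins for `scrapy.Request`/`Response` so it runs without any networking:

```python
# Minimal offline simulation of passing an item through meta across nested callbacks.
# FakeRequest/FakeResponse are hypothetical stand-ins for scrapy.Request/Response.

class FakeRequest:
    def __init__(self, url, callback, meta):
        self.url, self.callback, self.meta = url, callback, meta

class FakeResponse:
    def __init__(self, meta):
        self.meta = meta

def parse(response):
    item = {}  # created empty at the outermost level
    yield FakeRequest('http://www.canada411.ca/a', parse_pre1, {'item': item})

def parse_pre1(response):
    item = response.meta['item']  # same dict object, handed along unchanged
    yield FakeRequest('http://www.canada411.ca/b', parse_info, {'page_2': item})

def parse_info(response):
    item = response.meta['page_2']  # only the innermost callback fills fields
    item['name'] = 'Example Name'
    yield item

# Drive the chain by hand: follow each Request's callback with its meta.
req1 = next(parse(FakeResponse({})))
req2 = next(req1.callback(FakeResponse(req1.meta)))
result = next(req2.callback(FakeResponse(req2.meta)))
print(result)  # {'name': 'Example Name'}
```

In real Scrapy the engine performs the "drive the chain by hand" step: it downloads each yielded Request and invokes its callback with a Response whose `meta` is the one attached to the Request.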

【Discussion】:
