【Question Title】: Scrapy: Custom Callbacks Do Not Work
【Posted】: 2016-07-24 18:04:10
【Question Description】:

I have no idea why my spider isn't working! I am by NO means a programmer, so please be kind! haha

Background: I am trying to use Scrapy to scrape information about books found on Indigo.

Problem: My code never enters any of my custom callbacks... it only seems to work when I use "parse" as the callback.

If I change the callback in the "rules" section of my code from "parse_books" to "parse", my method that lists all the links works fine and prints every link I am interested in. However, the callback inside that method (which points to "parse_books") is never invoked! Strangely enough, if I rename the "parse" method to something else (e.g. "testmethod") and then rename the "parse_books" method to "parse", the method that scrapes the information into items works perfectly!
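(Side note for readers: the Scrapy documentation warns that CrawlSpider implements its rule-following logic in the parse method itself, so a CrawlSpider subclass must not override parse; rule callbacks need any other name. That matches the behaviour described above. A minimal sketch of the documented pattern, with hypothetical names:)

# Minimal CrawlSpider sketch (hypothetical names) following the documented
# pattern: the rule callback gets a custom name, and parse() is left alone
# because CrawlSpider uses it internally to drive the rules.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ["https://www.example.com/"]

    rules = (
        # Follow every link; pages the extractor matches go to parse_item.
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):  # NOT named "parse"
        yield {
            "url": response.url,
            "title": response.xpath("//title/text()").extract_first(),
        }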

What I am trying to achieve: All I want to do is go to a page, say "bestsellers", navigate to the corresponding item-level page for every item, and scrape all of the book-related information. I seem to have both pieces working independently :/

The code!

import scrapy
import json
import urllib
from scrapy.http import Request
from urllib import urlencode
import re
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import urlparse



from TEST20160709.items import IndigoItem
from TEST20160709.items import SecondaryItem



item = IndigoItem()
scrapedItem = SecondaryItem()

class IndigoSpider(CrawlSpider):

    protocol='https://'
    name = "site"
    allowed_domains = [
    "chapters.indigo.ca/en-ca/Books",
    "chapters.indigo.ca/en-ca/Store/Availability/"
    ]

    start_urls = [
         'https://www.chapters.indigo.ca/en-ca/books/bestsellers/',
    ]

    #extractor = SgmlLinkExtractor()

    rules = (
        Rule(LinkExtractor(), follow=True),
        Rule(LinkExtractor(), callback="parse_books", follow=True),
    )



    def getInventory(self, bookID):
        params = {
            'pid': bookID,
            'catalog': 'books'
        }
        yield Request(
            url="https://www.chapters.indigo.ca/en-ca/Store/Availability/?" + urlencode(params),
            dont_filter=True,
            callback=self.parseInventory
        )



    def parseInventory(self, response):
        dataInventory = json.loads(response.body)

        for entry in dataInventory['Data']:
            scrapedItem['storeID'] = entry['ID']
            scrapedItem['storeType'] = entry['StoreType']
            scrapedItem['storeName'] = entry['Name']
            scrapedItem['storeAddress'] = entry['Address']
            scrapedItem['storeCity'] = entry['City']
            scrapedItem['storePostalCode'] = entry['PostalCode']
            scrapedItem['storeProvince'] = entry['Province']
            scrapedItem['storePhone'] = entry['Phone']
            scrapedItem['storeQuantity'] = entry['QTY']
            scrapedItem['storeQuantityMessage'] = entry['QTYMsg']
            scrapedItem['storeHours'] = entry['StoreHours']
            scrapedItem['storeStockAvailibility'] = entry['HasRetailStock']
            scrapedItem['storeExclusivity'] = entry['InStoreExlusive']

            yield scrapedItem



    def parse(self, response):
        #GET ALL PAGE LINKS
        all_page_links = response.xpath('//ul/li/a/@href').extract()
        for relative_link in all_page_links:
            absolute_link = urlparse.urljoin(self.protocol+"www.chapters.indigo.ca",relative_link.strip())
            absolute_link = absolute_link.split("?ref=",1)[0]
            print "FULL link: " + absolute_link

            yield Request(absolute_link, callback=self.parse_books)





    def parse_books(self, response):

        for sel in response.xpath('//form[@id="aspnetForm"]/main[@id="main"]'):
            #XML/HTTP/CSS ITEMS
            item['title']= map(unicode.strip, sel.xpath('div[@class="content-wrapper"]/div[@class="product-details"]/div[@class="col-2"]/section[@id="ProductDetails"][@class][@role][@aria-labelledby]/h1[@id="product-title"][@class][@data-auto-id]/text()').extract())
            item['authors']= map(unicode.strip, sel.xpath('div[@class="content-wrapper"]/div[@class="product-details"]/div[@class="col-2"]/section[@id="ProductDetails"][@class][@role][@aria-labelledby]/h2[@class="major-contributor"]/a[contains(@class, "byLink")][@href]/text()').extract())
            item['productSpecs']= map(unicode.strip, sel.xpath('div[@class="content-wrapper"]/div[@class="product-details"]/div[@class="col-2"]/section[@id="ProductDetails"][@class][@role][@aria-labelledby]/p[@class="product-specs"]/text()').extract())
            item['instoreAvailability']= map(unicode.strip, sel.xpath('//span[@class="stockAvailable-mesg negative"][@data-auto-id]/text()').extract())
            item['onlinePrice']= map(unicode.strip, sel.xpath('//span[@id][@class="nonmemberprice__specialprice"]/text()').extract())
            item['listPrice']= map(unicode.strip, sel.xpath('//del/text()').extract())

            aboutBookTemp = map(unicode.strip, sel.xpath('//div[@class="read-more"]/p/text()').extract())
            item['aboutBook']= [aboutBookTemp]

            #Retrieve ISBN Identifier and extract numeric data
            ISBN_parse = map(unicode.strip, sel.xpath('(//div[@class="isbn-info"]/p[2])[1]/text()').extract())
            item['ISBN13']= [elem[11:] for elem in ISBN_parse]
            bookIdentifier = str(item['ISBN13'])
            bookIdentifier = re.sub("[^0-9]", "", bookIdentifier)


            print "THIS IS THE IDENTIFIER:" + bookIdentifier

            if bookIdentifier:
                yield self.getInventory(str(bookIdentifier))

            yield item

【Question Comments】:

  • Your methods don't appear to be inside the class. Can you format the code?

Tags: python callback scrapy web-crawler


【Solution 1】:

The first problem I notice is that your allowed_domains class attribute is broken. It should contain domains (hence the name): bare domain names with no scheme and no path, since Scrapy's offsite filtering compares each request's hostname against this list.

The correct value in your case is:

allowed_domains = [
    "chapters.indigo.ca",  # subdomain.domain.top_level_domain
]

If you check your spider logs, you will see:

DEBUG: Filtered offsite request to 'www.chapters.indigo.ca'

This should not be happening.
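(To make the mechanism concrete: the offsite check looks only at the request's hostname, never the full URL, so an allowed_domains entry that includes a path can never match. A rough illustration of the idea; the real middleware uses a compiled regex, but the effect is the same:)

# Rough sketch of the offsite check: only the hostname is compared, so an
# allowed_domains entry that contains a path can never match anything.
import urlparse  # Python 2, matching the question's code

host = urlparse.urlparse("https://www.chapters.indigo.ca/en-ca/books/bestsellers/").hostname
print host                                             # www.chapters.indigo.ca
print host.endswith("chapters.indigo.ca")              # True  -> request allowed
print host.endswith("chapters.indigo.ca/en-ca/Books")  # False -> filtered offsite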

【Discussion】:

  • Thank you! It seems to be working! The "parseInventory" method still doesn't seem to be triggered (see the sketch below), but you have definitely saved the day. Thank you so much!
  • No problem. If you find it solved your problem, feel free to accept the answer :)
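(On the remaining parseInventory issue mentioned in the comment above, a likely cause: getInventory contains a yield, so calling it returns a generator object, and parse_books then yields that generator rather than the Request inside it. A sketch of two possible fixes, assuming the rest of the spider stays as posted:)

# Option 1: inside parse_books, iterate over the helper so the Request
# itself is handed to Scrapy rather than the generator object.
if bookIdentifier:
    for request in self.getInventory(bookIdentifier):
        yield request

# Option 2: make the helper return a single Request instead of yielding one,
# so the original "yield self.getInventory(bookIdentifier)" yields a Request.
def getInventory(self, bookID):
    params = {'pid': bookID, 'catalog': 'books'}
    return Request(
        url="https://www.chapters.indigo.ca/en-ca/Store/Availability/?" + urlencode(params),
        dont_filter=True,
        callback=self.parseInventory,
    )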