【Question Title】: Callback function never called using Scrapy
【Posted】: 2016-01-13 21:43:26
【Question】:

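I am new to Scrapy and Python. I have spent several hours debugging and searching for useful answers, but I am still stuck. I am trying to extract data from www.pro-football-reference.com. This is my current code: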

import scrapy

from nfl_predictor.items import NflPredictorItem

class NflSpider(scrapy.Spider):
    name = "nfl2"
    allowed_domains = ["http://www.pro-football-reference.com/"]
    start_url = [
        "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"
    ]

    def parse(self, response):
        print "parse"
        for href in response.xpath('//*[@id="page_content"]/div[1]/table/tr/td/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_game_content)

    def parse_game_content(self, response):
        print "parse_game_content"
        items = []
        for sel in response.xpath('//table[@id = "team_stats"]/tr'):
            item = NflPredictorItem()
            item['away_stats'] = sel.xpath('td[@align = "center"][1]/text()').extract()
            item['home_stats'] = sel.xpath('td[@align = "center"][2]/text()').extract()
            items.append(item)
        return items

I am debugging with Scrapy's parse command, which I run like this:

scrapy parse --spider=nfl2 "http://www.pro-football-reference.com/boxscores/201509100nwe.htm"

I get the following output:

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[]

# Requests  -----------------------------------------------------------------
[<GET http://www.pro-football-reference.com/years/2015/games.htm>,
 <GET http://www.nfl.com/scores/2015/REG1>,
 <GET http://www.pro-football-reference.com/boxscores/201509130buf.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130chi.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130crd.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130dal.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130den.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130htx.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130jax.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130nyj.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130rai.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130ram.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130sdg.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130tam.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509130was.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509140atl.htm>,
 <GET http://www.pro-football-reference.com/boxscores/201509140sfo.htm>]

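Why does it log requests for the links I want but never enter parse_game_content to actually scrape the data? I also tested parse_game_content by using it as the parse function, to make sure it scrapes the right data, and in that case it works fine.

Thanks for your help!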

【Comments】:

  • Are you sure you have imported all the libraries?

Tags: python callback scrapy scrapy-spider


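【Solution 1】:

By default, the parse command fetches the given URL and parses it with the spider that handles it, using the method passed with the --callback option, or parse if none is given. In your case it only runs the parse function, so the requests that parse yields are listed but their callback is never invoked. Pass the callback explicitly with --callback and point the command at a box score page, like this: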

scrapy parse --spider=nfl2 "http://www.pro-football-reference.com/boxscores/201509100nwe.htm" --callback=parse_game_content
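
To illustrate why the first run listed requests but no items, here is a toy model in plain Python. It is only a sketch of the behaviour described above, not Scrapy's implementation: FakeRequest and run_once are hypothetical names invented for this example, and the callbacks take a URL string instead of a real response.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: a URL plus its callback."""
    url: str
    callback: Callable


def parse(url):
    # Listing-page callback: only schedules follow-up requests, yields no items.
    for game in ("201509130buf", "201509130chi"):
        yield FakeRequest(f"/boxscores/{game}.htm", callback=parse_game_content)


def parse_game_content(url):
    # Detail-page callback: the only place where items are produced.
    yield {"scraped_from": url}


def run_once(callback, url):
    """Run one callback and sort its output into items and requests.

    The requests are collected but NOT followed, which mimics a single
    depth level of the parse command.
    """
    items, requests = [], []
    for result in callback(url):
        if isinstance(result, FakeRequest):
            requests.append(result)
        else:
            items.append(result)
    return items, requests


# Running only `parse`: requests are listed, but no items are scraped.
items, requests = run_once(parse, "/boxscores/201509100nwe.htm")
print(items)          # []
print(len(requests))  # 2

# Running `parse_game_content` directly: items appear.
items, requests = run_once(parse_game_content, "/boxscores/201509130buf.htm")
print(items)
```

The parse command also documents a --depth (-d) option for following requests recursively; it may be an alternative to --callback here, but check scrapy parse --help for your version before relying on it.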

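Also, it is better to have parse_game_content yield each item as it is built instead of collecting the items in a list and returning it: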

    def parse_game_content(self, response):
        print "parse_game_content"
        for sel in response.xpath('//table[@id="team_stats"]/tr'):
            item = NflPredictorItem()
            item['away_stats'] = sel.xpath('td[@align = "center"][1]/text()').extract()
            item['home_stats'] = sel.xpath('td[@align = "center"][2]/text()').extract()
            yield item
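
The difference between the two styles can be sketched without Scrapy. In the example below, plain dicts stand in for NflPredictorItem, and the rows are made-up placeholder values rather than real statistics; both function names are hypothetical.

```python
def callback_returning_list(rows):
    # Original style: build the whole list first, then return it.
    items = []
    for away, home in rows:
        items.append({"away_stats": away, "home_stats": home})
    return items


def callback_yielding(rows):
    # Suggested style: hand each item over as soon as it is built.
    for away, home in rows:
        yield {"away_stats": away, "home_stats": home}


# Made-up placeholder rows, used only to exercise the two functions.
rows = [("a1", "h1"), ("a2", "h2")]

# Both styles produce the same items in the same order.
print(callback_returning_list(rows) == list(callback_yielding(rows)))  # True
```

Both forms give the same result when consumed fully; the generator version simply avoids the intermediate list and matches the way parse already yields its requests.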
