【Question title】: How to scrape link within site using scrapy
【Posted】: 2020-09-17 15:41:32
【Question】:

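I'm trying to use scrapy to scrape a website, along with the pages linked from its content. However, when I do this, I get the following error in parse: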
TypeError: 'NoneType' object does not support item assignment

Here is my code:

class PostsSpider(scrapy.Spider):
    name = "posts"
    start_urls = ['https://www.nba.com/teams/bucks']
    allowed_domains = ['nba.com']

    def parse(self, response):
        for post in response.css('.nba-player-index section section'):
            playerPage = response.urljoin(post.css('a').attrib['href'])
            item = yield scrapy.Request(playerPage, callback=self.helper)
            item['number'] = post.css('span.nba-player-trending-item__number::text').get(),
            yield item

    def helper(self, response):
       print("--->"+response.css("title").get())
       item = Item()
       item['title'] = response.css("title::text").get()
       yield item

class Item(scrapy.Item):
    # define the fields for your item here like:
    number = scrapy.Field()
    title = scrapy.Field()
    ppg = scrapy.Field()

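【Comments】:

  • Please provide the full traceback.
  • Unless you intend that method to be a coroutine, the line item = yield scrapy.Request(playerPage, callback=self.helper) is probably wrong. Otherwise, you would need to use send(...) to pass a value into that first item = yield ... line. See the linked question. Also, show the code you use to call these methods / run the script.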
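To illustrate the comment above, here is a minimal plain-Python sketch (no scrapy involved) of why item ends up as None. The generator broken_parse and the two runner functions are hypothetical names made up for this illustration; the sketch only assumes that the caller resumes the generator with next(), the way a plain for loop does, rather than with send(...).

```python
def broken_parse():
    # Same shape as the question's parse(): the value of a yield
    # expression is whatever the caller sends back into the generator.
    item = yield "request"
    item["number"] = "34"
    yield item


def run_with_next():
    gen = broken_parse()
    next(gen)  # runs up to the first yield
    try:
        next(gen)  # resumes with None, so item is None
    except TypeError as exc:
        return str(exc)
    return None


def run_with_send():
    gen = broken_parse()
    next(gen)  # runs up to the first yield
    # send() supplies the value of the pending yield expression
    return gen.send({"title": "demo"})


if __name__ == "__main__":
    print(run_with_next())
    print(run_with_send())
```

When the generator is resumed with next(), the yield expression evaluates to None, which matches the error message in the question. Nothing in the question suggests that the callback's result is sent back into parse, so the item has to be built in the callback itself.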

Tags: python scrapy


【Solution 1】:

Instead of doing that, you can pass the number data along to the helper callback. Something like this:

def parse(self, response):
    for post in response.css('.nba-player-index section section'):
        playerPage = response.urljoin(post.css('a').attrib['href'])
        meta = response.meta.copy()
        meta['number'] = post.css('span.nba-player-trending-item__number::text').get()
        yield scrapy.Request(playerPage, callback=self.helper, meta=meta)


def helper(self, response):
    # `number` arrives in response.meta['number'], so it can be added to the item yielded here.
    item = Item()
    item['number'] = response.meta.get('number')
    yield item
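The flow above can be pictured with a toy model in plain Python. This is only a sketch of the idea, not Scrapy's actual engine: FakeRequest, FakeResponse, crawl, the example URL, and the hard-coded player numbers and paths are all hypothetical stand-ins. It assumes nothing more than what the answer relies on: the callback runs later with its own response, and whatever was attached to the request's meta is visible there.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeRequest:
    # Hypothetical stand-in for a request that carries a callback and meta.
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


@dataclass
class FakeResponse:
    # Hypothetical stand-in: the response exposes the meta of its request.
    url: str
    meta: dict


def parse(response):
    # Made-up data in place of the CSS selectors from the question.
    for number, path in [("34", "/players/a"), ("7", "/players/b")]:
        yield FakeRequest(response.url + path, callback=helper,
                          meta={"number": number})


def helper(response):
    # The value attached in parse() is read back from response.meta.
    yield {"url": response.url, "number": response.meta["number"]}


def crawl(start_url):
    # Toy loop: run each callback, queue follow-up requests, collect items.
    queue = deque([FakeRequest(start_url, callback=parse)])
    items = []
    while queue:
        request = queue.popleft()
        response = FakeResponse(request.url, request.meta)
        for result in request.callback(response):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


if __name__ == "__main__":
    for item in crawl("https://example.invalid"):
        print(item)
```

The point of the toy loop is that parse never sees what helper produces; the only way to combine data from both pages is to send it forward with the request and assemble the item in the last callback.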

【Comments】:
