Answers to: Issue with scraping href in Python using Scrapy Spider

【Question title】: Issue with scraping href in Python using Scrapy Spider
【Posted】: 2019-08-21 00:56:51
【Question】:

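I am trying to scrape the href from each listing title on a Craigslist search page. I am using Python with Scrapy and keep running into problems.

I have tried several things and cannot work out what is going wrong.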

import scrapy

class MySpider(scrapy.Spider):
    name = "HondaUrl"
    start_urls = {'https://chicago.craigslist.org/search/cta?auto_make_model=honda%20cr-v&hints=mileage&max_auto_miles=120000&min_auto_miles=1000&min_auto_year=2004&sort=date'}

    def parse(self,response):
        sel = Selector(response)
        for href in sel.xpath('//div[@class="content"]//p[@class="result-info"]/a/@href').extract_first():
            print(href)

No error message is shown; I simply get zero results.

【Comments】:

Tags: python python-3.x web-scraping scrapy

【Solution 1】:

I made a small fix to your code so that it dumps the hrefs (removed Selector and replaced extract_first with extract):

import scrapy

class MySpider(scrapy.Spider):
    name = "HondaUrl"
    start_urls = ['https://chicago.craigslist.org/search/cta?auto_make_model=honda%20cr-v&hints=mileage&max_auto_miles=120000&min_auto_miles=1000&min_auto_year=2004&sort=date']

    def parse(self, response):
        for href in response.xpath('//div[@class="content"]//p[@class="result-info"]/a/@href').extract():
            print('HREF:', href)

Output:

HREF: https://chicago.craigslist.org/chc/cto/d/chicago-2010-honda-cr-lx/6960935447.html
HREF: https://chicago.craigslist.org/chc/ctd/d/midlothian-2010-honda-cr-ex-4wd-5-speed/6960826946.html
HREF: https://chicago.craigslist.org/chc/ctd/d/chicago-2014-honda-cr-crv-lx-sport/6960791760.html
HREF: https://chicago.craigslist.org/chc/ctd/d/chicago-2016-honda-cr-crv-lx-sport/6960737848.html
HREF: https://chicago.craigslist.org/nch/cto/d/wilmette-honda-crv-2007/6960699975.html
HREF: https://chicago.craigslist.org/chc/ctd/d/westmont-2014-honda-cr-ex-skuel-suv/6960650987.html
...

Update - dumping the results to a JSON file:

import scrapy

class HrefItem(scrapy.Item):
    href = scrapy.Field()

class MySpider(scrapy.Spider):
    name = "HondaUrl"
    start_urls = ['https://chicago.craigslist.org/search/cta?auto_make_model=honda%20cr-v&hints=mileage&max_auto_miles=120000&min_auto_miles=1000&min_auto_year=2004&sort=date']

    def parse(self, response):
        for href in response.xpath('//div[@class="content"]//p[@class="result-info"]/a/@href').extract():
            # print('HREF:', href)
            item = HrefItem()
            item['href'] = href
            yield item

The corresponding documentation is here.

【Discussion】:

OK, that works well. Thank you. I ran the code with scrapy crawl HondaUrl -o UrlResults.json. It runs, but nothing gets saved to the .json file.