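【Question title】: Issue with scraping href in Python using Scrapy Spider
【Posted】: 2019-08-21 00:56:51
【Question description】:

I am trying to scrape the href from each listing title on a Craigslist page. I am using Python with Scrapy and keep running into problems.

I have tried several things and cannot work out what is going wrong.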

import scrapy

class MySpider(scrapy.Spider):
    name = "HondaUrl"
    start_urls = {'https://chicago.craigslist.org/search/cta?auto_make_model=honda%20cr-v&hints=mileage&max_auto_miles=120000&min_auto_miles=1000&min_auto_year=2004&sort=date'}

    def parse(self,response):
        sel = Selector(response)
        for href in sel.xpath('//div[@class="content"]//p[@class="result-info"]/a/@href').extract_first():
            print(href)

No error message is shown; I simply get zero results.

【Discussion】:

    Tags: python python-3.x web-scraping scrapy


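    【Solution 1】:

    I adjusted your code slightly so that it dumps the hrefs: I removed the Selector and replaced extract_first with extract. extract_first returns only the first match as a single string, so the original loop iterated over that string's characters rather than over the links.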

    import scrapy

    class MySpider(scrapy.Spider):
        name = "HondaUrl"
        start_urls = ['https://chicago.craigslist.org/search/cta?auto_make_model=honda%20cr-v&hints=mileage&max_auto_miles=120000&min_auto_miles=1000&min_auto_year=2004&sort=date']
    
        def parse(self, response):
            # response.xpath() can be called directly; no separate Selector is needed.
            # extract() returns a list with every matching href string.
            for href in response.xpath('//div[@class="content"]//p[@class="result-info"]/a/@href').extract():
                print('HREF:', href)
    

    Output:

    HREF: https://chicago.craigslist.org/chc/cto/d/chicago-2010-honda-cr-lx/6960935447.html
    HREF: https://chicago.craigslist.org/chc/ctd/d/midlothian-2010-honda-cr-ex-4wd-5-speed/6960826946.html
    HREF: https://chicago.craigslist.org/chc/ctd/d/chicago-2014-honda-cr-crv-lx-sport/6960791760.html
    HREF: https://chicago.craigslist.org/chc/ctd/d/chicago-2016-honda-cr-crv-lx-sport/6960737848.html
    HREF: https://chicago.craigslist.org/nch/cto/d/wilmette-honda-crv-2007/6960699975.html
    HREF: https://chicago.craigslist.org/chc/ctd/d/westmont-2014-honda-cr-ex-skuel-suv/6960650987.html
    ...
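
The difference between the two calls can be illustrated without Scrapy or network access. The following is only a sketch: it swaps in the standard library's xml.etree.ElementTree for Scrapy's selectors, and the SAMPLE markup and example.org URLs are hypothetical stand-ins for the real results page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the search results markup.
SAMPLE = """
<div class="content">
  <p class="result-info"><a href="https://example.org/a.html">A</a></p>
  <p class="result-info"><a href="https://example.org/b.html">B</a></p>
</div>
"""

root = ET.fromstring(SAMPLE)

# Analogue of .extract(): every matching href, as a list of strings.
all_hrefs = [a.get("href") for a in root.findall(".//p[@class='result-info']/a")]

# Analogue of .extract_first(): only the first match, as a single string.
first_href = all_hrefs[0] if all_hrefs else None

# Looping over the list yields one URL per iteration.
for href in all_hrefs:
    print("HREF:", href)

# Looping over the single string yields one character per iteration,
# which is what the for-loop in the question was doing.
chars = list(first_href)
print(chars[:5])
```

The list form is what a for-loop over links needs; the single-string form is only useful when exactly one value is wanted.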
    

    Update: dumping the results to a JSON file:

    import scrapy

    # Item with a single field that holds one scraped link.
    class HrefItem(scrapy.Item):
        href = scrapy.Field()
    
    class MySpider(scrapy.Spider):
        name = "HondaUrl"
        start_urls = ['https://chicago.craigslist.org/search/cta?auto_make_model=honda%20cr-v&hints=mileage&max_auto_miles=120000&min_auto_miles=1000&min_auto_year=2004&sort=date']
    
        def parse(self, response):
            for href in response.xpath('//div[@class="content"]//p[@class="result-info"]/a/@href').extract():
                # print('HREF:', href)
                # Yield items instead of printing so that a feed export can save them.
                item = HrefItem()
                item['href'] = href
                yield item
    

    See the corresponding Scrapy documentation.
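
To check whether an export actually contains the links, the resulting file can be read back with the standard library. This is a minimal sketch, assuming the spider was run with a JSON feed export (for example the command scrapy crawl HondaUrl -o UrlResults.json quoted in the comments) and that the file is a JSON array of objects with an href key; load_hrefs is a hypothetical helper, not part of Scrapy.

```python
import json

def load_hrefs(path):
    """Return the href values from a JSON feed file (a JSON array of objects)."""
    with open(path, encoding="utf-8") as fh:
        items = json.load(fh)
    # Skip any entries that lack an href key.
    return [item["href"] for item in items if "href" in item]
```

Calling load_hrefs("UrlResults.json") after a crawl should return the list of scraped links. Since -o appends to an existing file, a file left over from earlier runs may no longer be valid JSON, in which case json.load raises an error; deleting the old file before crawling avoids this.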

    【Comments】:

    • OK, that works well. Thank you. I ran scrapy crawl HondaUrl -o UrlResults.json; the spider runs, but I cannot get the results saved to the .json file.