【Question title】: scrapy unable to extract some data from website
【Posted】: 2015-07-23 22:26:57
【Question description】:

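I am using Scrapy to scrape a page, and I can get all of the simple, visible text. However, some of the text is invisible to the crawler and ends up showing as blank.

For example, viewing the page source lets me see these fields: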

https://www.dropbox.com/s/f056mffmuah6uu4/Screenshot%202015-07-23%2018.23.32.png?dl=0

I have tried many times to access these fields via XPath and CSS selectors, but every attempt has failed to retrieve them.

When I try something like:

response.xpath('//text()').extract()

I cannot see these fields in the text dump at all.

Does anyone know why these fields are invisible to Scrapy? The URL is: https://www.buzzbuzzhome.com/uc/units/houses/sapphire

【Discussion】:

    Tags: python web-scraping web-crawler scrapy scrapy-spider


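    【Solution 1】:

    In your spider, you need to make an additional XHR POST request to the https://www.buzzbuzzhome.com/bbhAjax/Development/UnitPriceHistory endpoint to get the price history, providing the necessary headers and POST parameters: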

    import json
    import scrapy
    
    
    class BuzzSpider(scrapy.Spider):
        name = 'buzzbuzzhome'
        allowed_domains = ['buzzbuzzhome.com']
        start_urls = ['https://www.buzzbuzzhome.com/uc/units/houses/sapphire']
    
        def parse(self, response):
            unit_id = response.xpath("//div[@id = 'unitDetails']/@data-unit-id").extract()[0]
            development_url = "uc"
            # re_first() returns a single string (or None) instead of a list of matches
            new_relic_id = response.xpath("//script[contains(., 'xpid')]").re_first(r'xpid:"(.*?)"')
    
            params = {"developmentUrl": development_url, "unitID": unit_id}
            yield scrapy.Request("https://www.buzzbuzzhome.com/bbhAjax/Development/UnitPriceHistory",
                                 method="POST",
                                 body=json.dumps(params),
                                 callback=self.parse_history,
                                 headers={
                                     "Accept": "*/*",
                                     "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36",
                                     "X-Requested-With": "XMLHttpRequest",
                                     "X-NewRelic-ID": new_relic_id,
                                     "Origin": "https://www.buzzbuzzhome.com",
                                     "Host": "www.buzzbuzzhome.com",
                                     'Content-Type': 'application/json; charset=UTF-8'
                                 })
    
        def parse_history(self, response):
            for row in response.css("div.row"):
                title = row.xpath(".//div[@class='content-title']/text()").extract()[0].strip()
                text = row.xpath(".//div[@class='content-text']/text()").extract()[0].strip()
    
                print(title, text)
    

    Prints:

    05/25/2015 Unit listed as Sold
    12/18/2014 Unit listed as For Sale
    11/24/2014 Unit price increased  by 1.54% to $461,990
    11/04/2014 Unit price increased  by 6.81% to $454,990
    10/02/2014 Unit price increased  by 4.67% to $425,990
    01/22/2014 Unit price increased  by 2.52% to $406,990
    12/06/2013 Unit listed as For Sale at $396,990
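
The `xpid` lookup and the JSON body used by the spider above can be checked in isolation before wiring them into a request. The following is only a minimal sketch: the sample script text, the unit id `"12345"`, and the helper names `extract_xpid` and `build_history_request` are hypothetical, made up for illustration, and the real page content may differ.

```python
import json
import re

# Hypothetical excerpt of an inline script; the real page content may differ.
SAMPLE_SCRIPT = 'window.NREUM={info:{xpid:"ABC123=="}}'


def extract_xpid(script_text):
    """Return the first xpid value found in the script text, or None."""
    match = re.search(r'xpid:"(.*?)"', script_text)
    return match.group(1) if match else None


def build_history_request(unit_id, development_url, xpid):
    """Build the JSON body and headers for the price-history POST (illustrative helper)."""
    body = json.dumps({"developmentUrl": development_url, "unitID": unit_id})
    headers = {
        "X-Requested-With": "XMLHttpRequest",
        "Content-Type": "application/json; charset=UTF-8",
    }
    # Only send the New Relic header when an id was actually found.
    if xpid is not None:
        headers["X-NewRelic-ID"] = xpid
    return body, headers


body, headers = build_history_request("12345", "uc", extract_xpid(SAMPLE_SCRIPT))
print(body)
print(headers)
```

Keeping the body and headers in a plain function like this makes it easy to compare them against what the browser sends in the Network tab.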
    

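    【Comments】:

    • Thank you very much! Could you help me understand why the second request is needed? Thanks again
    • @rmaka Sure! Open the browser developer tools, go to the Network tab, and see how much is involved in building the page in the browser. For this particular page, several XHR requests are sent. One of them fetches the price history, which is the request we simulate in the Scrapy spider.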
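
To illustrate the point in the comment above: content that the browser loads through XHR is simply absent from the HTML that the first request returns, so no selector can find it there. The sketch below uses only the standard library and a hypothetical static page (the `STATIC_HTML` markup, the `unitPriceHistory` id, and the `loadHistory` call are invented for illustration) to collect every text node, roughly what `//text()` would return.

```python
from html.parser import HTMLParser

# Hypothetical static HTML: the container stays empty until JavaScript fills it.
STATIC_HTML = """
<html><body>
  <h1>Sapphire</h1>
  <div id="unitPriceHistory"></div>
  <script>loadHistory("/some/ajax/endpoint");</script>
</body></html>
"""


class TextCollector(HTMLParser):
    """Collect all non-empty text nodes, including script contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


collector = TextCollector()
collector.feed(STATIC_HTML)
collector.close()

# The heading and the script source are present, but no price history rows are.
print(collector.chunks)
print(any("Unit listed" in chunk for chunk in collector.chunks))
```

The price history only appears after the separate XHR response arrives, which is why the spider has to issue that request itself.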