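【Question title】: Websites getting crawled but not scraped (Scrapy)
【Posted】: 2019-07-03 10:36:54
【Question】:

I have been scraping this website and trying to store the properties. Some properties do get scraped, but others are only crawled and never scraped: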

import scrapy


class CapeWaterfrontSpider(scrapy.Spider):
    name = "cape_waterfront"
    start_urls = ['https://www.capewaterfrontestates.co.za/template/Properties.vm/listingtype/SALES']

    def parse(self, response):
        for prop in response.css('div.col-sm-6.col-md-12.grid-sizer.grid-item'):

            link = prop.css('div.property-image a::attr(href)').get()

            bedrooms = prop.css('div.property-details li.bedrooms::text').getall()
            bathrooms = prop.css('div.property-details li.bathrooms::text').getall()
            gar = prop.css('div.property-details li.garages::text').getall()

            if len(bedrooms) == 0:
                bedrooms.append(None)
            else:
                bedrooms = bedrooms[1].split()
            if len(bathrooms) == 0:
                bathrooms.append(None)
            else:
                bathrooms = bathrooms[1].split()
            if len(gar) == 0:
                gar.append(None)
            else:
                gar = gar[1].split()

            yield scrapy.Request(
                link,
                meta={'item': {
                    'agency': self.name,
                    'url': link,
                    'title': ' '.join(prop.css('div.property-details p.intro::text').get().split()),
                    'price': ''.join(prop.css('div.property-details p.price::text').get().split()),
                    'bedrooms': str(bedrooms),
                    'bathroom':  str(bathrooms),
                    'garages': str(gar)
                }},
                callback=self.get_loc,
            )

        next_page = response.css('p.form-control-static.pagination-link a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

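Any suggestions on how to make this work? Thanks a lot in advance.

【Question comments】:

    Tags: python web-scraping scrapy


    【Answer 1】:

    The way you have defined your selectors is error-prone. On top of that, a few of the selectors are faulty and do not work at all. The link to the next page does not work either: the spider only visits page 1 and then quits. Finally, I don't know of any way to express next_sibling in a CSS selector, so I had to dig out the next-sibling content in a somewhat awkward way.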

    import scrapy
    
    
    class CapeWaterfrontSpider(scrapy.Spider):
        name = "cape_waterfront"
        start_urls = ['https://www.capewaterfrontestates.co.za/template/Properties.vm/listingtype/SALES']
    
        def parse(self, response):
    
            for prop in response.css('.grid-item'):
                link = prop.css('.property-image a::attr(href)').get()
    
                # Strip every text node; the value is taken from the second-to-last node.
                # The guard must be >= 2, otherwise [-2] raises IndexError on a one-node list.
                bedrooms = [elem.strip() for elem in prop.css(".bedrooms::text").getall()]
                bedrooms = bedrooms[-2] if len(bedrooms) >= 2 else None
    
                bathrooms = [elem.strip() for elem in prop.css(".bathrooms::text").getall()]
                bathrooms = bathrooms[-2] if len(bathrooms) >= 2 else None
    
                gar = [elem.strip() for elem in prop.css(".garages::text").getall()]
                gar = gar[-2] if len(gar) >= 2 else None
    
                yield scrapy.Request(
                    link,
                    meta={'item': {
                        'agency': self.name,
                        'url': link,
                        'bedrooms': bedrooms,
                        'bathroom':  bathrooms,
                        'garages': gar
                    }},
                    callback=self.get_loc,
                )
    
            next_page = response.css('.pagination-link a.next::attr(href)').get()
            if next_page:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)
    
        def get_loc(self, response):
            # The partially filled item arrives through the request's meta dict.
            items = response.meta['item']
            print(items)
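
The `[-2]` index above depends on how many text nodes each `li` happens to contain. As a minimal sketch, assuming the value is the only non-blank text node that `::text` returns for the element, a small helper can pick it without positional indexing. `first_non_blank` is a hypothetical name, not part of Scrapy, and the sample lists are made up for illustration.

```python
from typing import Iterable, Optional


def first_non_blank(texts: Iterable[str]) -> Optional[str]:
    """Return the first text node that is non-empty after stripping, else None."""
    for text in texts:
        stripped = text.strip()
        if stripped:
            return stripped
    return None


# Hypothetical text nodes, shaped like prop.css(".bedrooms::text").getall()
print(first_non_blank(["\n  ", " 3 \n", "\n"]))  # 3
print(first_non_blank([]))                        # None
```

Inside `parse`, it could be called as `first_non_blank(prop.css(".bedrooms::text").getall())`, which yields `None` for listings that have no such element.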
    

    If you want a cleaner way to get these three fields, I think XPath is what you want to stick with:

    for prop in response.css('.grid-item'):
        link = prop.css('.property-image a::attr(href)').get()
        bedrooms = prop.xpath("normalize-space(.//*[contains(@class,'bedrooms')]/label/following::text())").get()
        bathrooms = prop.xpath("normalize-space(.//*[contains(@class,'bathrooms')]/label/following::text())").get()
        gar = prop.xpath("normalize-space(.//*[contains(@class,'garages')]/label/following::text())").get()
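
What makes the XPath version shorter is `normalize-space()`: it strips leading and trailing whitespace and collapses internal runs of whitespace into one space, so no manual `strip()` or list indexing is needed. The function below is only a rough Python approximation of that behaviour for plain strings, written to illustrate the idea; it is not how Scrapy or lxml implement it.

```python
import re

# XPath 1.0 treats only space, tab, carriage return and line feed as whitespace.
_XPATH_WS = " \t\r\n"


def normalize_space(value: str) -> str:
    """Approximate XPath normalize-space() for a Python string."""
    collapsed = re.sub(r"[ \t\r\n]+", " ", value)
    return collapsed.strip(_XPATH_WS)


# Hypothetical raw text, as it might follow a <label> inside a listing
print(normalize_space("\n      3 \n   "))
```

When the path matches nothing, the XPath string value is empty, so the result should be an empty string rather than `None`; a check such as `bedrooms or None` can restore the `None` convention if the rest of the pipeline expects it.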
    

    For brevity I have left out two or three fields; I suppose you can manage them.
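
The omitted fields in the question are `title` and `price`, both built with `.get().split()`. Since `.get()` returns `None` when a selector matches nothing, calling `.split()` on it raises `AttributeError`, which may be one reason some pages are crawled but yield no items. A minimal sketch of a guard follows; `clean_text` is a hypothetical helper and the sample strings are invented.

```python
from typing import Optional


def clean_text(value: Optional[str], sep: str = " ") -> Optional[str]:
    """Collapse whitespace in value, joining the pieces with sep; pass None through."""
    if value is None:
        return None
    return sep.join(value.split())


print(clean_text("\n  Sea  Point \n apartment "))  # Sea Point apartment
print(clean_text(" R 4 500 000 ", sep=""))         # R4500000
print(clean_text(None))                            # None
```

In the spider this would look like `clean_text(prop.css('p.intro::text').get())` for the title and `clean_text(prop.css('p.price::text').get(), sep="")` for the price.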

    【Discussion】:

    • I have included both the CSS selectors and the XPaths in your script to dig out those items, @saraherceg.