[Question title]: Scrapy - XPath for Next Page
[Posted]: 2018-06-10 06:25:56
[Question]:

I'm having trouble writing an XPath that selects the "Next page" URL on a site.

The HTML is as follows:

<div class="pagingcont">

        <div class="right margintop" id="save_search_header_popup" style="width:550px;">
            <div class="left marginleft" style="padding-top:1px;">
                <div class="left save_search_env"><img src="/themes/LW1/refresh/images/envelope_icon.gif" alt="Save" />&nbsp;</div>
                <div class="left">
                    Save this search and receive email alerts of new listings
                    &nbsp;<input type="text" maxlength="100" value="Name this search" onfocus="doSavedSearchFocus(this,'Name this search');" style="width:120px;height:14px;color:Gray;"/>&nbsp;
                </div>
            </div>
            <div class="left save_search_btn" style="margin-right:10px;"><img class="pointer" src="/themes/LW1/refresh/images/btn_save.gif" alt="Save"  onclick="showPopup(document.getElementById('save_search_header_popup'), null, 'In order to be notified of new or updated properties, you need to be registered and signed in.');return false;"/></div>
        </div>
        <div class="left margintop marginleft" style="cursor:pointer;height:27px;" onclick="javascript:docompare(true);">
            <div class="left"><img src="//www.landwatch.com/themes/LW1/images/comparebtn_btm.gif" style="margin-bottom:0px;">&nbsp;&nbsp;</div>
            <div class="left active" style="margin-top:4px;">COMPARE</div>
        </div>
        <div class="clear topline"></div>

    <div class="clear margin">
        <b>Page &nbsp;</b>
        &nbsp;<span class="active" style="padding:3px 3px 3px 4px;border:solid 1px black;">1&nbsp;</span>&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=2">2</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=3">3</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=4">4</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=5">5</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=6">6</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=7">7</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=8">8</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=9">9</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=10">10</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=11">11</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=12">12</a>&nbsp;|&nbsp;<a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=13">13</a>&nbsp;| <a href="https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2c&pg=2">Next</a>
    </div>

(The href I'm looking for is the last one, at the far bottom right; it's hard to see here...)

My Scrapy attempt looks like this:

next_page_url = response.xpath("//div[@class='pagingcont']//span//a[text()='Next']/href")
    next_page_url = response.urljoin(next_page_url)

    for href in response.css('div.propName a::attr(href)'):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_product_page)
    yield scrapy.Request(next_page_url, callback=self.parse)

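But every time, Scrapy gives me the results from the first page and then nothing more, so I assume it is not finding the next page. What is wrong with next_page_url?

[Comments]:

    Tags: python html xpath scrapy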


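    [Answer 1]:

    Your XPath has two problems:

    1. It looks for the link inside a <span>, but in your HTML the <a> elements are not inside any <span>
    2. href is an attribute, not a child element, so it must be written as @href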

    A complete working example follows.

    from scrapy import Request
    from scrapy.spiders import Spider
    
    class LandSpider(Spider):
        name = 'myspider'
        start_urls = [
            'https://www.landwatch.com/default.aspx?ct=r&type=5,37;268,6843&=&px=2000000&r.PSIZ=500%2C&pg=1']
    
        def parse(self, response):
            # The "Next" link is an <a> inside the paging block; read its href attribute.
            next_page_url = response.xpath(
                "//div[@class='pagingcont']//a[text()='Next']/@href").extract_first()
    
            # Follow every listing link on the current page.
            for href in response.css('div.propName a::attr(href)'):
                url = response.urljoin(href.extract())
                yield Request(url, callback=self.parse_product_page)
    
            # extract_first() returns None when no "Next" link is found (presumably the last page).
            if next_page_url:
                yield Request(response.urljoin(next_page_url), callback=self.parse)
    
        def parse_product_page(self, response):
            # Yield a dict item; a bare string is not a valid callback result.
            yield {'title': response.xpath("//div[@class='detTitle']/text()").extract_first()}
    

    Result:

    [
    {"title": "Lulaton, Brantley County, Coast, GA Land For Sale - 936 Acres"},
    {"title": "Oglethorpe County, GA Land For Sale - 515 Acres"},
    {"title": "Dawsonville, Lumpkin County, GA Land For Sale - 525 Acres"},
    {"title": "Wheeler County, GA Land For Sale - 594 Acres"},
    {"title": "Cedartown, Polk County, GA Land For Sale - 1185.65 Acres"},
    ...
    ]
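A note on the next-page request: the Next href on this site is already absolute, so it can be passed to Request as-is, but on the final page there is presumably no Next link and extract_first() would return None. The sketch below uses only the standard library to illustrate both cases. resolve_next_page is a hypothetical helper, and the shortened URLs are placeholders rather than the real search URLs.

```python
from urllib.parse import urljoin


def resolve_next_page(page_url, href):
    """Return an absolute next-page URL, or None when no link was found."""
    if not href:
        # Nothing extracted: stop paginating instead of building a bad request.
        return None
    # urljoin leaves absolute URLs alone and resolves relative ones against page_url.
    return urljoin(page_url, href)


page_url = "https://www.landwatch.com/default.aspx?pg=1"

absolute = resolve_next_page(page_url, "https://www.landwatch.com/default.aspx?pg=2")
relative = resolve_next_page(page_url, "default.aspx?pg=2")
last_page = resolve_next_page(page_url, None)

print(absolute)
print(relative)
print(last_page)
```

Inside a spider, response.urljoin performs the same kind of join against the page URL, so a check such as `if next_page_url:` before yielding the request covers the missing-link case.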
    

    [Comments]:

    • That did it. Thank you very, very much, jschnurr.
    [Answer 2]:

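    First, in the HTML sample you posted, no span is an ancestor of the a tag, so //span//a matches nothing. In addition, href is an attribute and has to be selected with @href. Your XPath should probably just be:

    "//div[@class='pagingcont']//a[text()='Next']/@href"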
    

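    It could certainly be made more robust.

    You are also not extracting a value in your Python code. response.xpath() returns a SelectorList, not a string, so your first next_page_url variable (the first line of the code you shared) is not something urljoin or Request can use as a URL. Call .extract_first() to get the string:

    next_page_url = response.xpath("//div[@class='pagingcont']//a[text()='Next']/@href").extract_first()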
    

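    [Comments]:

    • Thank you very much for the reply. Unfortunately, it still only processes the first page and does not move on to the next one.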