Get elements from a table using XPath: answers

【Question title】: Get elements from table using XPath
【Posted】: 2019-07-08 10:24:51
【Question】:

I am trying to get information from this website: https://www.realtypro.co.za/property_detail.php?ref=1736

I have this table, and I want to get the number of bedrooms from it:

<div class="panel panel-primary">
    <div class="panel-heading">Property Details</div>
        <div class="panel-body">
            <table width="100%" cellpadding="0" cellspacing="0" border="0" class="table table-striped table-condensed table-tweak">
                <tbody><tr>
                    <td class="xh-highlight">3</td><td style="width: 140px" class="">Bedrooms</td>

                </tr>
                <tr>
                    <td>Bathrooms</td>
                    <td>3</td>
                </tr>

I am using this XPath expression:

bedrooms = response.xpath("//div[@class='panel panel-primary']/div[@class='panel-body']/table[@class='table table-striped table-condensed table-tweak']/tbody/tr[1]/td[2]/text()").extract_first()

However, the output is None.

I have tried several combinations, but I only ever get None. Any suggestions on what I am doing wrong?

Thanks in advance!

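【Comments】:

You need the second tr.
@Utkanos Even if I change it to tr[2]/td[2], I still get None as output.
In your question you say you need the number of bathrooms; do you mean the output should be 3, the number of Bedrooms?
Another approach: .xpath("//*[starts-with(@class,'table')]//tr[contains(.,'Bedrooms')]/td/text()").get()
The elements you pasted above are in a different order from the ones on the linked page. On the page, the number of bedrooms comes after the label, but in the HTML you provided the number comes first.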
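
The comments above point in the same direction: select the cell by its label rather than by row and column position. A minimal sketch of that idea follows, assuming lxml is available. `SAMPLE_HTML` is a hypothetical, reduced stand-in for the page (label first, value second, as the last comment describes the live page), and `cell_after_label` is an illustrative helper, not part of any library. The sample deliberately has no `tbody`: browsers commonly insert that element into the DOM shown in developer tools even when the HTML sent to a scraper lacks it, which is one possible (unconfirmed here) reason a path containing `/tbody/` returns nothing.

```python
import lxml.html

# Hypothetical, reduced sample of the table; the live page may differ.
# There is no <tbody> here, as is common in raw (non-browser) HTML.
SAMPLE_HTML = """
<div class="panel panel-primary">
  <div class="panel-heading">Property Details</div>
  <div class="panel-body">
    <table class="table table-striped table-condensed table-tweak">
      <tr><td>Bedrooms</td><td>3</td></tr>
      <tr><td>Bathrooms</td><td>3</td></tr>
    </table>
  </div>
</div>
"""


def cell_after_label(doc, label):
    """Return the text of the td that follows the td whose text is `label`."""
    values = doc.xpath(
        "//table[contains(@class, 'table-tweak')]"
        "//td[normalize-space()=$label]/following-sibling::td[1]/text()",
        label=label,  # lxml passes keyword arguments as XPath variables
    )
    return values[0].strip() if values else None


doc = lxml.html.fromstring(SAMPLE_HTML)

# A positional path that includes tbody finds nothing in this sample.
positional = doc.xpath("//table/tbody/tr[1]/td[2]/text()")

print(positional)
print(cell_after_label(doc, "Bedrooms"))
```

Matching on the label keeps the expression working if rows are reordered, and the `None` return makes a missing row easy to detect.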

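Tags: html parsing xpath web-scraping web-crawler

【Solution 1】:

I would use bs4 4.7.1, where you can use :contains to find the td cell with the text "Bedrooms" and then take its adjacent sibling td. You can add an is None test for error handling. This is less fragile than a long XPath.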

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.realtypro.co.za/property_detail.php?ref=1736')
soup = bs(r.content, 'lxml')
# Find the td containing "Bedrooms", then take the td immediately after it.
print(int(soup.select_one('td:contains(Bedrooms) + td').text))

If the position is fixed, you can use

.table-tweak td + td
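
The is None test mentioned in this answer could look like the sketch below. It swaps in a different technique from the answer's CSS selector: Beautiful Soup's `find` with a `string` filter plus `find_next_sibling`, which does not depend on the `:contains` pseudo-class (newer Soup Sieve releases prefer `:-soup-contains`). The HTML is a hypothetical reduced sample, `bedrooms_count` is an illustrative name, and the built-in `html.parser` is used only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical, reduced sample; the live page may be laid out differently.
SAMPLE_HTML = """
<table class="table table-striped table-condensed table-tweak">
  <tr><td>Bedrooms</td><td>3</td></tr>
  <tr><td>Bathrooms</td><td>3</td></tr>
</table>
"""


def bedrooms_count(html):
    """Return the bedroom count as an int, or None if it cannot be found."""
    soup = BeautifulSoup(html, "html.parser")

    # Locate the label cell by its exact (stripped) text.
    label = soup.find(
        "td", string=lambda s: s is not None and s.strip() == "Bedrooms"
    )
    if label is None:
        return None  # no "Bedrooms" row on this page

    # The value is assumed to sit in the next td of the same row.
    value = label.find_next_sibling("td")
    if value is None:
        return None

    try:
        return int(value.get_text(strip=True))
    except ValueError:
        return None  # cell present but not numeric


print(bedrooms_count(SAMPLE_HTML))
```

Returning `None` on every failure path lets the calling code decide whether a missing value should be skipped or reported.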

【Comments】:

【Solution 2】:

Try this and let me know if it works:

import lxml.html

response = [your code above]
beds = lxml.html.fromstring(response)

bedrooms = beds.xpath("//div[@class='panel panel-primary']/div[@class='panel-body']/table[@class='table table-striped table-condensed table-tweak']/tbody/tr[1]/td[2]//preceding-sibling::*/text()")
bedrooms

Output:

['3']

Edit:

Or maybe:

for bed in beds:
     num_rooms = bed.xpath("//div[@class='panel panel-primary']/div[@class='panel-body']/table[@class='table table-striped table-condensed table-tweak']/tbody/tr[1]/td[2]//preceding-sibling::*/text()")
     print(num_rooms)

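【Comments】:

When I try this, I get the error 'list' object has no attribute 'xpath' @JackFleeting
@saraherceg - I don't get this error, but please look at the edit and let me know if it works.
I think something is wrong with my response, because it goes into beds and now throws the error 'expected string': response = [response.xpath("//div[@class='panel panel-primary']/div[@class='panel-body']/table[@class='table table-striped table-condensed table-tweak']/tbody/tr[1]/td[2]/text()").extract_first()] beds = lxml.html.fromstring(response) bedrooms = beds.xpath("//div[@class='panel panel-primary']/div[@class='panel-body']/table[@class='table table-striped table-condensed table-tweak']/tbody/tr[1]/td[2]//preceding-sibling::*/text()") @JackFleeting
@saraherceg - Yes, something does seem to be wrong. If the information isn't confidential, you may want to post the link so people can download the data themselves and look at the output.
Thanks, I have just posted the link to the website! @JackFleeting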
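
The two errors in this thread both look like type mismatches: `lxml.html.fromstring` parses HTML source given as a string or bytes, while the comment above passes it a list holding an already extracted value. A minimal sketch of the distinction follows. `parse_html` is a hypothetical wrapper added here only to make the failure explicit, and `page_source` stands in for the full page text (for example `response.text` in Scrapy or `r.text` in requests).

```python
import lxml.html


def parse_html(source):
    """Parse HTML source, rejecting anything that is not str or bytes."""
    if not isinstance(source, (str, bytes)):
        raise TypeError(
            f"expected HTML as str or bytes, got {type(source).__name__}"
        )
    return lxml.html.fromstring(source)


# Hypothetical stand-in for the downloaded page text.
page_source = "<table><tr><td>Bedrooms</td><td>3</td></tr></table>"

doc = parse_html(page_source)          # an element, so .xpath() is available
cells = doc.xpath("//tr[1]/td/text()")
print(cells)

# Passing a list (as in the comment above) is rejected up front.
try:
    parse_html([None])
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

Parsing the whole page once and running XPath on the resulting element avoids mixing Scrapy selector results with lxml parsing.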