How can I extract the href with XPath? (Answers)

【Question title】: How can I extract the href with XPath?
【Posted】: 2018-08-31 12:15:25
【Question description】:

The HTML structure looks like this:

<div class="image">
  <a target="_top" href="someurl">
    <img class="_verticallyaligned" src="cdn.translte" alt="">
  </a>
  <button class="dui-button -icon" data-shop-id="343170" data-id="14145140">
    <i class="dui-icon -favorite"></i>
  </button>
</div>

The code I use to extract the text:

buyers = doc.xpath("//div[@class='image']/a[0]/text()")

The output is:

[]

What am I doing wrong?

【Question comments】:

You may be looking for something like chrome.google.com/webstore/detail/xpath-helper/… or chrome.google.com/webstore/detail/xpath-finder/…, or a similar extension for other browsers.

Tags: python xpath scrapy lxml

【Solution 1】:

Your XPath is incorrect because indexing in XPath, unlike in most programming languages, starts at 1, not 0!

So the correct XPath should be

//div[@class='image']/a[1]/@href

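Note that a[1] is used instead of a[0].

Also, text() is meant for extracting text nodes. If you need the value of a specific attribute, use the @attribute_name syntax or its unabbreviated form, attribute::attribute_name.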
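
A minimal sketch of the indexing point, assuming lxml is installed and reusing the HTML from the question (the button is left out for brevity). The variable names are illustrative only.

```python
from lxml import etree

# HTML fragment taken from the question (button omitted).
html = """<div class="image">
  <a target="_top" href="someurl">
    <img class="_verticallyaligned" src="cdn.translte" alt="">
  </a>
</div>"""

doc = etree.HTML(html)

# Position 0 never matches, because XPath positions start at 1.
zero_based = doc.xpath("//div[@class='image']/a[0]/@href")

# Position 1 selects the first <a> child of the div.
one_based = doc.xpath("//div[@class='image']/a[1]/@href")

# attribute::href is the unabbreviated form of @href.
long_form = doc.xpath("//div[@class='image']/a[1]/attribute::href")

print(zero_based)  # []
print(one_based)   # ['someurl']
print(long_form)   # ['someurl']
```

Note that a[0] is not a syntax error. It is a valid predicate that simply selects nothing, which is why the original query returned an empty list instead of raising an exception.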

【Comments】:

【Solution 2】:

Use @href to get the value of the href attribute, and a[1] rather than a[0], since XPath positions start at 1.

buyers = doc.xpath("//div[@class='image']/a[1]/@href")

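【Comments】:

Thanks. How can I get the src of the img element?
You can reuse the same XPath expression with a small change: //div[@class='image']/a[1]/img/@src
Thanks again, but here's the thing: what I get is in an 'all test data' format, and it still outputs [].
Does the format contain whitespace? Could that be the problem?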
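
The thread does not settle why the asker still got []. As a hedged sketch, assuming lxml: the first query below reads the img src from the question's HTML, and the second part uses a hypothetical variant of the markup (extra spaces and a second class name, neither of which appears in the question) to show one way whitespace in an attribute can defeat an exact @class='image' comparison.

```python
from lxml import etree

# HTML fragment from the question (button omitted for brevity).
html = """<div class="image">
  <a target="_top" href="someurl">
    <img class="_verticallyaligned" src="cdn.translte" alt="">
  </a>
</div>"""
doc = etree.HTML(html)

# a[1] (not a[0]) selects the first <a>; /img/@src then reads the attribute.
src = doc.xpath("//div[@class='image']/a[1]/img/@src")
print(src)  # ['cdn.translte']

# Hypothetical variant: the class attribute has extra spaces and a second class.
variant = etree.HTML(
    '<div class=" image featured "><a href="someurl">'
    '<img src="cdn.translte"></a></div>'
)

# An exact string comparison no longer matches the attribute value.
exact = variant.xpath("//div[@class='image']/a[1]/img/@src")

# Token-style match: pad the normalized class list with spaces, look for ' image '.
token = variant.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' image ')]"
    "/a[1]/img/@src"
)
print(exact)  # []
print(token)  # ['cdn.translte']
```

Whether this was the asker's actual problem is a guess. The padded contains() idiom is simply a common way to match one class name among several in XPath 1.0.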

【Solution 3】:

Using attrib['href'] should help.

from lxml import etree

# HTML fragment from the question
s = """<div class="image">
  <a target="_top" href="someurl">
    <img class="_verticallyaligned" src="cdn.translte" alt="">
  </a>
  <button class="dui-button -icon" data-shop-id="343170" data-id="14145140">
    <i class="dui-icon -favorite"></i>
  </button>
</div>"""

tree = etree.HTML(s)
# Select the <a> elements, then read the attribute from the first match
r = tree.xpath("//div[@class='image']/a")
print(r[0].attrib['href'])

Output:

someurl
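
A related sketch, assuming lxml: when some matched elements might lack the attribute, Element.get avoids the KeyError that attrib['href'] would raise. The second anchor in this fragment is hypothetical and was added only to illustrate the missing-attribute case.

```python
from lxml import etree

# Hypothetical fragment: the second <a> has no href attribute.
html = """<div class="image">
  <a target="_top" href="someurl"><img src="cdn.translte" alt=""></a>
  <a target="_top">no link</a>
</div>"""

tree = etree.HTML(html)
anchors = tree.xpath("//div[@class='image']/a")

# Element.get returns None when the attribute is absent, whereas
# attrib['href'] would raise KeyError for the second anchor.
hrefs = [a.get("href") for a in anchors]
print(hrefs)  # ['someurl', None]
```

Selecting elements first and reading attributes in Python is handy when several attributes of the same element are needed. For a single attribute, the /@href form keeps everything in one XPath expression.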

【Comments】:

【Solution 4】:

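/text() means you are fetching the text inside that tag. To get the value of an attribute, use /@attribute instead, and remember that XPath positions start at 1. So in your case, use doc.xpath("//div[@class='image']/a[1]/@href")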
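
To make the difference between /text() and /@attribute concrete, here is a small sketch assuming lxml and the question's HTML (button omitted). The names visible_text and href_string are illustrative.

```python
from lxml import etree

# HTML fragment from the question (button omitted for brevity).
html = """<div class="image">
  <a target="_top" href="someurl">
    <img class="_verticallyaligned" src="cdn.translte" alt="">
  </a>
</div>"""
doc = etree.HTML(html)

# text() selects the text nodes that are direct children of <a>.
# Here the <a> holds only an <img>, so any text nodes are pure whitespace.
texts = doc.xpath("//div[@class='image']/a[1]/text()")
visible_text = [t.strip() for t in texts if t.strip()]

# @href selects the attribute; the result is a list of strings.
href = doc.xpath("//div[@class='image']/a[1]/@href")

# Wrapping the path in string() yields a single string instead of a list.
href_string = doc.xpath("string(//div[@class='image']/a[1]/@href)")

print(visible_text)  # []
print(href)          # ['someurl']
print(href_string)   # someurl
```

So even with the index corrected, /text() would not have returned the URL here, because the URL lives in an attribute rather than in the element's text.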

【Comments】: