Using xpath to extract date

【Question title】: Using xpath to extract date
【Posted】: 2016-08-31 14:35:37
【Question】:

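I'm trying to extract the date "07/18/16" from the web page markup shown in the comment below. I'm not clear on XPath syntax; how do you get only the date?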

# <p>Opened <a class="timeline" href="/trac3/timeline?from=2016-07-18T14%3A46%3A43-04%3A00&amp;precision=second" title="See timeline at 07/18/16 14:46:43">6 weeks ago</a></p>

from lxml import html
import requests

page = requests.get(webpage)
tree = html.fromstring(page.content)

openDate = tree.xpath('//Opened/text()')

print('Open Date:', openDate)

【Comments】:

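Split the title on # once, then take the second element and split it again to get the first element; use /a/@title instead of the text() call to get the title.
Actually, "Opened" is text inside the p, so your xpath won't find anything.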

Tags: python regex xpath python-requests lxml

【Solution 1】:

Like this?

import re
from lxml import html

data = """<p>Opened <a class="timeline" href="/trac3/timeline?from=2016-07-18T14%3A46%3A43-04%3A00&amp;precision=second" title="See timeline at 07/18/16 14:46:43">6 weeks ago</a></p>"""

tree = html.fromstring(data)
try:
    href = tree.xpath("//a[@class='timeline']/@href")[0]
    openDate = re.search(r'from=(\d+-\d+-\d+)', href).group(1)
    print('Open Date: ', openDate)
    # Open Date:  2016-07-18
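except (IndexError, AttributeError):
    # IndexError: no matching <a>; AttributeError: the regex found no date
    print("Something went wrong")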

This first grabs the @href attribute and then parses it with a regular expression.

After reading the question again, you may prefer to look at the title attribute instead:

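try:
    # The title attribute holds the date in MM/DD/YY form
    title = tree.xpath("//a[@class='timeline']/@title")[0]
    openDate = re.search(r'\d+/\d+/\d+', title).group(0)
    print('Open Date: ', openDate)
    # Open Date:  07/18/16
except (IndexError, AttributeError):
    # IndexError: no matching <a>; AttributeError: the regex found no date
    print("Something went wrong")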

【Comments】:

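My only guess is that the OP wants "07/18/16" (@title) rather than "2016-07-18" (@href). In my opinion that isn't worth a downvote. The concept still holds. +1.
@DanielHaley: Thanks, I've updated the answer to reflect the title attribute.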

【Solution 2】:

Here is one way to do it using only XPath 1.0:

substring-before(substring-after(normalize-space(//a[contains(concat(' ',normalize-space(@class),' '),' timeline ')]/@title),'See timeline at '), ' ')

contains(concat(' ',normalize-space(@class),' '),' timeline ') may look like overkill, but it accounts for the possibility that the class attribute holds other classes besides "timeline".
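To illustrate why the longer class test can matter, here is a minimal sketch. The markup is hypothetical: it is the question's anchor with an extra class, "ext", added and the href dropped for brevity. It assumes lxml is installed.

```python
from lxml import html

# Hypothetical markup: the anchor carries a second class, "ext"
snippet = ('<p>Opened <a class="ext timeline" '
           'title="See timeline at 07/18/16 14:46:43">6 weeks ago</a></p>')
tree = html.fromstring(snippet)

# Exact comparison: fails because @class is "ext timeline", not "timeline"
exact = tree.xpath("//a[@class='timeline']/@title")

# Token test: pads @class with spaces and looks for " timeline " inside it
token = tree.xpath(
    "//a[contains(concat(' ', normalize-space(@class), ' '), ' timeline ')]/@title"
)

print(exact)  # []
print(token)  # ['See timeline at 07/18/16 14:46:43']
```

The padding with spaces also keeps the test from matching a class such as "timeline-old", which a plain contains(@class, 'timeline') would accept.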

XPath test: http://www.xpathtester.com/xpath/7805b0601b1468ea17209127e14fa470

lxml example

from lxml import html

page = """<p>Opened <a class="timeline" href="/trac3/timeline?from=2016-07-18T14%3A46%3A43-04%3A00&amp;precision=second" title="See timeline at 07/18/16 14:46:43">6 weeks ago</a></p>"""
tree = html.fromstring(page)

try:
    openDate = tree.xpath("substring-before(substring-after(normalize-space(//a[contains(concat(' ',normalize-space(@class),' '),' timeline ')]/@title),'See timeline at '), ' ')")
    print('Open Date:', openDate)
    # Open Date: 07/18/16
except Exception:
    # e.g. an XPath evaluation error
    print("Something went wrong")

【Comments】:

【Solution 3】:

XPath works by matching elements in an XML-structured document.

Your XPath fails because what it says is: search the whole document ("//") for any element named "Opened" (i.e. <Opened/>) and return its inner text ("text()").
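A small sketch of that failure, assuming lxml and a shortened copy of the question's markup (the href is omitted here for brevity):

```python
from lxml import html

# Shortened copy of the markup from the question (href omitted)
snippet = ('<p>Opened <a class="timeline" '
           'title="See timeline at 07/18/16 14:46:43">6 weeks ago</a></p>')
tree = html.fromstring(snippet)

# There is no <Opened> element, so this expression matches nothing
no_match = tree.xpath('//Opened/text()')

# "Opened " is a text node inside <p>, which this expression does select
p_text = tree.xpath('//p/text()')

print(no_match)  # []
print(p_text)    # ['Opened ']
```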

Assuming your HTML is consistent, what you really want is to grab the contents of the anchor's title, which holds the date, like this:

//p[contains(text(),'Opened')]/a[@class='timeline']/@title

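This searches the whole document for any anchor with the class "timeline" that sits inside a paragraph containing the word "Opened", and returns the contents of its "title" attribute.

Note that I said "any anchor": your result will be a list of matching titles, so if there is more than one match you need to decide what to do with them.

Once you have the title, you'll need some string slicing in Python to pull out the date part.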

I'm assuming it's only the XPath you're struggling with, so I've left out any Python examples. I recommend this site as a good starting point for XPath: http://dh.obdurodon.org/introduction-xpath.xhtml
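For readers who want the Python side as well, here is one possible sketch of the slicing step. It is not part of the original answer. It assumes lxml, the sample markup from the question, and that every title begins with the fixed prefix "See timeline at ".

```python
from lxml import html

page = """<p>Opened <a class="timeline" href="/trac3/timeline?from=2016-07-18T14%3A46%3A43-04%3A00&amp;precision=second" title="See timeline at 07/18/16 14:46:43">6 weeks ago</a></p>"""
tree = html.fromstring(page)

# The XPath from this answer; the result is a list, one entry per match
titles = tree.xpath("//p[contains(text(),'Opened')]/a[@class='timeline']/@title")

# Assumed fixed prefix in front of the date
prefix = 'See timeline at '

# Drop the prefix, then keep the first whitespace-separated token (the date)
open_dates = [t[len(prefix):].split()[0] for t in titles if t.startswith(prefix)]

print(open_dates)  # ['07/18/16']
```

Keeping the result as a list leaves the "more than one match" decision to the caller.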

【Comments】:

【Solution 4】:

You can't do that. XPath selects tags directly, not the fields inside them. So "//p/a[text()]" returns every <a class="timeline" href="/trac3/timeline?from=2016-07-18T14%3A46%3A43-04%3A00&amp;precision=second" title="See timeline at 07/18/16 14:46:43">6 weeks ago</a>. Or you can select by condition, for example "//p/a[text() = '6 weeks ago']". So get this <a></a> tag, then parse it with Python.
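A minimal sketch of the approach described here, assuming lxml and the sample markup from the question: select the <a> element with XPath, then read and parse its title in Python. The regular expression is borrowed from Solution 1.

```python
import re
from lxml import html

page = """<p>Opened <a class="timeline" href="/trac3/timeline?from=2016-07-18T14%3A46%3A43-04%3A00&amp;precision=second" title="See timeline at 07/18/16 14:46:43">6 weeks ago</a></p>"""
tree = html.fromstring(page)

# Select the element itself by its text content
anchors = tree.xpath("//p/a[text() = '6 weeks ago']")

open_dates = []
for anchor in anchors:
    # Read the attribute from the element in Python
    title = anchor.get('title', '')
    match = re.search(r'\d+/\d+/\d+', title)
    if match:
        open_dates.append(match.group(0))

print(open_dates)  # ['07/18/16']
```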

【Comments】:

Are you saying XPath can't select attributes?