Using chained XPath expressions to extract the parent node: answers

【Question title】: Using chained xpath expressions to extract parent node
【Posted】: 2020-07-04 00:43:04
【Question】:

I want to extract the key names and values from the following HTML.

<ul>
    <li><span class="label">Key A:</span> Value A
    </li>
</ul>
<td>
    <span class="label">Key B:</span> Value B
</td>

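My strategy is to zoom in on span.label to get the key, then zoom out to extract the value from the parent's text. However, with the XPath selectors below I cannot extract the parent text, even though //span[@class="label"]/parent::*/text() produces the correct matches in Google Chrome.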

        for field in section.css('span.label'):
            key = field.xpath('./text()').get().strip()
            value = field.xpath('./parent::*/text()').get().strip()
            section_fields[key]=value

Am I making a mistake in how I chain the expressions?

Tags: xpath css-selectors

【Answer 1】:

Try this approach:

import lxml.html as lh

label = """[your html above]"""

doc = lh.fromstring(label)
for l in doc.xpath('//span[@class="label"]'):
    print(l.text.strip(),l.tail.strip())

Output:

Key A: Value A
Key B: Value B


【Answer 2】:

Well, you should fix your XPath:

./parent::*/text()[normalize-space()]

This skips whitespace-only text nodes. Alternatively, you can be more direct and use:

./following::text()[1]
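
To see why the unfiltered expression can come back empty, it may help to look at the text nodes the parent actually holds. The sketch below is only an illustration, not the asker's real page: it assumes lxml is installed and wraps the question's <td> in a <table> so the parser's handling of the cell is predictable. The variable names are made up for this example.

```python
import lxml.html

# Assumption: the question's markup, with the <td> placed inside a table.
html = """<ul>
    <li><span class="label">Key A:</span> Value A
    </li>
</ul>
<table><tr><td>
    <span class="label">Key B:</span> Value B
</td></tr></table>"""

tree = lxml.html.fromstring(html)
span_b = tree.xpath('//td/span[@class="label"]')[0]

# All text nodes that are direct children of the parent <td>.
# The indentation before <span> is itself a text node and comes first.
raw = span_b.xpath('./parent::*/text()')

# Same query, keeping only text nodes that contain non-whitespace characters.
filtered = span_b.xpath('./parent::*/text()[normalize-space()]')

# First text node that starts after the span has closed, in document order.
following = span_b.xpath('./following::text()[1]')

print([repr(t) for t in raw])
print(repr(filtered[0].strip()))
print(repr(following[0].strip()))
```

Taking only the first match of the unfiltered query (as .get() does in the question's loop) would return the whitespace-only node here, which strips down to an empty string. For Key A there is no whitespace between <li> and <span>, so the same query happens to work there.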

A complete example:

data = """<ul>
    <li><span class="label">Key A:</span> Value A
    </li>
</ul>
<td>
    <span class="label">Key B:</span> Value B
</td>"""

import lxml.html
tree = lxml.html.fromstring(data)

key=[]
value=[]
for field in tree.xpath('//span'):
    key.append(field.xpath('./text()')[0].strip())
    value.append(field.xpath('./parent::*/text()[normalize-space()]')[0].strip())

table=(list(zip(key,value)))

for a,b in table:
    print(a,b)

Output:

Key A: Value A
Key B: Value B
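
The same idea can be folded back into the dictionary-building loop from the question. The following is a minimal sketch under a few assumptions: it uses lxml rather than the asker's selector object, the helper name extract_fields is hypothetical, and it uses following-sibling::text()[1], a narrower variant of the following:: expression above that stays inside the span's own parent. It also guards against a label that has no trailing text.

```python
import lxml.html


def extract_fields(html):
    """Map each span.label text to the text node that follows it (hypothetical helper)."""
    tree = lxml.html.fromstring(html)
    fields = {}
    for span in tree.xpath('//span[@class="label"]'):
        # span.text is None when the span is empty, so fall back to "".
        key = (span.text or "").strip()
        # Nearest text node after this span within the same parent, if any.
        nodes = span.xpath('./following-sibling::text()[1]')
        fields[key] = nodes[0].strip() if nodes else ""
    return fields


# Assumption: the question's markup, with the <td> placed inside a table.
sample = """<ul>
    <li><span class="label">Key A:</span> Value A
    </li>
</ul>
<table><tr><td>
    <span class="label">Key B:</span> Value B
</td></tr></table>"""

fields = extract_fields(sample)
print(fields)
```

One reason to anchor on the span rather than on the parent: if a single parent held several labels, ./parent::*/text()[normalize-space()] with index 0 would return the first value for every label, whereas a per-span lookup pairs each key with its own value.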
