Using XPath to get text within a <td> cell using Python

【Question title】: Using XPath to get text within a <td> cell using Python
【Posted】: 2018-07-30 02:09:25
【Question】:

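I am currently learning how to use XPath to extract information from HTML documents. I am using Python and have no trouble getting values such as the page title, but when I try to get the text of a specific cell in a table, I just get an empty result back.

Here is my code. I used Chrome to copy the XPath of the table cell whose value I want.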

from lxml import html
import requests

page = requests.get('https://en.wikipedia.org/wiki/List_of_Olympic_Games_host_cities')
tree = html.fromstring(page.content)

#This will get the cell text:
location = tree.xpath('//*[@id="mw-content-text"]/div/table[1]/tbody/tr[1]/td[3]/text()')

print('Location: ', location)

【Comments】:

Tags: python html xpath web web-crawler

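【Solution 1】:

You should not use the tbody tag in your XPath expression, because developers often omit it from the HTML source and the browser adds it while rendering the page. You can try the following XPath to get the required values: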

location = tree.xpath('//*[@id="mw-content-text"]/div/table[1]//tr[not(parent::thead)]/td[3]/text()')

The output is:

Location:  ['Europe', 'Europe', 'North America', 'Europe', 'Europe', 'Europe', '
Europe', 'Europe', 'Europe', 'Europe', 'Europe', 'North America', 'North America
', 'Europe', 'Europe', 'Asia', '\nEurope', 'Asia', '\nEurope', 'Europe', 'Europe
', 'Europe', 'Europe', 'Europe', 'Europe', 'Europe', 'Oceania', '\nEurope', 'Nor
th America', 'Europe', 'Europe', 'Asia', 'Europe', 'North America', 'Asia', 'Eur
ope', 'Europe', 'North America', 'North America', 'Europe', 'Europe', 'North Ame
rica', 'North America', 'Asia', 'Europe', 'Europe', 'Europe', 'North America', '
Asia', 'Oceania', 'North America', 'Europe', 'Europe', 'Asia', 'North America',
'Europe', 'Europe', 'South America', 'Asia', 'Asia', 'Asia', 'Europe', 'North Am
erica']
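The tbody point can be checked offline. The sketch below is only an illustration: the `SAMPLE` markup is a made-up stand-in for the Wikipedia table (it is an assumption, not the real page source), written without `<tbody>` the way raw HTML often is. It also strips the stray newlines that show up in results such as '\nEurope'.

```python
from lxml import html

# Hypothetical stand-in for the real page: the raw markup contains no <tbody>.
SAMPLE = """<html><body>
<div id="mw-content-text"><div>
<table>
  <tr><th>City</th><th>Country</th><th>Continent</th></tr>
  <tr><td>Athens</td><td>Greece</td><td>Europe</td></tr>
  <tr><td>St. Louis</td><td>United States</td><td>
North America</td></tr>
</table>
</div></div>
</body></html>"""

tree = html.fromstring(SAMPLE)
base = '//*[@id="mw-content-text"]/div/table[1]'

# lxml does not insert <tbody>, so a path copied from the browser matches nothing.
with_tbody = tree.xpath(base + '/tbody/tr/td[3]/text()')

# Skipping tbody with // matches the rows whether or not a tbody exists.
without_tbody = tree.xpath(base + '//tr[not(parent::thead)]/td[3]/text()')

# Text nodes keep their surrounding whitespace, so strip it afterwards.
cleaned = [text.strip() for text in without_tbody]

print(with_tbody)     # []
print(without_tbody)  # ['Europe', '\nNorth America']
print(cleaned)        # ['Europe', 'North America']
```

Because `//tr` does not care what sits between the table and its rows, the same expression keeps working if the source does contain a `<tbody>`.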

【Comments】:

Hi, thanks for the tbody info. Makes complete sense. I wrote the other answer. :) I guess I have always just used // to get descendants, so I never noticed it.

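【Solution 2】:

Just took a quick look.

Try: tree.xpath('//*[@id="mw-content-text"]/div/table[1]/tr/td[3]/text()')

I think the page as rendered in Chrome differs somewhat from what requests returns (i.e. no tbody is needed, and specifying tr[1] yields an empty result). FYI, the XPath you provided checks out and works fine in Chrome.

See also Andersson's answer, but basically tbody can be added by the browser, so it is best not to use it in the path.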
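One plausible reason tr[1] yields an empty result is that the first row of the table is a header row built from th cells, so it has no td[3] at all. This has not been verified against the live page; the `PAGE` markup below is a hypothetical miniature used only to show the effect.

```python
from lxml import html

# Hypothetical miniature table: a header row of <th> cells, then data rows.
PAGE = """<html><body><table>
<tr><th>City</th><th>Country</th><th>Continent</th></tr>
<tr><td>Athens</td><td>Greece</td><td>Europe</td></tr>
<tr><td>Paris</td><td>France</td><td>Europe</td></tr>
</table></body></html>"""

tree = html.fromstring(PAGE)

# tr[1] is the header row; it contains <th>, not <td>, so this is empty.
first_row_td = tree.xpath('//table[1]/tr[1]/td[3]/text()')

# The same row does have a third <th>.
first_row_th = tree.xpath('//table[1]/tr[1]/th[3]/text()')

# Without a row index, every row that has a third <td> contributes a value.
all_rows_td = tree.xpath('//table[1]/tr/td[3]/text()')

print(first_row_td)  # []
print(first_row_th)  # ['Continent']
print(all_rows_td)   # ['Europe', 'Europe']
```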

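【Comments】:

Thank you both very much for the replies. If I only wanted to return a specific column of one row, rather than all values in that column, would you mind showing me an example of tree.xpath? For example, if I only wanted to print "Europe" from the first row. Thanks again!!
Sure, no problem. Try tree.xpath('//*[@id="mw-content-text"]/div/table[1]//tr[2]/td[3]/text()'). Note: // is shorthand for descendant-or-self; then we select the second tr, and then its third td.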
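Selecting a single cell can be sketched in a few ways. The example assumes a hypothetical table whose first row is a header, so the first data row is tr[2]; the `PAGE` markup is invented for illustration. Wrapping the path in `normalize-space()` makes `xpath()` return one trimmed string instead of a list of text nodes.

```python
from lxml import html

# Hypothetical table; the first data cell carries a leading newline on purpose.
PAGE = """<html><body><table>
<tr><th>City</th><th>Country</th><th>Continent</th></tr>
<tr><td>Athens</td><td>Greece</td><td>
Europe</td></tr>
<tr><td>Paris</td><td>France</td><td>Europe</td></tr>
</table></body></html>"""

tree = html.fromstring(PAGE)

# A node-set query returns a list, even when only one cell matches.
as_list = tree.xpath('//table[1]//tr[2]/td[3]/text()')

# normalize-space() turns the first matching node into a trimmed string.
as_string = tree.xpath('normalize-space(//table[1]//tr[2]/td[3])')

# Parentheses index the whole result: the first td[3] in document order.
first_cell = tree.xpath('(//table[1]//tr/td[3])[1]')[0]
first_text = first_cell.text_content().strip()

print(as_list)     # ['\nEurope']
print(as_string)   # Europe
print(first_text)  # Europe
```

Note that `//tr[2]` means "a tr that is the second tr child of its parent", not "the second tr in the document", which is why the parenthesized form is handy when the rows may be split across several parents.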