Empty list returned from IMDB using python lxml

【Question title】: Empty list is returned from IMDB using python lxml
【Posted】: 2016-05-08 05:12:02
【Question】:

I am trying to fetch the Top 250 movie list from IMDB using lxml, but it returns an empty list. Can anyone tell me what mistake I am making?

from lxml.html import parse
tree = parse('http://www.imdb.com/chart/top')
movies = tree.findall('.//table[2]//a')

The movies list is empty: []

【Comments】:

Tags: python web-scraping lxml imdb

【Answer 1】:

My guess is that the XPath you are using is wrong. According to Firebug, the correct XPath for the movie table is:

/html/body/div[1]/div/div[4]/div[3]/div/div[1]/div/span/div/div/div[2]/table/tbody

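This returns a table containing all of the movie data.

You will need further processing to extract the information for each movie.

I would also suggest using the requests library for the HTTP request.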
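
As a rough sketch of the two suggestions above (fetching with requests, then processing each row), something like the following might work. The helper names `fetch_tree` and `rows_to_records`, the User-Agent string, and the `chart` class on the table are assumptions for illustration, not details confirmed from the page. The row loop also avoids `tbody`: browsers add that element to the DOM even when the HTML source omits it, so a path copied from a browser tool may not match the tree that lxml builds.

```python
import requests
from lxml import html


def fetch_tree(url, timeout=10):
    """Download a page with requests and parse it with lxml."""
    # Placeholder User-Agent (an assumption); some sites reject the default one.
    headers = {"User-Agent": "Mozilla/5.0 (example scraper)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return html.fromstring(response.content)


def rows_to_records(tree):
    """Turn each data row of the chart table into a dict (assumed markup)."""
    records = []
    # tr[td] keeps only rows that have at least one <td>, skipping header rows.
    for row in tree.xpath("//table[contains(@class, 'chart')]//tr[td]"):
        cells = [cell.text_content().strip() for cell in row.xpath("./td")]
        links = row.xpath(".//a/@href")
        records.append({"cells": cells, "link": links[0] if links else None})
    return records
```

Calling `rows_to_records(fetch_tree('http://www.imdb.com/chart/top'))` would then give one dict per row, provided the live page actually uses that markup.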

【Comments】:

【Answer 2】:

Your XPath does not match any element on the linked page when I test it in the browser with FirePath (it reports "No matching nodes").
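
One plausible reason, illustrated here on made-up markup rather than the real IMDB page: `table[2]` selects a table that is the second `table` child of its own parent, not the second table in the whole document. A minimal sketch:

```python
from lxml import html

# Hypothetical markup: two tables, each the only table inside its own <div>.
SAMPLE = """
<html><body>
  <div><table><tr><td><a href="#a">first</a></td></tr></table></div>
  <div><table><tr><td><a href="#b">second</a></td></tr></table></div>
</body></html>
"""

doc = html.fromstring(SAMPLE)

# table[2] means "second table child of the same parent"; no parent here has two.
by_position = doc.findall(".//table[2]//a")

# Parenthesising first picks the second table in document order.
# This form needs xpath(); findall() only supports a subset of XPath.
second_overall = doc.xpath("(//table)[2]//a")

print(by_position)
print([a.text for a in second_overall])
```

Here `by_position` comes back empty while `second_overall` holds the link from the second table, which is the same symptom as in the question.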

Here is one approach that worked for me:

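from lxml.html import parse
# parse() accepts a URL and returns an ElementTree for the page
tree = parse('http://www.imdb.com/chart/top')
# Select the link text inside each title cell of the chart table
movies = tree.xpath("//table[contains(@class, 'chart')]//td[@class='titleColumn']/a/text()")
for movie in movies:
    print(movie)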

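It is better to use the xpath() method, which provides full support for XPath 1.0 expressions. A brief explanation of the XPath used above:

//table[contains(@class, 'chart')] : find a table element, anywhere in the HTML document, whose class attribute contains the text "chart"
//td[@class='titleColumn'] : then find td elements, anywhere inside that table, whose class attribute equals "titleColumn"
/a/text() : then, from each such td, find the child a element and return its text content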
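
The three steps can also be tried one at a time on a small hand-written fragment. The markup below is hypothetical and only mimics the class names from the XPath above; it is a sketch for experimenting offline, not a copy of the IMDB page.

```python
from lxml import html

# Hypothetical fragment that mimics the class names used in the XPath above.
SAMPLE = """
<html><body>
  <table class="chart full-width">
    <tr>
      <td class="posterColumn"><a href="#p1">poster</a></td>
      <td class="titleColumn"><a href="#t1">First Title</a></td>
    </tr>
    <tr>
      <td class="titleColumn"><a href="#t2">Second Title</a></td>
    </tr>
  </table>
  <table class="other">
    <tr><td class="titleColumn"><a href="#x">Ignored</a></td></tr>
  </table>
</body></html>
"""

doc = html.fromstring(SAMPLE)

# Step 1: tables whose class attribute contains the substring "chart".
tables = doc.xpath("//table[contains(@class, 'chart')]")

# Step 2: td cells anywhere inside that table whose class is exactly "titleColumn".
cells = tables[0].xpath(".//td[@class='titleColumn']")

# Step 3: text of the <a> children of each such cell.
titles = [text for cell in cells for text in cell.xpath("./a/text()")]

# The single combined expression selects the same text nodes.
combined = doc.xpath(
    "//table[contains(@class, 'chart')]//td[@class='titleColumn']/a/text()"
)

print(titles)
```

Note that `contains(@class, 'chart')` is a plain substring test, so it would also match a class such as `charts`; that is usually acceptable for a quick scrape but worth keeping in mind.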

Partial output of the snippet above:

The Shawshank Redemption
The Godfather
The Godfather: Part II
The Dark Knight
Pulp Fiction
.....

【Comments】: