HTML Scraper for Yahoo! Finance returning wrong value using lxml and requests

【Question title】: HTML Scraper for Yahoo! Finance returning wrong value using lxml and requests
【Posted】: 2019-02-28 01:13:07
【Question】:

from lxml import html
import requests

page = requests.get('https://finance.yahoo.com/quote/AMZN?p=AMZN&.tsrc=fin-srch')
tree = html.fromstring(page.content)

peRatio = tree.xpath('//span[@class="Trsdu(0.3s) "] [@data-reactid="92"]/text()')
peRatio

With [@data-reactid="92"], the code above gives me 2,075.74. Does anyone know why this happens? I expected 81.48, which is the value the page source for AMZN shows for that element.

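Edit: I found something odd. The ids I see in the HTML source are consistently off by 26. When I use [@data-reactid="66"], I get the value I want. Likewise, 118 - 92 = 26. The same offset holds in several other cases. Any idea why?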

【Comments】:

Looks like a bug. You can leave out the class and select on data-reactid alone, and it still gives the wrong answer. Your reading of the HTML is correct.
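Since neither the class nor the data-reactid seems stable here, one possible workaround is to anchor on the visible row label instead of the generated id. The following is only a minimal sketch under assumptions: the snippet is hand-written (the labels, the table layout, and the ids 64 and 90 are made up for illustration, and the real Yahoo! Finance markup may differ), `value_for_label` is a hypothetical helper, and it uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without network access. The same loop should carry over to a tree parsed with `lxml.html.fromstring`.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the quote summary table (assumed layout, not the real page).
SNIPPET = """
<table>
  <tr>
    <td><span data-reactid="64">PE Ratio (TTM)</span></td>
    <td><span class="Trsdu(0.3s) " data-reactid="66">81.48</span></td>
  </tr>
  <tr>
    <td><span data-reactid="90">1y Target Est</span></td>
    <td><span class="Trsdu(0.3s) " data-reactid="92">2,075.74</span></td>
  </tr>
</table>
"""


def value_for_label(root, label):
    """Return the text of the second cell in the row whose first cell shows `label`."""
    for row in root.iter("tr"):
        cells = row.findall("td")
        if len(cells) >= 2 and "".join(cells[0].itertext()).strip() == label:
            return "".join(cells[1].itertext()).strip()
    return None  # label not found


root = ET.fromstring(SNIPPET)
pe_ratio = value_for_label(root, "PE Ratio (TTM)")
print(pe_ratio)
```

The lookup no longer cares whether the P/E span carries id 66 or 92, only that its row starts with the expected label.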

Tags: python html request lxml

【Solution 1】:

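Looks like a bug? The data-reactid values come back sorted in ascending order and do not match the text values of the corresponding spans. The span text values are printed in document order, while the data-reactid values are sorted, so the two end up mismatched. For example, data-reactid 15 returns 1,636.40, which actually belongs to 41: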

<span class="Trsdu(0.3s) " data-reactid="41">1,636.40</span>

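I'm checking my old code to see whether your XPath can be fixed to work. I'll update in a few hours if I find a proper solution, or delete this answer if someone else posts one. Here is how I see the problem: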

print(tree.xpath('//span[@class="Trsdu(0.3s) "]/text()'))
print(tree.xpath('//span[@class="Trsdu(0.3s) "]/@data-reactid'))

['1,636.40', '1,628.18', '1,639.00 x 900', '1,640.25 x 900', '3,148,824', '6,293,333', '806.108B', '1.71', '81.48', '20.14', 'N/A', '2,075.74']
['15', '20', '25', '30', '43', '48', '56', '61', '66', '71', '87', '92']
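To make the mismatch easier to inspect, the two lists printed above can be paired up and compared with the ids reported from the browser view. This is a rough sketch, assuming both XPath queries return their matches in document order (so the i-th id belongs to the i-th text). The `browser_view` dictionary is a hypothetical name and holds only the two id/value pairs stated explicitly in this thread.

```python
# Output of the two XPath queries above, copied verbatim.
texts = ['1,636.40', '1,628.18', '1,639.00 x 900', '1,640.25 x 900',
         '3,148,824', '6,293,333', '806.108B', '1.71', '81.48', '20.14',
         'N/A', '2,075.74']
ids = ['15', '20', '25', '30', '43', '48', '56', '61', '66', '71', '87', '92']

# Assumption: both lists are in document order, so zip pairs each id with its own span text.
fetched = dict(zip(ids, texts))
print(fetched['66'])  # value carried by id 66 in the fetched page
print(fetched['92'])  # value carried by id 92 in the fetched page

# Ids as seen in the browser, per this thread: 41 for 1,636.40 and 92 for 81.48.
browser_view = {'41': '1,636.40', '92': '81.48'}

# For each value, subtract the fetched id from the browser id.
offsets = {
    int(browser_id) - int(fetched_id)
    for browser_id, value in browser_view.items()
    for fetched_id, text in fetched.items()
    if text == value
}
print(offsets)
```

Under that assumption both pairs differ by the same constant, 26, which agrees with the 118 - 92 = 26 observation in the question's edit. It suggests the ids in the page that requests receives are shifted relative to what the browser shows, rather than the text values being shuffled.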

【Comments】: