Scraping the Analysis tab in Yahoo Finance with Python: answers

【Question title】: Scraping in Yahoo Finance the Analysis tab with Python
【Posted】: 2020-06-08 15:13:24
【Question】:

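I am trying to extract the "Next 5 Years (per annum)" value for the stock BABA from the Analysis tab of Yahoo Finance: https://finance.yahoo.com/quote/BABA/analysis?p=BABA. (It is 2.85%, in the second-to-last row.)

I have been trying to follow these questions: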

Scrape Yahoo Finance Financial Ratios

Scrape Yahoo Finance Income Statement with Python

But I cannot even extract the data from the page.

I also tried this website:

https://hackernoon.com/scraping-yahoo-finance-data-using-python-ayu3zyl

This is the code I wrote to get the page data.

First, import the packages:

import requests
from bs4 import BeautifulSoup

Then try to extract the data from the page:

Url= "https://finance.yahoo.com/quote/BABA/analysis?p=BABA"
r = requests.get(Url)
data = r.text
soup = BeautifulSoup(data,features="lxml")

When I check the types of the data and soup objects, I see:

type(data)
<class 'str'>

From this string I could probably extract the data I need for the "Next 5 Years" row with a regular expression.

But when I look at

type(soup)
<class 'bs4.BeautifulSoup'>

for some reason the data in it seems unrelated to the page.

It looks like this (I copied only a small part of the soup object's contents):

soup
<!DOCTYPE html>
<html class="NoJs featurephone" id="atomic" lang="en-US"><head prefix="og: 
http://ogp.me/ns#"><script>window.performance && window.performance.mark &&  
window.performance.mark('PageStart');</script><meta charset="utf-8"/> 
<title>Alibaba Group Holding Limited (BABA) Analyst Ratings, Estimates &amp; 
Forecasts - Yahoo Finance</title><meta con 
tent="recommendation,analyst,analyst 
rating,strong buy,strong 
sell,hold,buy,sell,overweight,underweight,upgrade,downgrade,price target,EPS 
estimate,revenue estimate,growth estimate,p/e 
estimate,recommendation,analyst,analyst rating,strong buy,strong 
sell,hold,buy,sell,overweight,underweight,upgrade,downgrade,price target,EPS 
estimate,revenue estimate,growth estimate,p/e estimate" name="keywords"/> 
<meta   content="on" http-equiv="x-dns-prefetch-control"/><meta content="on" 
property="twitter:dnt"/><meta content="90376669494" property="fb:app_id"/> 
<meta content="#400090" name="theme-color"/><meta content="width=device- 
width,

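Is there another way to extract the data I need from the data object without using regular expressions?
How can the soup object help me extract the data (I see it used a lot, but I do not know how to make it useful)?

Thanks in advance.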

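【Comments】:

You can use the soup object to find the corresponding elements on your page. Follow the steps described in this tutorial: link.
I tried the link, but my soup object does not look like the one in the link you added (I edited the question a bit so you can see what it looks like). The soup object does not seem to contain the data from the page, while for some reason the data object does contain the information from the page.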
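
To illustrate what "finding elements with the soup object" can look like, here is a minimal sketch. It parses a made-up HTML table, not Yahoo's actual markup, and every value in it except 2.85% is invented for the example. The helper name `find_row_value` is hypothetical. This approach only works when the value is present as an HTML element in the downloaded text, which, judging by the answer below, may not be the case for this page.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real page is structured differently.
SAMPLE_HTML = """
<table>
  <tr><td>Next Year</td><td>24.10%</td></tr>
  <tr><td>Next 5 Years (per annum)</td><td>2.85%</td></tr>
  <tr><td>Past 5 Years (per annum)</td><td>22.98%</td></tr>
</table>
"""


def find_row_value(html, label):
    """Return the text of the second cell in the row whose first cell equals `label`."""
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) >= 2 and cells[0] == label:
            return cells[1]
    return None  # no row with that label


value = find_row_value(SAMPLE_HTML, "Next 5 Years (per annum)")
print(value)
```

The built-in "html.parser" backend is used here so the sketch does not depend on lxml being installed.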

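【Solution 1】:

One solution is to use a regular expression to extract the value from the JSON data embedded in the page's JavaScript. The JSON data is assigned to the following variable: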

root.App.main = { .... };

Example:

import json
import re

import requests

# Fetch the Analysis page for BABA.
r = requests.get("https://finance.yahoo.com/quote/BABA/analysis?p=BABA")

# Capture the JSON object assigned to root.App.main and parse it.
data = json.loads(re.search(r'root\.App\.main\s*=\s*(.*);', r.text).group(1))

# Pick the earnings-trend entry for the 5-year period.
trends = data["context"]["dispatcher"]["stores"]["QuoteSummaryStore"]["earningsTrend"]["trend"]
field = [t for t in trends if t["period"] == "+5y"][0]

print(field)
print("Next 5 Years (per annum) : " + field["growth"]["fmt"])
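
The example above raises an exception if the pattern is not found or the expected keys are missing, for instance if the page layout changes. A more defensive variant could look like the sketch below. The function name `extract_growth` and the `SAMPLE_PAGE` string are hypothetical: the sample is a heavily trimmed stand-in for the real page source, and the "+1y" entry in it is invented, so the snippet runs without a network request.

```python
import json
import re

# Same pattern as in the answer above, compiled once.
PATTERN = re.compile(r"root\.App\.main\s*=\s*(.*);")


def extract_growth(page_source, period="+5y"):
    """Return the formatted growth estimate for `period`, or None if it cannot be found."""
    match = PATTERN.search(page_source)
    if match is None:
        return None  # the variable is not present in this page source
    try:
        data = json.loads(match.group(1))
        stores = data["context"]["dispatcher"]["stores"]
        trends = stores["QuoteSummaryStore"]["earningsTrend"]["trend"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # not valid JSON, or not the expected structure
    for trend in trends:
        if trend.get("period") == period:
            return trend.get("growth", {}).get("fmt")
    return None  # no entry for the requested period


# Hypothetical stand-in for r.text; only the keys used above are included.
SAMPLE_PAGE = (
    '<script>root.App.main = {"context": {"dispatcher": {"stores": '
    '{"QuoteSummaryStore": {"earningsTrend": {"trend": ['
    '{"period": "+1y", "growth": {"raw": 0.1, "fmt": "10.00%"}}, '
    '{"period": "+5y", "growth": {"raw": 0.0285, "fmt": "2.85%"}}'
    ']' + '}' * 6 + ';</script>'
)

print(extract_growth(SAMPLE_PAGE))
print(extract_growth("<html>no embedded data</html>"))
```

With a live response, the call would be `extract_growth(r.text)`. Returning None instead of raising is a design choice for this sketch; raising a descriptive error would work equally well.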

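【Comments】:

Looks like it works, thanks @Bertrand Martel! May I ask about the URL you passed into the r object: link. It is not the same as the one I posted; how did you get or find it?
@TaL The original URL works too; I had just tested with the last query parameter removed.
This URL, "finance.yahoo.com/quote/BABA/analysis?p=BABA"? As for the JSON part, if you open the Chrome developer console and inspect the HTML, you can search for the variable root.App.main. 'root\.App\.main\s*=\s*(.*);' is a regular expression that extracts everything after "root.App.main =" and before ";".
If I may, can I ask you two more questions: how did you know to use the json.loads method rather than something like a BeautifulSoup object? And how did you know to put 'root\.App\.main\s*=\s*(.*);' in there? Thanks again.
re.search('root\.App\.main\s*=\s*(.*);', r.text).group(1) returns the JSON text. To parse that JSON, I used json.loads(jsontext).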
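
The two steps described in the last comment can be seen in isolation in the small sketch below. The `source` string is a hypothetical one-line stand-in for the script content on the page. The capture group yields a plain string, and `json.loads` turns that string into a Python dict that can be indexed by key.

```python
import json
import re

# Hypothetical one-line stand-in for the script content on the page.
source = 'root.App.main = {"context": {"ok": true}};'

# Step 1: the capture group returns the text between "root.App.main =" and the last ";" on the line.
json_text = re.search(r"root\.App\.main\s*=\s*(.*);", source).group(1)
print(type(json_text).__name__, json_text)

# Step 2: json.loads parses that text into nested Python objects.
parsed = json.loads(json_text)
print(type(parsed).__name__, parsed["context"]["ok"])
```

Because `(.*)` is greedy and `.` does not match a newline by default, the group extends to the last ";" on the same line, so this relies on the whole assignment sitting on a single line.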