Scraping tiingo HTML with Beautiful Soup

【Question title】: Scraping tiingo HTML with Beautiful Soup
【Posted】: 2016-08-20 22:35:41
【Question】:

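I want to scrape financial data for a number of S&P 500 companies from their respective pages on tiingo.com.

For example, take the following URL:

https://www.tiingo.com/f/b/aapl

which shows Apple's most recent balance sheet data.

I want to extract the "Property, Plant & Equipment" amount for the most recent quarter, which in this case is 25.45B. However, I can't work out the correct Beautiful Soup code to extract this text.

Inspecting the element, I see that the 25.45B figure sits in an element with class "ng-binding ng-scope", inside an element with class "col-xs-6 col-sm-3 col-md-3 col-lg-3 statement-field-data ng-scope", which is itself inside an element with class "col-xs-7 col-sm-8 col-md-8 col-lg-9 no-padding-left no-padding-right".

However, I'm not sure exactly how to write the Beautiful Soup code that locates the right element and then calls element.getText() on it.

I was thinking of something like this: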

import os, bs4, requests

res_bal = requests.get("https://www.tiingo.com/f/b/aapl")

res_bal.raise_for_status()

soup_bal = bs4.BeautifulSoup(res_bal.text, "html.parser")

elems_bal = soup_bal.select(".col-xs-6 col-sm-3 col-md-3 col-lg-3 statement-field-data ng-scope")

elems_bal_2 = elems_bal.select(".ng-binding ng-scope")

joe = elems_bal_2.getText()

print(joe)

So far I haven't had any success with this code. Any help would be much appreciated!

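【Comments】:

The content is loaded with JavaScript, so it is not in the source you get back.
Also, soup_bal.select(".col-xs-6 col-sm-3 col-md-3 col-lg-3 statement-field-data ng-scope") is not even close to being correct. You might want to read the docs: crummy.com/software/BeautifulSoup/bs4/doc
I'm the founder of Tiingo, and this kind of scraping is against the terms. Just buy a personal license from quandl.com/sf1 for $50 a month. The founder of Sharadar is a good guy and works very hard to keep this dataset clean.

Tags: html python-3.x web-scraping beautifulsoup tiingo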

【解决方案1】：

The problem with the selectors

elems_bal = soup_bal.select(".col-xs-6 col-sm-3 col-md-3 col-lg-3 statement-field-data ng-scope")

elems_bal_2 = elems_bal.select(".ng-binding ng-scope")

is that several elements on the page share the same classes, so you don't get the right result.
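The selector strings in the question also have a syntax problem, as one of the comments hints. In a CSS selector a space means "descendant" and a bare word is a tag name, so the classes of a single element have to be chained with dots. Separately, select() returns a list of matches, so a second .select() or .getText() cannot be called on its result directly. The following is only a minimal sketch: SAMPLE_HTML is a hypothetical, simplified fragment built from the class names quoted in the question, and select_texts is an illustrative helper, not part of Beautiful Soup.

```python
import bs4

# Hypothetical, simplified markup modelled on the class names in the question.
SAMPLE_HTML = """
<div class="col-xs-6 col-sm-3 col-md-3 col-lg-3 statement-field-data ng-scope">
  <a class="ng-binding ng-scope">25.45B</a>
</div>
"""


def select_texts(html, selector):
    """Return the stripped text of every element matching a CSS selector."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    # select() returns a list, so each match is handled individually.
    return [el.get_text(strip=True) for el in soup.select(selector)]


# Spaces separate descendants and bare words are tag names, so this looks
# for an <ng-scope> tag nested inside <statement-field-data>, and so on.
broken = select_texts(
    SAMPLE_HTML,
    ".col-xs-6 col-sm-3 col-md-3 col-lg-3 statement-field-data ng-scope",
)

# Classes on the same element are chained with dots; the single space
# then steps down to the nested element that holds the number.
fixed = select_texts(
    SAMPLE_HTML,
    ".col-xs-6.statement-field-data.ng-scope .ng-binding.ng-scope",
)

print(broken)
print(fixed)
```

On the live page this corrected selector would still match every data cell, which is the ambiguity described above, and it only helps once the JavaScript-rendered HTML is available.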

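Note that if you use only Beautiful Soup and requests, the page source you get back does not contain the data you want to scrape. You can get it by combining selenium with Beautiful Soup. If you don't have selenium, install it first: pip install selenium

Here is working code for this: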

from selenium import webdriver
import bs4
import time

driver = webdriver.Firefox()
driver.get("https://www.tiingo.com/f/b/aapl")
driver.maximize_window()
# Sleep so that the page's JavaScript has time to populate the data
time.sleep(10)
page_source = driver.page_source
driver.quit()  # the rendered HTML is captured, so the browser can be closed

soup = bs4.BeautifulSoup(page_source, "html.parser")

# Print the label of every row whose name mentions "Property"
name_cells = soup.find_all('div', {'class': 'col-xs-5 col-sm-4 col-md-4 col-lg-3 statement-field-name indent-2'})
for cell in name_cells:
    if 'Property' in cell.text.strip():
        print(cell.text)

# The value is the link whose ng-click handler names the field
x = soup.find("a", {"ng-click": "toggleFundData('Property, Plant & Equipment',SDCol.restatedString==='restated',true)"})
print(x.text)

The output is:

Property, Plant & Equipment
25.45B
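A possible alternative to matching the full ng-click string is to find the row by its label and then read the first data cell in that row. This is only a sketch under assumptions: SAMPLE_HTML is a hypothetical stand-in for driver.page_source with shortened class lists and a made-up second row, the assumption that the label and its data cells share one parent element may not hold on the real page, and find_field_value is an illustrative helper name.

```python
import bs4

# Hypothetical stand-in for driver.page_source; the real markup may differ.
# The "Example Field" row and its value are made up for illustration.
SAMPLE_HTML = """
<div class="row">
  <div class="col-xs-5 statement-field-name indent-2">Property, Plant &amp; Equipment</div>
  <div class="col-xs-7 no-padding-left">
    <div class="col-xs-6 statement-field-data ng-scope">
      <a class="ng-binding ng-scope">25.45B</a>
    </div>
  </div>
</div>
<div class="row">
  <div class="col-xs-5 statement-field-name indent-2">Example Field</div>
  <div class="col-xs-7 no-padding-left">
    <div class="col-xs-6 statement-field-data ng-scope">
      <a class="ng-binding ng-scope">1.00B</a>
    </div>
  </div>
</div>
"""


def find_field_value(html, label):
    """Return the first data cell's text for the row named `label`, or None."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    for name_div in soup.select("div.statement-field-name"):
        if name_div.get_text(strip=True) == label:
            # Assumption: the label and its data cells share a parent row.
            row = name_div.parent
            cell = row.select_one("div.statement-field-data")
            return cell.get_text(strip=True) if cell else None
    return None


print(find_field_value(SAMPLE_HTML, "Property, Plant & Equipment"))
```

Matching on the visible label keeps the lookup independent of the exact text of the Angular click handler, at the cost of depending on the row layout instead.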

【Comments】:

Thanks for the reply!!