Python BeautifulSoup - Scraping Google Finance historical data: answers

【Question title】: Python BeautifulSoup - Scraping Google Finance historical data
【Posted】: 2016-08-19 08:02:46
【Question】:

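I am trying to scrape historical data from Google Finance. I need the total number of rows, which is displayed next to the pagination controls. This is the div tag that shows the total number of rows: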

<div class="tpsd">1 - 30 of 1634 rows</div>

I tried the following code to get the data, but it returns an empty list:

soup.find_all('div', 'tpsd')

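I also tried to fetch the whole table, without success. When I inspected the page source, I found the value inside a JavaScript function. Searching Google for how to read a value from a script tag, I saw regular expressions suggested, so I tried one. Here is my code: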

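import re

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.google.com/finance/historical?cid=13564339&startdate=Jan+01%2C+2010&enddate=Aug+18%2C+2016&num=30&ei=ilC1V6HlPIasuASP9Y7gAQ')
soup = BeautifulSoup(r.content, 'lxml')
# The pagination values sit in an inline script (the ninth <script> tag on this page).
var = soup.find_all("script")[8].string
# Capture everything between "applyPagination(" and the URL argument.
a = re.compile(r"google\.finance\.applyPagination\((.*)'http", re.DOTALL)
b = a.search(var)
num = b.group(1)
# Drop the commas and take the fourth line, which holds the total row count.
print(num.replace(',', '').split('\n')[3])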

I get the value I want, but I am not sure whether the code above is the right way to do it, or whether there is a better approach. Please help.
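One way to make the script-tag approach less dependent on line positions is to capture the numeric arguments directly with a regular expression. The sketch below is only an illustration: `SAMPLE_SCRIPT` is a hypothetical excerpt, not the real page, and reading the three leading arguments as start offset, page size and total is an assumption inferred from the indexing used in this thread.

```python
import re

# Hypothetical excerpt of the inline script; the real page may differ.
SAMPLE_SCRIPT = """
google.finance.applyPagination(
0,
30,
1634,
'http://www.google.com/finance/historical?cid=13564339',
);
"""

# Three integer arguments right after the opening parenthesis.
PAGINATION_RE = re.compile(
    r"google\.finance\.applyPagination\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,"
)


def parse_pagination(script_text):
    """Return (start, page_size, total), or None if the call is not found."""
    match = PAGINATION_RE.search(script_text)
    if match is None:
        return None
    start, page_size, total = (int(group) for group in match.groups())
    return start, page_size, total


print(parse_pagination(SAMPLE_SCRIPT))
```

Returning None when the call is missing lets the caller detect a changed page layout instead of failing on an index error.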

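【Comments】:

What do you mean by doubting whether the code above is correct? Does it do what you need?
@PadraicCunningham Yes. I got the value I wanted from the script tag, but I could not get it from the div tag. Is there a way to get the value from the div tag?
If you want to parse the page as you see it in your browser, you need something like Selenium that can run JavaScript. Are you trying to parse the table, or what exactly?
@PadraicCunningham No, I am trying to get the total row count shown next to the pagination. I thought there might be some way to get the value from the div tag.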

Tags: javascript python beautifulsoup

【Solution 1】:

You can simply pass an offset (i.e. start=...) in the URL to fetch 30 rows at a time, which is exactly what the pagination logic does:

from bs4 import BeautifulSoup
import requests

url = "https://www.google.com/finance/historical?cid=13564339&startdate=Jan+01%2C+2010&" \
      "enddate=Aug+18%2C+2016&num=30&ei=ilC1V6HlPIasuASP9Y7gAQ&start={}"


with requests.session() as s:
    start = 0
    req = s.get(url.format(start))
    soup = BeautifulSoup(req.content, "lxml")
    table = soup.select_one("table.gf-table.historical_price")
    all_rows = table.find_all("tr")
    while True:
        start += 30
        soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
        table = soup.select_one("table.gf-table.historical_price")
        if not table:
            break
        all_rows.extend(table.find_all("tr"))
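The loop above keeps requesting higher offsets until a page comes back without a table. A minimal, network-free sketch of that same stop-on-empty pattern is shown here; `collect_pages`, `fake_fetch` and `FAKE_ROWS` are hypothetical names that stand in for the HTTP request and the parsed rows.

```python
from typing import Callable, List


def collect_pages(fetch_page: Callable[[int], List[str]], page_size: int = 30) -> List[str]:
    """Request successive offsets until a page comes back empty."""
    rows: List[str] = []
    start = 0
    while True:
        page = fetch_page(start)
        if not page:  # an empty page means there is nothing left to fetch
            break
        rows.extend(page)
        start += page_size
    return rows


# Stand-in data source: 70 rows served 30 at a time.
FAKE_ROWS = [f"row {i}" for i in range(70)]


def fake_fetch(start: int) -> List[str]:
    return FAKE_ROWS[start:start + 30]


print(len(collect_pages(fake_fetch)))
```

Handling the first page inside the loop also covers the case where even the first request returns nothing.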

You can also read the total number of rows from the script tag and use it with range:

import re  # used below to locate the script tag

with requests.session() as s:
    req = s.get(url.format(0))
    soup = BeautifulSoup(req.content, "lxml")
    table = soup.select_one("table.gf-table.historical_price")
    scr = soup.find("script", text=re.compile('google.finance.applyPagination'))
    total = int(scr.text.split(",", 3)[2])
    all_rows = table.find_all("tr")

    for start in range(30, total+1, 30):
        soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
        table = soup.select_one("table.gf-table.historical_price")
        all_rows.extend(table.find_all("tr"))
print(len(all_rows))

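num=30 is the number of rows per page. To make fewer requests you can set it to 200, which seems to be the maximum, and compute the step/offset from that value.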

url = "https://www.google.com/finance/historical?cid=13564339&startdate=Jan+01%2C+2010&" \
      "enddate=Aug+18%2C+2016&num=200&ei=ilC1V6HlPIasuASP9Y7gAQ&start={}"


with requests.session() as s:
    req = s.get(url.format(0))
    soup = BeautifulSoup(req.content, "lxml")
    table = soup.select_one("table.gf-table.historical_price")
    scr = soup.find("script", text=re.compile('google.finance.applyPagination'))
    total = int(scr.text.split(",", 3)[2])
    all_rows = table.find_all("tr")
    for start in range(200, total+1, 200):
        soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
        print(url.format(start))
        table = soup.select_one("table.gf-table.historical_price")
        all_rows.extend(table.find_all("tr"))

If we run the code, you will see that we get 1643 rows:

In [7]: with requests.session() as s:
   ...:         req = s.get(url.format(0))
   ...:         soup = BeautifulSoup(req.content, "lxml")
   ...:         table = soup.select_one("table.gf-table.historical_price")
   ...:         scr = soup.find("script", text=re.compile('google.finance.applyPagination'))
   ...:         total = int(scr.text.split(",", 3)[2])
   ...:         all_rows = table.find_all("tr")
   ...:         for start in range(200, total+1, 200):
   ...:                 soup = BeautifulSoup(s.get(url.format(start)).content, "lxml")
   ...:                 table = soup.select_one("table.gf-table.historical_price")
   ...:                 all_rows.extend(table.find_all("tr"))
   ...:         print(len(all_rows))
   ...:         

1643

In [8]:
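The 1643 printed above is larger than the 1634 shown in the pagination div. One plausible explanation, assuming every page's table contains exactly one header tr, is that find_all("tr") also counts a header row per page. The arithmetic can be checked with a short sketch; `page_offsets` and `expected_tr_count` are hypothetical helper names.

```python
import math


def page_offsets(total, page_size):
    """Offsets of every page needed to cover `total` rows."""
    return list(range(0, total, page_size))


def expected_tr_count(total, page_size, header_rows_per_page=1):
    """Data rows plus the assumed header row(s) on each page."""
    pages = math.ceil(total / page_size)
    return total + pages * header_rows_per_page


print(page_offsets(1634, 200))
print(expected_tr_count(1634, 200))
```

With 1634 rows and 200 per page there are 9 pages, so 1634 + 9 = 1643. Note that range(0, total, page_size) stops before total, which avoids one extra request when the total is an exact multiple of the page size.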

【Comments】:

【Solution 2】:

You can simply use a Python module: https://pypi.python.org/pypi/googlefinance

The API is simple:

# The Google Finance API wrapper that we need.
from googlefinance import getQuotes
# The JSON handler, used here to pretty-print and re-parse the result.
import json


intelJSON = getQuotes('INTC')

# Serialize the result to indented JSON text.
intelDump = json.dumps(intelJSON, indent=2)

# Parse the JSON text back into Python objects.
intelInfo = json.loads(intelDump)

intelPrice = intelInfo[0]['LastTradePrice']
intelTime = intelInfo[0]['LastTradeDateTimeLong']

print("As of " + intelTime + ", Intel stock is trading at: " + intelPrice)

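【Comments】:

Can I also get historical data?
The GitHub page says: "This module provides no delay, real time stock data in NYSE & NASDAQ." I guess it does not provide historical data.
Thanks a lot, @Rich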

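【Solution 3】:

I prefer to have all the raw CSV files that can be downloaded from Google Finance. I wrote a quick Python script that automatically downloads all historical price information for a list of companies; it is equivalent to a person manually using the "Download to spreadsheet" link.

Here is the GitHub repository, which contains the downloaded CSV files for all S&P 500 stocks (in the rawCSV folder): https://github.com/liezl200/stockScraper

It uses this link: http://www.google.com/finance/historical?q=googl&startdate=May+3%2C+2012&enddate=Apr+30%2C+2017&output=csv. The key here is the last parameter, output=csv. I use urllib.urlretrieve(download_url, local_csv_filename) to retrieve the CSV (in Python 3 the function is urllib.request.urlretrieve).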
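A rough Python 3 sketch of this approach follows. It builds the same kind of URL from its parts and leaves the actual download in a separate function. `build_csv_url` and `download_csv` are hypothetical helper names, not taken from the linked repository, and the sketch assumes the endpoint still responds as described in this answer.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

BASE_URL = "http://www.google.com/finance/historical"


def build_csv_url(symbol, start_date, end_date):
    """Build the download URL; output=csv asks for CSV instead of HTML."""
    query = urlencode({
        "q": symbol,
        "startdate": start_date,
        "enddate": end_date,
        "output": "csv",
    })
    return f"{BASE_URL}?{query}"


def download_csv(symbol, start_date, end_date, filename):
    """Save the CSV to a local file (performs a network request)."""
    urlretrieve(build_csv_url(symbol, start_date, end_date), filename)


print(build_csv_url("googl", "May 3, 2012", "Apr 30, 2017"))
```

urlencode takes care of escaping the spaces and commas in the dates, so they can be written in readable form.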

【Comments】:

I had this idea, but it could take some time whenever I want to update the data. Thanks for your reply.