【Question title】: Can't scrape a link connected to some download button from a page using requests
【Posted】: 2020-12-15 22:23:42
【Question description】:

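I'm trying to download a csv file from a webpage using the requests module. The idea is to parse the link connected to the download button so that I can use that link to download the csv file. The link I'm trying to grab is dynamic, but so far I've noticed there is usually some way to find such links. This time, however, I can't make it work.

I've tried: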

import requests
from bs4 import BeautifulSoup

link = "https://finance.yahoo.com/quote/AAPL/history?p=AAPL"

r = requests.get(link)
soup = BeautifulSoup(r.text,"html.parser")
file_link = soup.select_one("a[href='/finance/download/']").get("href")
print(file_link)

With the above attempt, the script throws an AttributeError because select_one() finds no matching link in the page and returns None.
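To illustrate the failure mode, here is a minimal sketch that looks for the download anchor in static HTML and guards against the missing-link case instead of calling `.get()` on `None`. It uses the standard library's `html.parser` in place of BeautifulSoup so it runs without extra packages; the sample HTML and the `find_download_link` helper are hypothetical and not part of the original post.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in the markup it is fed."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def find_download_link(html, needle="/finance/download/"):
    """Return the first href containing `needle`, or None if there is none."""
    collector = LinkCollector()
    collector.feed(html)
    return next((h for h in collector.hrefs if needle in h), None)


# Hypothetical stand-in for the static HTML returned by requests.get():
# the download anchor is absent because a script would add it later.
static_html = '<html><body><a href="/quote/AAPL">AAPL</a></body></html>'

file_link = find_download_link(static_html)
if file_link is None:
    print("No download link in the static HTML")
else:
    print(file_link)
```

With BeautifulSoup the equivalent guard would be to check the result of `select_one()` for `None` before calling `.get("href")`.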

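How can I get the download link from that page using requests?

【Comments】:

  • Use selenium to download the file
  • Thanks for your suggestion @Vin, but I'd rather not use selenium. Thanks.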

Tags: python python-3.x web-scraping beautifulsoup python-requests


【Solution 1】:

It seems the link to download the CSV is built dynamically via JavaScript. But you can construct a similar link with Python:

import requests
from datetime import datetime


csv_link = 'https://query1.finance.yahoo.com/v7/finance/download/{quote}?period1={from_}&period2={to_}&interval=1d&events=history'

quote = 'AAPL'
# Note: '%s' is a platform-specific strftime extension and raises ValueError on some systems
from_ = datetime(2019,9,27,0,0).strftime('%s')
to_ = datetime(2020,9,27,23,59).strftime('%s')

print(requests.get(csv_link.format(quote=quote, from_=from_, to_=to_)).text)

Prints:

Date,Open,High,Low,Close,Adj Close,Volume
2019-09-27,220.539993,220.960007,217.279999,218.820007,216.670242,25352000
2019-09-30,220.899994,224.580002,220.789993,223.970001,221.769623,25977400
2019-10-01,225.070007,228.220001,224.199997,224.589996,222.383545,34805800
2019-10-02,223.059998,223.580002,217.929993,218.960007,216.808853,34612300
2019-10-03,218.429993,220.960007,215.130005,220.820007,218.650574,28606500
2019-10-04,225.639999,227.490005,223.889999,227.009995,224.779770,34619700
2019-10-07,226.270004,229.929993,225.839996,227.059998,224.829269,30576500
2019-10-08,225.820007,228.059998,224.330002,224.399994,222.195404,27955000
2019-10-09,227.029999,227.789993,225.639999,227.029999,224.799576,18692600
2019-10-10,227.929993,230.440002,227.300003,230.089996,227.829498,28253400
2019-10-11,232.949997,237.639999,232.309998,236.210007,233.889374,41698900
2019-10-14,234.899994,238.130005,234.669998,235.869995,233.552719,24106900
2019-10-15,236.389999,237.649994,234.880005,235.320007,233.008133,21840000
2019-10-16,233.369995,235.240005,233.199997,234.369995,232.067444,18475800
2019-10-17,235.089996,236.149994,233.520004,235.279999,232.968521,16896300
2019-10-18,234.589996,237.580002,234.289993,236.410004,234.087433,24358400
2019-10-21,237.520004,240.990005,237.320007,240.509995,238.147125,21811800
2019-10-22,241.160004,242.199997,239.619995,239.960007,237.602539,20573400
2019-10-23,242.100006,243.240005,241.220001,243.179993,240.790909,18957200

...and so on.
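The response body is plain CSV text, so it can be parsed with the standard `csv` module. A minimal sketch, with the first two rows of the output above pasted in as sample data so that no network request is needed; `parse_history` is a hypothetical helper name, and in practice you would pass it the `.text` of the response:

```python
import csv
import io

# First rows of the output shown above, used as sample input.
sample_csv = """Date,Open,High,Low,Close,Adj Close,Volume
2019-09-27,220.539993,220.960007,217.279999,218.820007,216.670242,25352000
2019-09-30,220.899994,224.580002,220.789993,223.970001,221.769623,25977400
"""


def parse_history(csv_text):
    """Turn the CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


rows = parse_history(sample_csv)
for row in rows:
    # All values arrive as strings, so convert the numeric columns explicitly.
    print(row["Date"], float(row["Close"]), int(row["Volume"]))
```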

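Edit: `strftime('%s')` is not supported on every platform, so compute the UNIX timestamp with `datetime.timestamp()` instead: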

import requests
from datetime import datetime


csv_link = 'https://query1.finance.yahoo.com/v7/finance/download/{quote}?period1={from_}&period2={to_}&interval=1d&events=history'

quote = 'AAPL'
# Whole-second UNIX timestamps for the start and end of the range
from_ = int(datetime(2019,9,27,0,0).timestamp())
to_ = int(datetime(2020,9,27,23,59).timestamp())

print(requests.get(csv_link.format(quote=quote, from_=from_, to_=to_)).text)
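One detail worth noting: `timestamp()` on a naive `datetime` interprets it in the machine's local time zone, so the resulting `period1`/`period2` values can differ between machines. A minimal sketch that pins the dates to UTC and builds the query string with `urllib.parse.urlencode`; the helper names `to_unix` and `build_csv_url` are hypothetical, and treating naive datetimes as UTC is an assumption rather than something the endpoint is known to require:

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

BASE_URL = "https://query1.finance.yahoo.com/v7/finance/download/{quote}"


def to_unix(dt):
    """Whole-second UNIX timestamp; naive datetimes are treated as UTC (an assumption)."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def build_csv_url(quote, start, end, interval="1d"):
    """Assemble the download URL with the same parameters as the snippet above."""
    params = {
        "period1": to_unix(start),
        "period2": to_unix(end),
        "interval": interval,
        "events": "history",
    }
    return BASE_URL.format(quote=quote) + "?" + urlencode(params)


url = build_csv_url("AAPL", datetime(2019, 9, 27, 0, 0), datetime(2020, 9, 27, 23, 59))
print(url)
```

The resulting URL can be passed to `requests.get()` exactly as in the snippets above.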

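【Comments】:

  • Yes, this is the answer I was looking for. However, I ran into some trouble executing the script. When I run it, I get this error: from_ = datetime(2019,9,27,0,0).strftime('%s') ValueError: Invalid format string. By the way, I'm using python 3.7.0.
  • @MITHU See my edit. Basically, you need to convert the datetime to a UNIX timestamp.
  • Yes, I noticed. Thanks a lot.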