【Posted】: 2020-05-19 19:55:58
【Question】:
I am working on the following code to scrape financial data from a specific website.
import requests
import pandas as pd

urls = ['https://www.marketwatch.com/investing/stock/aapl/financials/cash-flow',
        'https://www.marketwatch.com/investing/stock/aapl/financials/cash-flow/quarter',
        'https://www.marketwatch.com/investing/stock/MSFT/financials/cash-flow',
        'https://www.marketwatch.com/investing/stock/MSFT/financials/cash-flow/quarter']


def main(urls):
    # Reuse one HTTP session for all requests
    with requests.Session() as req:
        goal = []
        for url in urls:
            r = req.get(url)
            # Take the first table containing the dividends row,
            # then keep its first row and columns 3 to 5
            df = pd.read_html(
                r.content, match="Cash Dividends Paid - Total")[0].iloc[[0], 3:6]
            goal.append(df)
        # Stack the per-URL rows into a single frame
        new = pd.concat(goal)
        print(new)


main(urls)
It returns the information I need:
       2017      2018      2019 30-Sep-2019 31-Dec-2019 31-Mar-2020
0  (12.77B)  (13.71B)  (14.12B)         NaN         NaN         NaN
0       NaN       NaN       NaN     (3.48B)     (3.54B)     (3.38B)
0  (11.85B)   (12.7B)  (13.81B)         NaN         NaN         NaN
0       NaN       NaN       NaN     (3.51B)     (3.89B)     (3.88B)
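I need to look up at least 20 companies (all from the same source). The URLs are essentially identical except for one element, which I will call index:
'https://www.marketwatch.com/investing/stock/' + index + '/financials/cash-flow'
Is there a way to add a variable named Index and iterate over it, something like: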
import requests
import pandas as pd
Index = 'MSFT, AAPL'
and
urls = ['https://www.marketwatch.com/investing/stock/' + Index + '/financials/cash-flow',
        'https://www.marketwatch.com/investing/stock/' + Index + '/financials/cash-flow/quarter']
【Discussion】:
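One possible approach, as a minimal sketch rather than a definitive answer: keep the tickers in a list and build both the annual and the quarterly URL for each one in a loop. The names `BASE` and `build_urls` are hypothetical helpers introduced here for illustration; the comma-separated string mirrors the `Index` value from the question and is split into individual tickers first, since concatenating the whole string into the URL would produce a single invalid address.

```python
# Hypothetical URL template; {ticker} is filled in per company.
BASE = "https://www.marketwatch.com/investing/stock/{ticker}/financials/cash-flow"


def build_urls(tickers):
    """Return the annual and quarterly cash-flow URLs for each ticker."""
    urls = []
    for ticker in tickers:
        annual = BASE.format(ticker=ticker)
        urls.append(annual)               # annual statement
        urls.append(annual + "/quarter")  # quarterly statement
    return urls


# Assumed input: a comma-separated string like the one in the question.
index = "MSFT, AAPL"
tickers = [t.strip() for t in index.split(",") if t.strip()]

urls = build_urls(tickers)
for url in urls:
    print(url)
```

The resulting `urls` list has the same shape as the hand-written one in the question, so it should be usable with `main(urls)` unchanged. Since every row in the combined output has index 0, it may also help to record which ticker each row came from, for example by passing `keys=` to `pd.concat`.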
Tags: python python-3.x web-scraping beautifulsoup scrape