[Question title]: Scraping multiple (market index) sites with BeautifulSoup
[Posted]: 2020-05-19 19:55:58
[Question]:

I am working on the following code to scrape financial data from a specific website.

import requests
import pandas as pd


urls = ['https://www.marketwatch.com/investing/stock/aapl/financials/cash-flow',
        'https://www.marketwatch.com/investing/stock/aapl/financials/cash-flow/quarter',
        'https://www.marketwatch.com/investing/stock/MSFT/financials/cash-flow',
        'https://www.marketwatch.com/investing/stock/MSFT/financials/cash-flow/quarter']


def main(urls):
    with requests.Session() as req:
        goal = []
        for url in urls:
            r = req.get(url)
            df = pd.read_html(
                r.content, match="Cash Dividends Paid - Total")[0].iloc[[0], 3:6]
            goal.append(df)
        new = pd.concat(goal)
        print(new)


main(urls)

It returns the information I need:

      2017      2018      2019 30-Sep-2019 31-Dec-2019 31-Mar-2020
0  (12.77B)  (13.71B)  (14.12B)         NaN         NaN         NaN
0       NaN       NaN       NaN     (3.48B)     (3.54B)     (3.38B)
0  (11.85B)   (12.7B)  (13.81B)         NaN         NaN         NaN
0       NaN       NaN       NaN     (3.51B)     (3.89B)     (3.88B)

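I need to scrape at least 20 companies (all from the same source). The URLs are essentially identical except for one element, which I will call index:

'https://www.marketwatch.com/investing/stock/' + index + '/financials/cash-flow'

Is there a way to add a variable called Index and iterate over it? Something like: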

import requests
import pandas as pd
Index = 'MSFT, AAPL'

urls = ['https://www.marketwatch.com/investing/stock/' + Index + '/financials/cash-flow',
        'https://www.marketwatch.com/investing/stock/' + Index + '/financials/cash-flow/quarter']

[Comments]:

    Tags: python python-3.x web-scraping beautifulsoup scrape


    [Solution 1]:

    A simple solution: use a nested loop together with string formatting to build the required URLs.

    For example:

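    import requests
    import pandas as pd

    # Ticker symbols to substitute into the URL templates below.
    indexes = 'aapl', 'MSFT', 'F'

    def main(indexes):
        # {index} is filled in with each ticker symbol via str.format.
        urls = ['https://www.marketwatch.com/investing/stock/{index}/financials/cash-flow',
                'https://www.marketwatch.com/investing/stock/{index}/financials/cash-flow/quarter']
        goal = []

        with requests.Session() as req:
            # Outer loop: tickers; inner loop: annual and quarterly pages.
            for index in indexes:
                for url in urls:
                    url = url.format(index=index)
                    print('Processing url', url)
                    r = req.get(url)
                    # First matching table: keep its first row and columns 3 to 5.
                    df = pd.read_html(
                        r.content, match="Cash Dividends Paid - Total")[0].iloc[[0], 3:6]
                    goal.append(df)
            # Stack the one-row frames from every page into a single DataFrame.
            new = pd.concat(goal)
            print(new)

    main(indexes)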
    

    Prints:

    Processing url https://www.marketwatch.com/investing/stock/aapl/financials/cash-flow
    Processing url https://www.marketwatch.com/investing/stock/aapl/financials/cash-flow/quarter
    Processing url https://www.marketwatch.com/investing/stock/MSFT/financials/cash-flow
    Processing url https://www.marketwatch.com/investing/stock/MSFT/financials/cash-flow/quarter
    Processing url https://www.marketwatch.com/investing/stock/F/financials/cash-flow
    Processing url https://www.marketwatch.com/investing/stock/F/financials/cash-flow/quarter
           2017      2018      2019 30-Sep-2019 31-Dec-2019 31-Mar-2020
    0  (12.77B)  (13.71B)  (14.12B)         NaN         NaN         NaN
    0       NaN       NaN       NaN     (3.48B)     (3.54B)     (3.38B)
    0  (11.85B)   (12.7B)  (13.81B)         NaN         NaN         NaN
    0       NaN       NaN       NaN     (3.51B)     (3.89B)     (3.88B)
    0   (2.58B)   (2.91B)   (2.39B)         NaN         NaN         NaN
    0       NaN       NaN       NaN      (598M)      (595M)      (596M)
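
Every row in the output above carries the index 0, so it is hard to tell which ticker or period a row belongs to. The following is a minimal sketch of one way to label the rows. It also accepts a comma-separated string like the `'MSFT, AAPL'` from the question. The helper names (`parse_tickers`, `build_urls`, `scrape`) and the period labels `annual` and `quarterly` are hypothetical, introduced only for illustration. It assumes the MarketWatch pages still contain a table matching "Cash Dividends Paid - Total"; if the markup has changed, `pd.read_html` raises a `ValueError`.

```python
from io import StringIO
from itertools import product

# URL template taken from the question; {suffix} selects annual vs quarterly.
BASE = "https://www.marketwatch.com/investing/stock/{ticker}/financials/cash-flow{suffix}"
# Hypothetical labels: "" is the annual page, "/quarter" the quarterly one.
PERIODS = {"annual": "", "quarterly": "/quarter"}


def parse_tickers(text):
    """Split a comma-separated string such as 'MSFT, AAPL' into clean symbols."""
    return [part.strip() for part in text.split(",") if part.strip()]


def build_urls(tickers):
    """Return {(ticker, period): url} for every ticker/period combination."""
    return {
        (ticker, period): BASE.format(ticker=ticker, suffix=suffix)
        for ticker, (period, suffix) in product(tickers, PERIODS.items())
    }


def scrape(tickers, match="Cash Dividends Paid - Total"):
    """Fetch each page and stack the selected rows, labelled by ticker and period."""
    # Imported here so the URL helpers above work without the third-party packages.
    import pandas as pd
    import requests

    frames = {}
    with requests.Session() as session:
        for key, url in build_urls(tickers).items():
            response = session.get(url)
            # Same table selection as in the answer; StringIO hands the HTML
            # to read_html as a file-like object.
            tables = pd.read_html(StringIO(response.text), match=match)
            frames[key] = tables[0].iloc[[0], 3:6]
    # The (ticker, period) keys become outer index levels of the result.
    return pd.concat(frames, names=["ticker", "period"])
```

Calling `scrape(parse_tickers('MSFT, AAPL'))` would request four pages and return one DataFrame whose index shows the ticker and period for each row. Adding more companies then only means extending the ticker string or list.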
    
