【Question title】: BeautifulSoup: Scraping CSV list of URLs
【Posted】: 2020-09-02 10:44:46
【Question description】:

I have been trying to download data from several URLs and save it to a CSV file.

The idea is to extract the highlighted data from: https://www.marketwatch.com/investing/stock/MMM/financials/cash-flow

So far I have built the following code:

import pandas as pd
from bs4 import BeautifulSoup
import urllib.request as ur

url_is = 'https://www.marketwatch.com/investing/stock/MMM/financials/cash-flow'
read_data = ur.urlopen(url_is).read()
soup_is=BeautifulSoup(read_data, 'lxml')
row = soup_is.select_one('tr.mainRow>td.rowTitle:contains("Cash Dividends Paid - Total")')
data=[cell.text for cell in row.parent.select('td') if cell.text!='']
df=pd.DataFrame(data)
print(df.T)

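I get an output:

So far so good.

My idea now is to extract a specific class from multiple URLs, keep the same headers as on the website, and export the result to a .csv file.

The tags and classes stay the same across pages.

Example URLs:

https://www.marketwatch.com/investing/stock/MMM/financials/cash-flow
https://www.marketwatch.com/investing/stock/aapl/financials/cash-flow

Code (I want to try two columns: 2015 and 2016)

As the expected output, I would like something like this:

I wrote the following code, but it is giving me problems. Any help or advice is welcome: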

import pandas as pd
from bs4 import BeautifulSoup
import urllib.request as ur
import numpy as np
import requests


links = ['https://www.marketwatch.com/investing/stock/aapl/financials/cash-flow', 'https://www.marketwatch.com/investing/stock/MMM/financials/cash-flow']

container = pd.DataFrame(columns=['Name', 'Name2'])
pos=0
for l in links:
    read_data = ur.urlopen(l).read()
    soup_is=BeautifulSoup(read_data, 'lxml')
    row = soup_is.select_one('tr.mainRow>td.rowTitle:contains("Cash Dividends Paid - Total")')
    results=[cell.text for cell in row.parent.select('td') if cell.text!='']
    records = []

    for result in results:
      records = []
      Name = result.find('span', attrs={'itemprop':'2015'}).text if result.find('span', attrs={'itemprop':'2015'}) is not None else ''

      Name2 = result.find('span', attrs={'itemprop':'2016'}).text if result.find('span', attrs={'itemprop':'2016'}) is not None else ''

      records.append(Name)
      records.append(Name2)

      container.loc[pos] = records
      pos+=1

【Question discussion】:

    Tags: python python-3.x web-scraping beautifulsoup export-to-csv


    【Solution 1】:
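    import requests
    import pandas as pd
    
    urls = ['https://www.marketwatch.com/investing/stock/aapl/financials/cash-flow',
            'https://www.marketwatch.com/investing/stock/MMM/financials/cash-flow']
    
    
    def main(urls):
        # Reuse one HTTP session for every request.
        with requests.Session() as req:
            goal = []
            for url in urls:
                r = req.get(url)
                # Parse only tables containing the target label, take the first
                # such table, then keep its first row and first three columns.
                df = pd.read_html(
                    r.content, match="Cash Dividends Paid - Total")[0].iloc[[0], 0:3]
                goal.append(df)
            # Stack the one-row frames from all URLs into a single frame.
            new = pd.concat(goal)
            print(new)
    
    
    main(urls)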
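
The solution prints the combined frame but does not yet write the .csv file the question asks for, and the stacked rows do not say which company they belong to. A minimal sketch of that last step follows. The helper names `ticker_from_url` and `label_and_export`, the `Ticker` column, and the `Item` header are illustrative assumptions, and the frames are placeholders standing in for what `pd.read_html` would return (the cell values are dummies, not real figures).

```python
import io
from urllib.parse import urlparse

import pandas as pd


def ticker_from_url(url):
    """Return the upper-cased ticker from a .../stock/<ticker>/... URL."""
    parts = urlparse(url).path.strip("/").split("/")
    return parts[parts.index("stock") + 1].upper()


def label_and_export(urls, frames, target):
    """Add a Ticker column to each frame, stack them, and write CSV to target."""
    labelled = []
    for url, df in zip(urls, frames):
        df = df.copy()
        df.insert(0, "Ticker", ticker_from_url(url))
        labelled.append(df)
    combined = pd.concat(labelled, ignore_index=True)
    # target may be a file path or any writable text buffer.
    combined.to_csv(target, index=False)
    return combined


urls = ['https://www.marketwatch.com/investing/stock/aapl/financials/cash-flow',
        'https://www.marketwatch.com/investing/stock/MMM/financials/cash-flow']

# Placeholder one-row frames; "Item" and the cell values are made up.
frames = [
    pd.DataFrame([["Cash Dividends Paid - Total", "a2015", "a2016"]],
                 columns=["Item", "2015", "2016"]),
    pd.DataFrame([["Cash Dividends Paid - Total", "m2015", "m2016"]],
                 columns=["Item", "2015", "2016"]),
]

buffer = io.StringIO()
combined = label_and_export(urls, frames, buffer)
print(buffer.getvalue())
```

In the solution's `main`, the `goal` list could be passed as `frames` and a file name such as `"dividends.csv"` as `target`.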
    

    【Discussion】:
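
      A likely cause of the problem in the question's second snippet: `results` is a list of plain strings (each `cell.text`), so `result.find('span', attrs=...)` calls `str.find`, which does not accept keyword arguments and raises a `TypeError`. The sketch below keeps the BeautifulSoup approach but returns the row's cell texts directly. The function name `row_cells` is hypothetical, and `SAMPLE_HTML` is a small stand-in whose class names follow the selector in the question; its cell values are placeholders, not real figures.

```python
from bs4 import BeautifulSoup

# Stand-in markup so the example runs without a network request.
SAMPLE_HTML = """
<table>
  <tbody>
    <tr class="mainRow">
      <td class="rowTitle">Cash Dividends Paid - Total</td>
      <td>v2015</td>
      <td>v2016</td>
      <td></td>
    </tr>
  </tbody>
</table>
"""


def row_cells(html, label):
    """Return the non-empty cell texts of the first mainRow whose title contains label."""
    soup = BeautifulSoup(html, "html.parser")
    for title in soup.select("tr.mainRow > td.rowTitle"):
        if label in title.get_text():
            texts = (td.get_text(strip=True) for td in title.parent.select("td"))
            return [text for text in texts if text]
    # No matching row: return an empty list instead of failing on None.
    return []


print(row_cells(SAMPLE_HTML, "Cash Dividends Paid - Total"))
```

      Matching the label in Python rather than with the `:contains` pseudo-class avoids depending on a particular selector-engine version, and returning an empty list makes a missing row easy to skip inside the loop over `links`.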
