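【Question title】: Scraping data attribute with BeautifulSoup
【Posted】: 2020-08-09 16:26:07
【Question description】:

I am learning web scraping with BeautifulSoup. The goal is to extract numbers from a financial website for my own evaluation. This is what I have done so far: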

import bs4
import requests

# Download the page and parse the returned HTML
r = requests.get('https://www.finnomena.com/stock/CPALL')
html_page = bs4.BeautifulSoup(r.text, 'html.parser')

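I then tried to use find/find_all to extract the number at the end of each row (3.69, 3.60, 0.31, and so on), but I don't know how to reference this data because I have never seen elements in this format before: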

<div data-v-30581cd9="" class="data-wrapper sub-topic first-sub-topic">
  <div data-v-30581cd9="" class="data-each">3.69</div>
  <div data-v-30581cd9="" class="data-each">3.60</div>
  <div data-v-30581cd9="" class="data-each">0.31</div>
  <div data-v-30581cd9="" class="data-each">10.26</div>
  <div data-v-30581cd9="" class="data-each">1.58</div>
  <div data-v-30581cd9="" class="data-each">4.73</div>
  <div data-v-30581cd9="" class="data-each">2.64</div>
  <div data-v-30581cd9="" class="data-each">-3.31</div>
  <div data-v-30581cd9="" class="data-each">10.49</div>
  <div data-v-30581cd9="" class="data-each">6.83</div>
  <div data-v-30581cd9="" class="data-each">7.38</div>
    .
    .
    .
  <div data-v-30581cd9="" class="data-each">4.88</div>
  <div data-v-30581cd9="" class="data-each">-1.40</div>
  <div data-v-30581cd9="" class="data-each"></div>
</div>

I have looked through older threads and done some research, but could not find what I need. How can I extract these values?

【Question discussion】:

Tags: python html web-scraping beautifulsoup


【Solution 1】:
# `soup` is the parsed page (called html_page in the question)
elements = soup.find_all("div", class_="data-each")
text = [i.text for i in elements]
# text is now a list with the text of each matching div, i.e. 3.69, 3.60, ...

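【Discussion】:

  • Hello, thank you very much for your answer, but I had already tried that and ran print(text). The result is "[ ]". I'm not sure what I'm missing.
  • It works fine. Make sure you have loaded the correct data.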
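
One way to check the "correct data" point is to count the matching divs in the HTML that was actually parsed. The sketch below is only illustrative: count_data_each is a hypothetical helper, and both HTML strings are made-up stand-ins (one resembling the markup shown in the question, one resembling a page whose values have not yet been filled in by JavaScript), not real responses from the site.

```python
from bs4 import BeautifulSoup


def count_data_each(html):
    """Return how many <div class="data-each"> elements the given HTML contains."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("div", class_="data-each"))


# Stand-in for the markup seen in the browser's inspector (assumption).
rendered = (
    '<div class="data-wrapper sub-topic first-sub-topic">'
    '<div class="data-each">3.69</div>'
    '<div class="data-each">3.60</div>'
    "</div>"
)

# Stand-in for a page shell before any script has run (assumption).
unrendered = '<div id="app"></div>'

print(count_data_each(rendered))
print(count_data_each(unrendered))
```

If the same count on r.text from the question comes back as 0, the divs are simply not in the downloaded HTML, which would explain the empty list.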
【Solution 2】:
import bs4

# Parse a copy of the page that was saved locally as data.html
with open('data.html', 'r') as fh:
    html_page = bs4.BeautifulSoup(fh, 'html.parser')

elements = html_page.find_all("div", class_="data-each")

# Collect the text of every matching div
values = [element.text for element in elements]

print(values)

I saved the page as data.html on my laptop. I hope this solves your problem.
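
The scraped values are strings, and the last data-each div in the question is empty, so a small conversion step may be useful before doing any arithmetic. This is a minimal sketch: to_floats is a hypothetical helper, and the sample list just reuses a few values from the question instead of reading data.html.

```python
def to_floats(texts):
    """Convert scraped cell texts to floats, skipping blank cells."""
    numbers = []
    for raw in texts:
        cleaned = raw.strip()
        if cleaned:  # ignore empty cells such as the last div in the question
            numbers.append(float(cleaned))
    return numbers


# A few values copied from the question, plus the empty trailing cell.
values = ["3.69", "3.60", "0.31", "-3.31", ""]
print(to_floats(values))
```

Note that float() raises ValueError on text that is not a number, so placeholder cells other than blanks would need their own handling.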

【Discussion】:

  • This prints a list of the values you need.
【Solution 3】:

The data you see on the page is loaded via Ajax from a different URL. You can fetch it with the requests/json modules:

import json
import requests
from bs4 import BeautifulSoup


url = 'https://www.finnomena.com/stock/CPALL'
api_url = 'https://www.finnomena.com/fn3/api/stock/financial'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')
securityID = soup.select_one('#sec-id').text
data = requests.get(api_url, params={'securityID': securityID}).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for d in data['data']:
    print(json.dumps(d, indent=4))
    print('-' * 80)

Prints:

...

--------------------------------------------------------------------------------
{
    "SecurityID": 4086,
    "Fiscal": 2019,
    "Quarter": 2,
    "Cash": "31370218.00000",
    "DA": "2730218.00000",
    "DebtToEquity": "3.2650",
    "Equity": "83260746.00000",
    "EarningPerShare": "0.51000",
    "EarningPerShareYoY": "2.0000",
    "EarningPerShareQoQ": "-16.3900",
    "GPM": "21.7900",
    "GrossProfit": "31214912.00000",
    "NetProfit": "4794614.00000",
    "NetProfitYoY": "0.3200",
    "NetProfitQoQ": "-16.8900",
    "NPM": "3.3500",
    "Revenue": "143237802.00000",
    "RevenueYoY": "10.4300",
    "RevenueQoQ": "3.1300",
    "ROA": "0.0129",
    "ROE": "0.0553",
    "SGA": "28848813.00000",
    "SGAPerRevenue": "20.1400",
    "TotalDebt": "271842760.00000",
    "DividendYield": "1.40",
    "BookValuePerShare": "10.04",
    "Close": "86.00",
    "MKTCap": "772546715.92800",
    "PriceEarningRatio": "36.30",
    "PriceBookValue": "8.57",
    "EVPerEbitDA": "23.44566",
    "EbitDATTM": "43207107.37300",
    "PaidUpCapital": "8983101.00000",
    "CashCycle": "-39.09884",
    "OperatingActivities": "8059168.00000",
    "InvestingActivities": "-2663473.00000",
    "FinancingActivities": "-9791401.00000",
    "Asset": "369666475.00000"
}
--------------------------------------------------------------------------------

...
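
Most numeric fields in the record above arrive as strings ("0.51000", "4794614.00000"), so they usually need converting before any calculation. The following is a minimal sketch, assuming plain floats are precise enough for your purposes; to_numbers is a hypothetical helper, and sample is a trimmed copy of the record printed above, not a fresh API response.

```python
import json

# Trimmed copy of the record shown above.
sample = json.loads("""
{
    "SecurityID": 4086,
    "Fiscal": 2019,
    "Quarter": 2,
    "EarningPerShare": "0.51000",
    "NetProfit": "4794614.00000",
    "ROE": "0.0553"
}
""")


def to_numbers(record):
    """Return a copy of record with numeric-looking strings converted to float."""
    converted = {}
    for key, value in record.items():
        if isinstance(value, str):
            try:
                converted[key] = float(value)
            except ValueError:
                converted[key] = value  # keep non-numeric strings unchanged
        else:
            converted[key] = value  # ints and other types pass through
    return converted


row = to_numbers(sample)
print(row["Fiscal"], row["Quarter"], row["EarningPerShare"], row["NetProfit"])
```

If exact decimal values matter, decimal.Decimal from the standard library could be swapped in for float.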

【Discussion】:

  • Thank you very much! I will read up more on JSON and Ajax to build up my knowledge.