【Question title】: Scrape HTML using BeautifulSoup
【Posted】: 2020-03-24 03:16:05
【Question description】:

I have been trying to extract the data from the table below using the following code.

Link , Wanted Data

import requests
from bs4 import BeautifulSoup

test = []

page = requests.get('http://www.thaibma.or.th/EN/BondInfo/BondFeature/Issue.aspx?symbol=ba891dbb-f614-e711-b77e-78e3b51dab3c')
soup = BeautifulSoup(page.text, 'html.parser')
# Collect the first text node of every <p> tag.
finddata = soup.find_all('p')
for i in finddata:
    test.append(i.find(string=True))

print(test)

All the information I want is inside "p" tags, but when I run this code the text comes out blank.

Is there any other tool I can use to extract this data?

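【Comments】:

  • The table is empty when you first open the page; that is why.
  • Yes, the data is generated dynamically by JavaScript. See this answer for an explanation --> stackoverflow.com/questions/45448994/…
  • Not sure whether I'm doing this right. I tried following the shared link but still couldn't solve it @dzakyputra

Tags: python python-3.x beautifulsoup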


【Solution 1】:

The website relies on a JavaScript event that renders its data dynamically after the page has loaded.
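
To make the point above concrete, here is a minimal offline sketch. `RAW_HTML` is a made-up stand-in for what the server sends (it is not the real page), and the `ParagraphText` class is a hypothetical helper. It uses the standard library's `html.parser`, the same parser BeautifulSoup is asked to use via `'html.parser'`, so it runs without a network connection. The `<p>` tags are found, but they carry no text because nothing ever executes the script.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the raw HTML: the <p> tags exist but are empty,
# and a script would fill them in later inside a real browser.
RAW_HTML = """
<html><body>
  <p id="symbol"></p>
  <p id="issuer"></p>
  <script>
    document.getElementById("symbol").textContent = "BANPU20O22A";
  </script>
</body></html>
"""


class ParagraphText(HTMLParser):
    """Collect the text found inside each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []      # one string per <p> element
        self._inside_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._inside_p = False

    def handle_data(self, data):
        # Script source is also reported as data, but it sits outside any <p>.
        if self._inside_p:
            self.paragraphs[-1] += data


parser = ParagraphText()
parser.feed(RAW_HTML)
parser.close()
print(parser.paragraphs)  # every <p> is present, and every one is empty
```

A parser only reads markup; filling those tags requires either a real browser engine or the data source the script itself calls.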

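The requests library cannot render JavaScript on the fly, so you can use selenium or requests_html instead. There are in fact many modules that can do this.

Here, though, we have another option for the table: tracing where the data is rendered from. I was able to find the XHR request that retrieves the data from the back-end API and renders it on the user's side.

You can find the XHR request by opening Developer Tools, going to the Network tab, and checking the XHR/JS requests, depending on the call type (for example, fetch).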
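
Once the request shows up in the Network tab, its URL can be compared with the page URL. For the page used in this answer, the captured request goes to `/issue/feature` with a `Symbol` parameter holding the page's `symbol` GUID in upper case. The sketch below derives one URL from the other. `build_api_url` is a hypothetical helper, and the pattern is inferred from that single captured request, so it is an assumption that it holds for other symbols or that the server requires upper case.

```python
from urllib.parse import urlparse, parse_qs, urlencode

# Assumption: endpoint path and parameter name copied from the one XHR
# request captured for this page; other pages may behave differently.
API_ENDPOINT = "http://www.thaibma.or.th/issue/feature"


def build_api_url(page_url: str) -> str:
    """Hypothetical helper: map an Issue.aspx page URL to the XHR URL."""
    query = parse_qs(urlparse(page_url).query)
    symbols = query.get("symbol")
    if not symbols:
        raise ValueError("page URL has no 'symbol' query parameter")
    # The captured request sends the GUID in upper case; mirror that.
    return API_ENDPOINT + "?" + urlencode({"Symbol": symbols[0].upper()})


page_url = ("http://www.thaibma.or.th/EN/BondInfo/BondFeature/Issue.aspx"
            "?symbol=2dd6bca6-2543-ea11-a2f0-959434d0c31a")
api_url = build_api_url(page_url)
print(api_url)
```

The same helper could be pointed at the URL from the question (a different `symbol` value) to get the matching API address, subject to the assumption above.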

import requests
from bs4 import BeautifulSoup
import json


with requests.Session() as req:
    r = req.get(
        "http://www.thaibma.or.th/EN/BondInfo/BondFeature/Issue.aspx?symbol=2dd6bca6-2543-ea11-a2f0-959434d0c31a")
    soup = BeautifulSoup(r.content, 'html.parser')
    token = soup.find("input", id="token").get("value")
    time = soup.find("input", id="time").get("value")
    headers = {
        'Token': token,
        'timestamp': time,
        'X-Requested-With': 'XMLHttpRequest',
        'Referer': 'http://www.thaibma.or.th/EN/BondInfo/BondFeature/Issue.aspx?symbol=2dd6bca6-2543-ea11-a2f0-959434d0c31a'
    }
    r = req.get(
        "http://www.thaibma.or.th/issue/feature?Symbol=2DD6BCA6-2543-EA11-A2F0-959434D0C31A", headers=headers).json()
    print(json.dumps(r, indent=4)) # to see the output in nice format.
    print("*" * 10)
    print(r.keys()) # you can access whatever as it's JSON dict now.

Output:

{
    "IssueID": "2dd6bca6-2543-ea11-a2f0-959434d0c31a",
    "IssueLegacyId": 76734,
    "Symbol": "BANPU20O22A",
    "SymbolTitle": "BANPU20O22A : Bill of Exchange of BANPU PUBLIC COMPANY LIMITED worth of THB 1,500.00 mln. due October 22, 2020 (BANPU20O22A)",
    "RegistrationDate": "2020-01-30T00:00:00",
    "IssueNameTh": "\u0e15\u0e31\u0e4b\u0e27\u0e41\u0e25\u0e01\u0e40\u0e07\u0e34\u0e19 \u0e1a\u0e23\u0e34\u0e29\u0e31\u0e17 \u0e1a\u0e49\u0e32\u0e19\u0e1b\u0e39 \u0e08\u0e33\u0e01\u0e31\u0e14 (\u0e21\u0e2b\u0e32\u0e0a\u0e19) \u0e21\u0e39\u0e25\u0e04\u0e48\u0e32 1,500.00 \u0e25\u0e49\u0e32\u0e19\u0e1a\u0e32\u0e17 \u0e04\u0e23\u0e1a\u0e01\u0e33\u0e2b\u0e19\u0e14\u0e44\u0e16\u0e48\u0e16\u0e2d\u0e19\u0e27\u0e31\u0e19\u0e17\u0e35\u0e48 22 \u0e15\u0e38\u0e25\u0e32\u0e04\u0e21 2563 (BANPU20O22A)",
    "IssueNameEn": "BANPU PUBLIC COMPANY LIMITED",
    "IsinTh": "0",
    "IsinEn": "0",
    "ClaimNameEn": "Senior",
    "SecureType": "Unsecured",
    "PrincipalPayment": "",
    "SustainabilityGoal": "",
    "CurrencyCode": "THB",
    "InitialPar": 1000.0,
    "CurrentPar": 1000.0,
    "IssueSize": 1500.0,
    "OutstandingSize": 1500.0,
    "IssuedDate": "2020-01-30T00:00:00",
    "MaturityDate": "2020-10-22T00:00:00",
    "IssueTerm": 0.7287671232876712,
    "CouponFrequencyNameEn": "At Maturity",
    "AccrualBasisNameEn": "Actual/365",
    "EmbbeddedOption": "-",
    "DistributionNameEn": "Institutional Investors",
    "CollateralRemark": "-",
    "IssueRemark": "Please be informed that the number shown in the \"Initial Par\" and \"Current Par\" do not represent the correct number.",
    "RiskLevelId": "6a8573d4-906a-ea11-a2f1-dca009a9f3d7",
    "RiskLevel": 3,
    "ProspectusId": null,
    "issuer_id": "ac90981d-e5f8-e111-93f5-78e3b51dab3c",
    "issuer_code": "BANPU"
}
**********
dict_keys(['IssueID', 'IssueLegacyId', 'Symbol', 'SymbolTitle', 'RegistrationDate', 'IssueNameTh', 'IssueNameEn', 'IsinTh', 'IsinEn', 'ClaimNameEn', 'SecureType', 'PrincipalPayment', 'SustainabilityGoal', 'CurrencyCode', 'InitialPar', 'CurrentPar', 'IssueSize', 'OutstandingSize', 'IssuedDate', 'MaturityDate', 'IssueTerm', 'CouponFrequencyNameEn', 'AccrualBasisNameEn', 'EmbbeddedOption', 'DistributionNameEn', 'CollateralRemark', 'IssueRemark', 'RiskLevelId', 'RiskLevel', 'ProspectusId', 'issuer_id', 'issuer_code'])
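
Since the response is a plain dict, individual fields can be read directly. The sketch below is offline: `issue` holds a handful of values copied from the output above rather than a live response, and the names `summary` and `term_years` are illustrative only. It parses the two date strings and checks that the day count divided by 365 agrees with the reported `IssueTerm`, which is consistent with the `Actual/365` accrual basis in the same output.

```python
from datetime import datetime

# A few fields copied from the output above; the real response has more keys.
issue = {
    "Symbol": "BANPU20O22A",
    "IssueNameEn": "BANPU PUBLIC COMPANY LIMITED",
    "CurrencyCode": "THB",
    "IssueSize": 1500.0,
    "IssuedDate": "2020-01-30T00:00:00",
    "MaturityDate": "2020-10-22T00:00:00",
    "IssueTerm": 0.7287671232876712,
    "ProspectusId": None,  # JSON null becomes None
}

# The date fields are ISO 8601 strings, so fromisoformat can parse them.
issued = datetime.fromisoformat(issue["IssuedDate"])
maturity = datetime.fromisoformat(issue["MaturityDate"])
days_to_maturity = (maturity - issued).days
term_years = days_to_maturity / 365

# Pick out only the fields of interest; .get() tolerates missing keys.
summary = {
    "symbol": issue.get("Symbol"),
    "issuer": issue.get("IssueNameEn"),
    "size": f'{issue.get("IssueSize")} {issue.get("CurrencyCode")}',
    "days_to_maturity": days_to_maturity,
    "has_prospectus": issue.get("ProspectusId") is not None,
}
print(summary)
print(abs(term_years - issue["IssueTerm"]) < 1e-9)
```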

【Comments】:

  • Thank you!! This is really helpful!