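【Question title】: AJAX web scraping using Python Requests
【Posted】: 2021-10-16 11:04:33
【Question】:

I am trying to scrape this website, but I am not getting the table data. I even copied the request data from Chrome DevTools, but I cannot figure out what I am doing wrong.

Here is my script: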

import requests,json
url='https://www.assetmanagement.hsbc.de/api/v1/nav/funds'
payload={"appliedFilters":[[{"active":True,"id":"Yes"}]],"paging":{"fundsPerPage":-1,"currentPage":1},"view":"Documents","searchTerm":[],"selectedValues":[],"pageInformation":{"country":"DE","language":"DE","investorType":"INST","tokenIssue":{"url":"/api/v1/token/issue"},"dataUrl":{"url":"/api/v1/nav/funds","id":"e0FFNDg5MTJELUFEMzEtNEQ5RC04MzA4LTdBQzZERTgyQTc4Rn0="},"shareClassUrl":{"url":"/api/v1/nav/shareclass","id":"ezUxODdjODJiLWY1YmItNDIzOC1hM2Y0LWY5NzZlY2JmMTU3OX0="},"filterUrl":{"url":"/api/v1/nav/filters","id":"ezRFREYxQTU3LTVENkYtNDBDRC1CMjJDLTQ0NDc4Nzc1NTlFQn0="},"presentationUrl":{"url":"/api/v1/nav/presentation","id":"e0E1NEZDODZGLUE5MDctNDUzQi04RTYyLTIxNDNBMEM1MEVGQ30="},"liveDataUrl":{"id":"ezlEMjA2MDk5LUNCRTItNENGMy1BRThBLUM0RTMwMEIzMjlDQ30="},"fundDetailPageUrl":"/de/institutional-investors/fund-centre","forceHttps":True}}
headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"}
r = requests.post(url,headers=headers,data=payload)
print(r.content)

【Comments】:

    Tags: python ajax web-scraping python-requests


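    【Answer 1】:

    The request was originally missing the IFC-Cache-Header HTTP header, and there is also a JWT token that must be passed in the Authorization header.

    To retrieve this token, you first need to extract some values from the root page: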

    GET https://www.assetmanagement.hsbc.de/de/institutional-investors/fund-centre
    

    The page contains the following JavaScript object:

    window.HSBC.dpas = {
        "pageInformation": {
            "country": "X", <========= HERE
            "language": "X", <========= HERE
            "tokenIssue": {
                "url": "/api/v1/token/issue",
            },
            "dataUrl": {
                "url": "/api/v1/nav/funds",
                "id": "XXXXXXXXXXXXXXXXXXXXXXXXXXXX" <========= HERE
            },
            ....
        }
    }
    

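    You can extract the value of the window.HSBC.dpas JavaScript object with a regular expression, then reformat the string so that it becomes valid JSON (the object literal contains trailing commas, which JSON does not allow).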
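
    A minimal, offline sketch of that extraction step follows. The SAMPLE_HTML string, its ID value, and the function names are made up for illustration, and the trailing-comma pattern is a simpler stand-in rather than the exact regex used in the full example. It assumes the object contains no ";" and no string value with a comma directly before a closing bracket.

```python
import json
import re

# Hypothetical page fragment; the real page is larger and the ID differs.
SAMPLE_HTML = """
<script>
window.HSBC.dpas = {
    "pageInformation": {
        "country": "DE",
        "language": "DE",
        "tokenIssue": {
            "url": "/api/v1/token/issue",
        },
        "dataUrl": {
            "url": "/api/v1/nav/funds",
            "id": "HYPOTHETICAL-ID"
        },
        "forceHttps": true,
    }
};
</script>
"""


def extract_js_object(html: str) -> str:
    """Return the object literal assigned to window.HSBC.dpas."""
    match = re.search(r"window\.HSBC\.dpas\s*=\s*(\{[^;]*\})\s*;", html)
    if match is None:
        raise ValueError("window.HSBC.dpas assignment not found")
    return match.group(1)


def strip_trailing_commas(text: str) -> str:
    """Drop any comma that is followed only by whitespace and a closing } or ]."""
    return re.sub(r",(?=\s*[}\]])", "", text)


data = json.loads(strip_trailing_commas(extract_js_object(SAMPLE_HTML)))
page_info = data["pageInformation"]
print(page_info["dataUrl"]["id"])
```

    If the real object ever contains a semicolon or a comma inside a string just before a closing bracket, a proper JavaScript parser would be a safer choice than regular expressions.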

    These values are then passed as HTTP headers (X-COUNTRY, X-COMPONENT, X-LANGUAGE) to the following call:

    POST https://www.assetmanagement.hsbc.de/api/v1/token/issue
    

    This call returns the JWT token directly. Add it to the next request as an Authorization header, in the form Authorization: Bearer {token}:

    POST https://www.assetmanagement.hsbc.de/api/v1/nav/funds
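
    As a rough sketch of how the headers for the two calls fit together (no network access involved): the function names and sample values below are hypothetical, while the header names and the use of dataUrl.id as X-Component follow the example in this answer.

```python
from typing import Any, Dict


def token_request_headers(page_info: Dict[str, Any]) -> Dict[str, str]:
    """Headers for the /token/issue call, built from pageInformation."""
    return {
        "X-Country": page_info["country"],
        "X-Component": page_info["dataUrl"]["id"],
        "X-Language": page_info["language"],
    }


def funds_request_headers(token: str, cache_header: str) -> Dict[str, str]:
    """Headers for the /nav/funds call once a token has been issued."""
    return {
        "IFC-Cache-Header": cache_header,
        # strip() is a precaution (an assumption) in case the body ends with a newline.
        "Authorization": f"Bearer {token.strip()}",
    }


# Hypothetical sample values, not real IDs or tokens.
sample_page_info = {
    "country": "DE",
    "language": "DE",
    "dataUrl": {"url": "/api/v1/nav/funds", "id": "HYPOTHETICAL-ID"},
}
token_headers = token_request_headers(sample_page_info)
funds_headers = funds_request_headers(
    "hypothetical.jwt.token\n", "de,de,inst,documents,yes,1,n-1"
)
print(token_headers)
print(funds_headers)
```

    Keeping the header construction in small functions like these makes it easy to request a fresh token and rebuild the headers if a later call starts failing.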
    

    Example:

    import json
    import re

    import requests

    base_url = "https://www.assetmanagement.hsbc.de"
    api_url = f"{base_url}/api/v1"
    funds_url = f"{api_url}/nav/funds"
    token_url = f"{api_url}/token/issue"

    # Call the /fund-centre page to get the component ID embedded in its JavaScript.
    # The query string is passed only through params so it is not sent twice.
    page_url = f"{base_url}/de/institutional-investors/fund-centre"
    r = requests.get(page_url, params={"f": "Yes", "n": "-1", "v": "Documents"})
    r.raise_for_status()

    # Grab the object literal assigned to window.HSBC.dpas (up to the next ";").
    res = re.search(r"^.*window\.HSBC\.dpas\s*=\s*([^;]*);", r.text, re.DOTALL)
    if res is None:
        raise RuntimeError("window.HSBC.dpas was not found in the page")
    group = res.group(1)

    # Convert to valid JSON by removing trailing commas: https://stackoverflow.com/a/56595068 (added "e")
    regex = r'''(?<=[}\]"'e]),(?!\s*[{["'])'''
    result_json = re.sub(regex, "", group)

    result = json.loads(result_json)
    page_info = result["pageInformation"]
    print(page_info["dataUrl"])

    # Call the /token/issue API to get a token.
    r = requests.post(
        token_url,
        headers={
            "X-Country": page_info["country"],
            "X-Component": page_info["dataUrl"]["id"],
            "X-Language": page_info["language"],
        },
        data={},
    )
    r.raise_for_status()
    token = r.text
    print(token)

    # Call the /nav/funds API with the token; the payload is sent as JSON.
    payload = {
        "appliedFilters": [[{"active": True, "id": "Yes"}]],
        "paging": {"fundsPerPage": -1, "currentPage": 1},
        "view": "Documents",
        "searchTerm": [],
        "selectedValues": [],
        "pageInformation": page_info,
    }
    headers = {
        "IFC-Cache-Header": "de,de,inst,documents,yes,1,n-1",
        "Authorization": f"Bearer {token}",
    }
    r = requests.post(funds_url, headers=headers, json=payload)
    print(r.content)
    


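    【Comments】:

    • Could you walk me through how you figured this out?
    • @ImranaJabeen Sure. Open the Chrome developer console, go to the Network tab, pick the API call, right-click it and choose "Copy as cURL", then run the command in a terminal. Try removing headers one at a time to narrow down which ones are required.
    • After running the code a few times it stopped working. Now it returns incomplete HTML.
    • @ImranaJabeen Please see the updated post above.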
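
    The header-elimination approach described in the comments can also be automated. The sketch below is only an illustration: minimal_headers and fake_request_succeeds are hypothetical names, and the stand-in check simply pretends that the server needs the two headers mentioned in the answer.

```python
from typing import Callable, Dict


def minimal_headers(
    headers: Dict[str, str],
    request_succeeds: Callable[[Dict[str, str]], bool],
) -> Dict[str, str]:
    """Drop headers one at a time, keeping only those the request needs."""
    required = dict(headers)
    for name in list(headers):
        trial = {k: v for k, v in required.items() if k != name}
        if request_succeeds(trial):
            # The request still works without this header, so leave it out.
            required = trial
    return required


def fake_request_succeeds(headers: Dict[str, str]) -> bool:
    """Stand-in for a real HTTP call; assumes two headers are mandatory."""
    return "Authorization" in headers and "IFC-Cache-Header" in headers


# Hypothetical headers as they might be copied from the browser.
copied = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "*/*",
    "IFC-Cache-Header": "de,de,inst,documents,yes,1,n-1",
    "Authorization": "Bearer hypothetical-token",
}
needed = minimal_headers(copied, fake_request_succeeds)
print(sorted(needed))
```

    In real use, request_succeeds would send the request with the trial headers and check the status code and the response body.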