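【Question title】: Is there a way to make a dynamic web page automatically run its JavaScript when web scraping with Python?
【Posted】: 2021-12-08 16:17:56
【Question】:

I have been running into a lot of problems trying to do some Python web scraping with BeautifulSoup. Since this particular page is dynamic, I have been trying to use Selenium first to "open" the page and then process the dynamic content with BeautifulSoup.

The problem is that the dynamic content only shows up in my HTML output if I manually scroll through the site while the program is running. Otherwise those parts of the HTML stay empty, as if I were using BeautifulSoup on its own without Selenium.

Here is my code: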

import time
from bs4 import BeautifulSoup
from selenium import webdriver

if __name__ == "__main__":

    options = webdriver.ChromeOptions()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--incognito')
    # options.add_argument('--headless')

    driver = webdriver.Chrome(r"C:\Program Files (x86)\chromedriver.exe", options=options)
    driver.get('https://coinmarketcap.com/')
    time.sleep(5)

    html = driver.page_source

    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.tbody
    trs = tbody.contents

    for tr in trs:
        print(tr)

    driver.close()

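Now, if I open Chrome through Selenium with the headless option enabled, I get the same output I would normally get without the page being preloaded. The same thing happens when I am not in headless mode and just let the page load by itself without manually scrolling through the content. Does anyone know why? Is there a way to get the dynamic content to load without manually scrolling every time I run the code?

【Question discussion】:

    Tags: python selenium beautifulsoup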


    【Solution 1】:

    The data is in fact loaded dynamically by JavaScript, so you can scrape it easily from the JSON response of the API call:

    Here is a working example:

    Code:

    import requests

    # API URL that the page itself requests (visible in the browser's Network tab)
    url = 'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&limit=100&sortBy=market_cap&sortType=desc&convert=USD,BTC,ETH&cryptoType=all&tagType=all&audited=false&aux=ath,atl,high24h,low24h,num_market_pairs,cmc_rank,date_added,max_supply,circulating_supply,total_supply,volume_7d,volume_30d'
    r = requests.get(url)

    # Each entry in 'cryptoCurrencyList' describes one coin
    for item in r.json()['data']['cryptoCurrencyList']:
        name = item['name']
        print('crypto_name:' + str(name))
    

    Output:

    crypto_name:Bitcoin
    crypto_name:Ethereum
    crypto_name:Binance Coin     
    crypto_name:Cardano
    crypto_name:Tether
    crypto_name:Solana
    crypto_name:XRP
    crypto_name:Polkadot
    crypto_name:USD Coin
    crypto_name:Dogecoin
    crypto_name:Terra
    crypto_name:Uniswap
    crypto_name:Wrapped Bitcoin  
    crypto_name:Litecoin
    crypto_name:Avalanche        
    crypto_name:Binance USD      
    crypto_name:Chainlink        
    crypto_name:Bitcoin Cash     
    crypto_name:Algorand
    crypto_name:SHIBA INU        
    crypto_name:Polygon
    crypto_name:Stellar
    crypto_name:VeChain
    crypto_name:Internet Computer
    crypto_name:Cosmos
    crypto_name:FTX Token
    crypto_name:Filecoin
    crypto_name:Axie Infinity
    crypto_name:Ethereum Classic
    crypto_name:TRON
    crypto_name:Bitcoin BEP2
    crypto_name:Dai
    crypto_name:THETA
    crypto_name:Tezos
    crypto_name:Fantom
    crypto_name:Hedera
    crypto_name:NEAR Protocol
    crypto_name:Elrond
    crypto_name:Monero
    crypto_name:Crypto.com Coin
    crypto_name:PancakeSwap
    crypto_name:EOS
    crypto_name:The Graph
    crypto_name:Flow
    crypto_name:Aave
    crypto_name:Klaytn
    crypto_name:IOTA
    crypto_name:eCash
    crypto_name:Quant
    crypto_name:Bitcoin SV
    crypto_name:Neo
    crypto_name:Kusama
    crypto_name:UNUS SED LEO
    crypto_name:Waves
    crypto_name:Stacks
    crypto_name:TerraUSD
    crypto_name:Harmony
    crypto_name:Maker
    crypto_name:BitTorrent
    crypto_name:Celo
    crypto_name:Helium
    crypto_name:OMG Network
    crypto_name:THORChain
    crypto_name:Dash
    crypto_name:Amp
    crypto_name:Zcash
    crypto_name:Compound
    crypto_name:Chiliz
    crypto_name:Arweave
    crypto_name:Holo
    crypto_name:Decred
    crypto_name:NEM
    crypto_name:Theta Fuel
    crypto_name:Enjin Coin
    crypto_name:Revain
    crypto_name:Huobi Token
    crypto_name:OKB
    crypto_name:Decentraland
    crypto_name:SushiSwap
    crypto_name:ICON
    crypto_name:XDC Network
    crypto_name:Qtum
    crypto_name:TrueUSD
    crypto_name:yearn.finance
    crypto_name:Nexo
    crypto_name:Celsius
    crypto_name:Bitcoin Gold
    crypto_name:Curve DAO Token
    crypto_name:Mina
    crypto_name:KuCoin Token
    crypto_name:Zilliqa
    crypto_name:Perpetual Protocol
    crypto_name:Ren
    crypto_name:dYdX
    crypto_name:Ravencoin
    crypto_name:Synthetix
    crypto_name:renBTC
    crypto_name:Telcoin
    crypto_name:Basic Attention Token
    crypto_name:Horizen
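
For the Selenium route described in the question, one common workaround is to scroll the page programmatically so that rows rendered only when they come into view get a chance to load. The following is a minimal sketch under that assumption; `scroll_until_loaded` and its parameters are hypothetical names, and whether this loads every row on this particular site has not been verified. It relies only on the driver's `execute_script` method.

```python
import time


def scroll_until_loaded(driver, step_px=800, pause_s=0.5, max_steps=50):
    """Scroll down in fixed steps so lazily rendered content can load.

    `driver` is any object exposing Selenium's `execute_script` method.
    The page height is re-read on every step, so content that makes the
    page grow is also scrolled through. Returns the number of steps taken.
    """
    steps = 0
    position = 0
    while steps < max_steps:
        height = driver.execute_script("return document.body.scrollHeight")
        if position >= height:
            break  # the last scroll target already reached the bottom
        position += step_px
        driver.execute_script("window.scrollTo(0, arguments[0]);", position)
        time.sleep(pause_s)  # give the page's JavaScript time to render
        steps += 1
    return steps
```

In the question's script this would be called between `driver.get(...)` and `driver.page_source`, for example `scroll_until_loaded(driver)`. The step size and pause are guesses and may need tuning.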
    

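    【Discussion】:

    • Hey, thanks for the help, the code definitely gets me all the names. How would I get other data from it, such as the price? I am trying to understand how to access that data, but I am having trouble understanding the structure of the dicts we are using here.
    • You can scrape whatever data you like. To get the data "through the back door", the URL you request has to be the API URL, which is the one used in the answer. In your browser's developer tools, go to the Network tab > XHR/Fetch > Headers and you will see this URL; if you click Preview you can also see the data.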
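
To make the dict structure asked about above more concrete, here is a minimal sketch of pulling a price out of each entry. It assumes that every item in `cryptoCurrencyList` carries a `quotes` list whose elements have `name` (the currency code) and `price` keys; that shape is an assumption to check against the Preview tab, and `extract_prices` is a hypothetical helper. The sample payload is hand-written with placeholder values, not real API output.

```python
def extract_prices(payload, currency="USD"):
    """Return (coin name, price) pairs from a listing payload.

    Assumes each entry has a "quotes" list of dicts with "name" and
    "price" keys. Entries without a quote in `currency` get price None.
    """
    rows = []
    for item in payload.get("data", {}).get("cryptoCurrencyList", []):
        price = None
        for quote in item.get("quotes", []):
            if quote.get("name") == currency:
                price = quote.get("price")
                break
        rows.append((item.get("name"), price))
    return rows


# Hand-written placeholder data in the assumed shape (not real API output).
sample = {
    "data": {
        "cryptoCurrencyList": [
            {"name": "Coin A", "quotes": [{"name": "BTC", "price": 0.5},
                                          {"name": "USD", "price": 1.0}]},
            {"name": "Coin B", "quotes": [{"name": "BTC", "price": 0.25}]},
        ]
    }
}

for name, price in extract_prices(sample):
    print(name, price)
```

If the real response matches this shape, `extract_prices(r.json())` would work with the answer's `r`. Printing `item.keys()` for a single entry is a quick way to see which fields are actually present.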