Is there a way to make a dynamic web page automatically run its JavaScript when webscraping with Python?

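【Question title】: Is there a way to make a dynamic web page automatically run its JavaScript when webscraping with Python?
【Posted】: 2021-12-08 16:17:56
【Question description】:

I have been running into a lot of problems while trying to do some Python web scraping with BeautifulSoup. Because this particular page is dynamic, I have been trying to use Selenium to "open" the page first and then process the dynamic content with BeautifulSoup.

The problem is that the dynamic content only shows up in my HTML output when I manually scroll through the site while the program is running. Otherwise those parts of the HTML stay empty, just as if I were using BeautifulSoup on its own without Selenium.

Here is my code: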

import time
from bs4 import BeautifulSoup
from selenium import webdriver

if __name__ == "__main__":

    options = webdriver.ChromeOptions()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--incognito')
    # options.add_argument('--headless')

    # Raw string so the backslashes in the Windows path are not treated as escape sequences
    driver = webdriver.Chrome(r"C:\Program Files (x86)\chromedriver.exe", chrome_options=options)
    driver.get('https://coinmarketcap.com/')
    time.sleep(5)

    html = driver.page_source

    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.tbody
    trs = tbody.contents

    for tr in trs:
        print(tr)

    driver.close()

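Now, if I open Chrome through Selenium with the headless option enabled, I get the same output I would normally get without the page being preloaded. The same thing happens when I am not in headless mode and just let the page load by itself without manually scrolling through the content. Does anyone know why? Is there a way to make the dynamic content load without manually scrolling every time I run the code?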

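【Question discussion】:

Tags: python selenium beautifulsoup

【Solution 1】:

The data is actually loaded dynamically by JavaScript, so you can easily scrape it from the JSON response of the API call instead:

Here is a working example:

Code: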

import requests

# JSON endpoint that the page itself calls (visible in the browser's Network tab)
url = 'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&limit=100&sortBy=market_cap&sortType=desc&convert=USD,BTC,ETH&cryptoType=all&tagType=all&audited=false&aux=ath,atl,high24h,low24h,num_market_pairs,cmc_rank,date_added,max_supply,circulating_supply,total_supply,volume_7d,volume_30d'
r = requests.get(url, timeout=30)
r.raise_for_status()  # fail early on an HTTP error instead of parsing an error page

# Each entry in cryptoCurrencyList is a dict describing one coin
for item in r.json()['data']['cryptoCurrencyList']:
    name = item['name']
    print('crypto_name:' + str(name))

Output:

crypto_name:Bitcoin
crypto_name:Ethereum
crypto_name:Binance Coin     
crypto_name:Cardano
crypto_name:Tether
crypto_name:Solana
crypto_name:XRP
crypto_name:Polkadot
crypto_name:USD Coin
crypto_name:Dogecoin
crypto_name:Terra
crypto_name:Uniswap
crypto_name:Wrapped Bitcoin  
crypto_name:Litecoin
crypto_name:Avalanche        
crypto_name:Binance USD      
crypto_name:Chainlink        
crypto_name:Bitcoin Cash     
crypto_name:Algorand
crypto_name:SHIBA INU        
crypto_name:Polygon
crypto_name:Stellar
crypto_name:VeChain
crypto_name:Internet Computer
crypto_name:Cosmos
crypto_name:FTX Token
crypto_name:Filecoin
crypto_name:Axie Infinity
crypto_name:Ethereum Classic
crypto_name:TRON
crypto_name:Bitcoin BEP2
crypto_name:Dai
crypto_name:THETA
crypto_name:Tezos
crypto_name:Fantom
crypto_name:Hedera
crypto_name:NEAR Protocol
crypto_name:Elrond
crypto_name:Monero
crypto_name:Crypto.com Coin
crypto_name:PancakeSwap
crypto_name:EOS
crypto_name:The Graph
crypto_name:Flow
crypto_name:Aave
crypto_name:Klaytn
crypto_name:IOTA
crypto_name:eCash
crypto_name:Quant
crypto_name:Bitcoin SV
crypto_name:Neo
crypto_name:Kusama
crypto_name:UNUS SED LEO
crypto_name:Waves
crypto_name:Stacks
crypto_name:TerraUSD
crypto_name:Harmony
crypto_name:Maker
crypto_name:BitTorrent
crypto_name:Celo
crypto_name:Helium
crypto_name:OMG Network
crypto_name:THORChain
crypto_name:Dash
crypto_name:Amp
crypto_name:Zcash
crypto_name:Compound
crypto_name:Chiliz
crypto_name:Arweave
crypto_name:Holo
crypto_name:Decred
crypto_name:NEM
crypto_name:Theta Fuel
crypto_name:Enjin Coin
crypto_name:Revain
crypto_name:Huobi Token
crypto_name:OKB
crypto_name:Decentraland
crypto_name:SushiSwap
crypto_name:ICON
crypto_name:XDC Network
crypto_name:Qtum
crypto_name:TrueUSD
crypto_name:yearn.finance
crypto_name:Nexo
crypto_name:Celsius
crypto_name:Bitcoin Gold
crypto_name:Curve DAO Token
crypto_name:Mina
crypto_name:KuCoin Token
crypto_name:Zilliqa
crypto_name:Perpetual Protocol
crypto_name:Ren
crypto_name:dYdX
crypto_name:Ravencoin
crypto_name:Synthetix
crypto_name:renBTC
crypto_name:Telcoin
crypto_name:Basic Attention Token
crypto_name:Horizen
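
If staying with Selenium is preferred, the manual scrolling described in the question can in principle be automated. The following is a minimal sketch, not a tested solution for this particular site: it assumes the rows are rendered lazily as they scroll into view, and the names `scroll_page` and `fetch_rendered_html`, as well as the step size and pause, are illustrative choices rather than anything from the original posts.

```python
import time


def scroll_page(driver, step=800, pause=0.5, max_steps=200):
    """Scroll down in small steps so lazily rendered content has a chance to load.

    Returns the number of scroll steps performed.
    """
    steps = 0
    while steps < max_steps:
        # Move the viewport down by `step` pixels
        driver.execute_script("window.scrollBy(0, arguments[0]);", step)
        steps += 1
        time.sleep(pause)  # assumed to be long enough for new rows to render
        # Stop once the bottom of the viewport reaches the end of the document
        at_bottom = driver.execute_script(
            "return window.innerHeight + window.pageYOffset"
            " >= document.body.scrollHeight - 2;"
        )
        if at_bottom:
            break
    return steps


def fetch_rendered_html(url):
    """Open `url` in headless Chrome, scroll through it, and return the HTML."""
    # Imported here so scroll_page can be used without Selenium installed
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        scroll_page(driver)
        return driver.page_source
    finally:
        driver.quit()
```

The returned HTML can then be passed to `BeautifulSoup(html, "html.parser")` as in the question. Scrolling in steps rather than jumping straight to the bottom is a deliberate choice here, since content in the middle of the page might otherwise never enter the viewport.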

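【Discussion】:

Hey, thanks for the help, the code definitely gets me all the names. How would I go about getting other data from it, such as the price? I am trying to work out how to access that data, but I am having a hard time understanding the structure of the dicts we are using here.
You can scrape whatever data you like. To get the data through the back door, the URL you request has to be the API URL, and the one above is the API URL. Open the Network tab in the browser's developer tools > XHR/Fetch > Headers and you will see this URL; if you click Preview, you can also see the data.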
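
To make the structure of the nested dicts easier to explore from Python as well, one option is a small helper that reports where a given key occurs in the parsed JSON. This is only a sketch: `find_paths` is a hypothetical helper, and the `quotes` and `price` fields and all values in `sample` are made up for illustration, so the real response may be laid out differently.

```python
def find_paths(obj, target_key, path=()):
    """Yield (path, value) for every occurrence of target_key in nested dicts/lists."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key == target_key:
                yield path + (key,), value
            # Keep descending, the key may also occur deeper down
            yield from find_paths(value, target_key, path + (key,))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from find_paths(value, target_key, path + (index,))


# Made-up stand-in for r.json(); only 'data', 'cryptoCurrencyList' and 'name'
# are taken from the answer above, everything else is an assumption.
sample = {
    "data": {
        "cryptoCurrencyList": [
            {"name": "ExampleCoin", "quotes": [{"name": "USD", "price": 1.5}]}
        ]
    }
}

for path, value in find_paths(sample, "price"):
    print(path, value)
# prints: ('data', 'cryptoCurrencyList', 0, 'quotes', 0, 'price') 1.5
```

Running `find_paths(r.json(), 'price')` on the real response would show which chain of keys and list indices leads to each price, and that chain can then be written out as ordinary indexing inside the loop from the answer.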