Difficulties in web scraping using BeautifulSoup: answers

【Question title】: Difficulties in web scraping using BeautifulSoup
【Posted】: 2020-06-12 05:16:45
【Question description】:

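I am trying to scrape prices from the following website using BeautifulSoup in a Python Jupyter notebook. The element I want has a unique class, "price average". I tried find_all, but I could not scrape it. Can someone help me see what is wrong?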

URL: https://otc.hbg.com/en-us/trade/buy-usdt/

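import requests
from bs4 import BeautifulSoup

URL = 'https://otc.hbg.com/en-us/trade/buy-btc/'

# Download the page and parse the HTML that the server returns
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

# Look for the divs whose class attribute is "price average"
containers = soup.find_all("div", {"class": "price average"})
containers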

【Question comments】:

Tags: python web-scraping beautifulsoup

【Solution 1】:

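The data is loaded dynamically via JavaScript, so it is not in the HTML that requests downloads. However, you can use the requests module to fetch the necessary information from the API the page calls: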

import json
import requests


url = 'https://otc-api.hbg.com/v1/data/trade-market?coinId=1&currency=3&tradeType=sell&currPage=1&payMethod=0&country=153&blockType=general&online=1&range=0&amount='
data = requests.get(url).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for d in data['data']:
    print('{:<30}{}'.format(d['userName'], d['price']))

Prints:

CRXzone.com                   13286.95
CRXzone.com                   13352.66
cryptotil                     13352.66
coinhub                       13365.81
btcsg                         13470.94
108057692                     13470.94
yjyjyj                        13536.66
Silkroad1015                  13668.08
btcsg                         14193.78
digicryp                      16427.98
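
To work with the prices instead of only printing them, the loop above can be turned into a small helper. This is a minimal sketch, assuming the payload has the shape implied by the answer's code: a top-level `data` list whose items carry `userName` and `price`. The sample payload is hand-built from three of the printed rows, and `cheapest_offer` is a hypothetical helper name, not part of the API.

```python
import json

# Hand-built sample, shaped like the response the answer's loop expects.
# Only 'userName' and 'price' are known from the answer; a real response
# may carry more fields, and the price type (number or string) is assumed.
SAMPLE = json.loads("""
{"data": [
  {"userName": "CRXzone.com", "price": 13286.95},
  {"userName": "cryptotil", "price": 13352.66},
  {"userName": "coinhub", "price": 13365.81}
]}
""")


def cheapest_offer(payload):
    """Return (userName, price) of the lowest-priced offer, or None if empty."""
    offers = payload.get('data') or []
    if not offers:
        return None
    # float() tolerates prices delivered either as numbers or as strings
    best = min(offers, key=lambda d: float(d['price']))
    return best['userName'], float(best['price'])


print(cheapest_offer(SAMPLE))
```

In real use, the `data` variable from the answer (the result of `requests.get(url).json()`) would be passed in place of `SAMPLE`.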

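【Comments】:

How can I make the URL dynamic? The URL changes every day.
Do I need to use an API?
@HanishKiran The data is loaded dynamically, so you can use selenium and scrape the data with it. What are the other URLs?
OK! The numbers refresh periodically, and I think the URL changes whenever the data changes.
@HanishKiran You can shorten the URL to https://otc-api.hbg.com/v1/data/trade-market?coinId=1&currency=3&tradeType=sell&blockType=general Maybe that helps.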
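
One way to approach the "dynamic URL" question is to build the query string from a dictionary, so that individual parameters can be changed without editing a long string. This is only a sketch under assumptions: the parameter names and values are copied from the URLs in this thread, their meaning (for example which coin `coinId=1` denotes) is not documented here, and `build_trade_market_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE_URL = 'https://otc-api.hbg.com/v1/data/trade-market'

# Parameter values taken from the shortened URL in the thread.
# What each value means is not documented here (assumption: they
# select the coin, the currency and the side of the trade).
DEFAULT_PARAMS = {
    'coinId': 1,
    'currency': 3,
    'tradeType': 'sell',
    'blockType': 'general',
}


def build_trade_market_url(**overrides):
    """Return the endpoint URL with the default parameters, optionally overridden."""
    params = {**DEFAULT_PARAMS, **overrides}
    return BASE_URL + '?' + urlencode(params)


print(build_trade_market_url())
print(build_trade_market_url(currPage=2))
```

With requests, the same dictionary can be passed directly as `requests.get(BASE_URL, params=params)`, which encodes the query string for you.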

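【Solution 2】:

What you want to achieve cannot be done with a plain web scraper, because the element you want to scrape requires additional user interaction before it even appears on the page. Look into web automation with Selenium to get the result you want.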

【Comments】:

【Solution 3】:

You don't need Selenium. Just query their API:

import requests

url = 'https://otc-api-hk.eiijo.cn/v1/data/trade-market?coinId=2&currency=4&tradeType=sell&currPage=1&payMethod=0&country=74&blockType=general&online=1&range=0&amount='

resp = requests.get(url).json()

for i in resp['data']:
    print(i['price'])

【Comments】:

How can I make the URL dynamic? The URL is not static; it changes every day.