Web scraping an element that does not appear in BeautifulSoup: answers

【Question title】: Web scraping an element that does not appear in BeautifulSoup
【Posted】: 2020-07-23 19:48:47
【Question description】:

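I am trying to extract all course information from the Udacity catalog website.
When I try to extract the price from any course page, the result contains "null months access" and an empty price, as shown below for the Data Analyst course: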

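import requests
from bs4 import BeautifulSoup

# Download the course page and parse the static HTML returned by the server
page_req = requests.get('https://www.udacity.com/course/data-analyst-nanodegree--nd002')
page_soup = BeautifulSoup(page_req.content, 'html.parser')
page_soup.find('div', class_='price-cards').find('div', class_='price-card bundle')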

<div class="price-card bundle"><div class="flag"><p class="flag__text">10% OFF</p></div><div 
class="price-info"><div class="price-info__deal" hidden="">BEST DEAL</div><div class="title h6">null 
months access</div><div class="price"><span class="price__payable"><span class="skeleton 
skeleton__default"><span style="width:100px"> </span></span></span><span class="price__label"><span 
class="current-price"> per month</span></span></div><p class="blurb">Start learning today! Switch to 
the monthly price afterwards if more time is needed.</p><div class="enroll-button__container"></div> 
</div></div>

So how can I get the course price?

Note: the price varies by country (e.g. US dollars in the United States, euros in Italy).
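
The output above contains a span with the classes "skeleton skeleton__default" where the price should be, which suggests the server sends a placeholder and the browser fills in the price later with JavaScript. A minimal sketch for checking a page for such placeholders is below. It uses only the standard library so that it runs without extra installs, and the class name "skeleton" is taken from the output above; the helper names are made up for illustration. With BeautifulSoup the equivalent check would be something like page_soup.select('.skeleton').

```python
from html.parser import HTMLParser


class SkeletonFinder(HTMLParser):
    """Counts start tags whose class list contains the 'skeleton' class."""

    def __init__(self):
        super().__init__()
        self.skeleton_count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get("class") or "").split()
        if "skeleton" in classes:
            self.skeleton_count += 1


def has_placeholder(html: str) -> bool:
    """Return True if the HTML still contains skeleton placeholders."""
    finder = SkeletonFinder()
    finder.feed(html)
    return finder.skeleton_count > 0


# Fragment copied from the output shown in the question
snippet = (
    '<div class="price"><span class="price__payable">'
    '<span class="skeleton skeleton__default">'
    '<span style="width:100px"> </span></span></span></div>'
)
print(has_placeholder(snippet))
```

If this returns True for the downloaded page, the value is most likely loaded by a separate request, so parsing the static HTML will not find it.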

【Question discussion】:

Tags: python web-scraping beautifulsoup request

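【Solution 1】:

The easiest way to scrape a modern website is to watch its network traffic. Open the browser's developer tools (Ctrl + Shift + I), select the Network tab, and tick Preserve log and Disable cache. Then filter for XHR only, reload the page, and watch the network calls.

When I requested your URL, the browser made a GET call to a Udacity API endpoint. That call can be simulated in Python: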

from requests import Session

with Session() as session:

    URI = 'https://braavos.udacity.com/api/prices'

    params = dict(item='urn:x-udacity:item:nd-unit:10153',
                  price_sheet='regular',
                  currency='USD',
                 )
    response = session.get(url=URI, params=params)
    response.raise_for_status()  # fail early on an HTTP error status
    data = response.json()

print(type(data)) # <class 'dict'>
print(data) # a dict, so it can be accessed like any other dict

# examples
print(data['results'][0]['payment_plans']['upfront_recurring']['description'])
# 'one time payment of $1,436 USD, followed by $399 USD every 1 month'

print(data['results'][0]['payment_plans']['recurring']['description'])
# '$399 USD every 1 month'
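
Chained lookups like the ones above raise KeyError or IndexError if a course has no such plan or the response shape differs. A small sketch of a more forgiving accessor follows; it assumes nothing about the response beyond the keys shown above, and the function name plan_description and the sample payload are made up for illustration.

```python
from typing import Optional


def plan_description(data: dict, plan: str) -> Optional[str]:
    """Return the description of a payment plan, or None if a key is missing."""
    results = data.get("results") or []
    if not results:
        return None
    plans = results[0].get("payment_plans") or {}
    return (plans.get(plan) or {}).get("description")


# Sample payload limited to the keys used in the answer above
sample = {
    "results": [
        {
            "payment_plans": {
                "recurring": {"description": "$399 USD every 1 month"},
            }
        }
    ]
}

print(plan_description(sample, "recurring"))          # plan is present
print(plan_description(sample, "upfront_recurring"))  # plan is missing
print(plan_description({}, "recurring"))              # empty response
```

Returning None instead of raising makes it easier to loop over many courses and skip the ones without a given plan.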

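【Discussion】:

It worked, thanks for your help. But I want to extract the prices of all 250 courses. With this approach I have to open every course page to find its nd-unit value and pass it to requests.Session.get(). Is there a way to get the nd-unit value of every course automatically so I can extract the prices?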

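【Solution 2】:

Try the script below. I implemented it for https://www.udacity.com/course/data-analyst-nanodegree--nd002 using the site's API, which is one of the best ways to get data from an endpoint. You can find the API calls in the Network section of the developer tools: press CTRL+SHIFT+I and filter the Network tab by XHR to see them all. With requests you can call the API, and it sends back a result that you then parse as JSON.

Benefits of requesting the API URL directly:

Less code.
More reliable, with fewer errors.
Fairly fast responses.
Easy to access.

If you look at the script, it currently extracts the recurring and upfront recurring payment plans you want to scrape. In the same way you can access anything else in the JSON result, such as promotions or original prices. I have also made the URL dynamic: you can pass the currency abbreviation of any country and it will give you the matching result. For example, I passed USD for the United States; for Italy or Europe you can pass EUR in the currency variable (capital letters only).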

import requests

def scrape_prices(currency='USD'):
    # currency must be an uppercase code such as 'USD' or 'EUR'
    url = 'https://braavos.udacity.com/api/prices'
    params = {
        'item': 'urn:x-udacity:item:nd-unit:10153',
        'price_sheet': 'regular',
        'anonymous_id': 'ae9be6e5-97af-48ee-ab3d-63456a8cb38f',  # value captured from the browser request
        'currency': currency,
    }
    session = requests.Session()
    # TLS verification is left at its default (enabled) rather than passing verify=False
    response = session.get(url, params=params)
    result = response.json()
    extracted_payment_plans_recurring = result['results'][0]['payment_plans']['recurring']
    extracted_payment_plan_upfront = result['results'][0]['payment_plans']['upfront_recurring']
    print('-' * 100)
    print('Payment Plans Recurring: ', extracted_payment_plans_recurring)
    print('-' * 100)
    print('Payment Plans Up front Recurring: ', extracted_payment_plan_upfront)
    print('-' * 100)

scrape_prices()

[Screenshots: recurring payment plan result; upfront recurring payment plan result; API URL on the Udacity website; JSON result for the price]
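
Since the currency has to be passed in capital letters, it can help to build the request URL in one place and normalize the code there. The sketch below only builds the URL and sends nothing; the function name prices_url is made up for illustration, and the query parameters are the ones that appear in the answers in this thread (without anonymous_id, which Solution 1 also leaves out).

```python
from urllib.parse import urlencode

BASE_URL = "https://braavos.udacity.com/api/prices"


def prices_url(unit_id: int, currency: str) -> str:
    """Build the prices URL for one nd-unit ID and one currency code."""
    query = urlencode(
        {
            "item": f"urn:x-udacity:item:nd-unit:{unit_id}",
            "price_sheet": "regular",
            "currency": currency.upper(),  # the API expects e.g. 'USD', 'EUR'
        }
    )
    return f"{BASE_URL}?{query}"


# 10153 is the nd-unit ID used in the answers for the Data Analyst course
for code in ("USD", "eur"):
    print(prices_url(10153, code))
```

The resulting string can be passed straight to session.get(); urlencode percent-encodes the colons in the item value, which the server should decode back to the same value.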

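【Discussion】:

It worked, thanks for your help. But I want to extract the prices of all 250 courses. With this approach I have to open every course page to find its nd-unit value and pass it to requests.Session.get(). Is there a way to get the nd-unit value of every course automatically so I can extract the prices?
Can you provide a link to the page that lists all the courses?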
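
The thread does not say where the nd-unit IDs for all courses can be found, so that part stays open. Once a list of IDs is available, though, the loop itself could look roughly like the sketch below. Everything here is illustrative: collect_prices and fake_fetch are made-up names, 99999 is a made-up ID, and in real use the fetch argument would wrap the session.get(...) call from the answers above.

```python
from typing import Callable, Dict, Iterable, Optional


def collect_prices(
    unit_ids: Iterable[int],
    fetch: Callable[[int], dict],
) -> Dict[int, Optional[str]]:
    """Map each unit ID to its recurring-plan description, or None if missing."""
    prices: Dict[int, Optional[str]] = {}
    for unit_id in unit_ids:
        try:
            data = fetch(unit_id)
            plan = data["results"][0]["payment_plans"]["recurring"]
            prices[unit_id] = plan["description"]
        except (KeyError, IndexError, TypeError):
            # Unexpected response shape: record the gap and keep going
            prices[unit_id] = None
    return prices


def fake_fetch(unit_id: int) -> dict:
    """Stand-in for the real request; only unit 10153 has data here."""
    if unit_id == 10153:
        return {
            "results": [
                {"payment_plans": {"recurring": {"description": "$399 USD every 1 month"}}}
            ]
        }
    return {"results": []}


print(collect_prices([10153, 99999], fake_fetch))
```

Passing the fetch function in as an argument keeps the loop testable without network access and makes it easy to add a delay or retries around the real request later.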