【Question title】: Python web scraping HTML with same class
【Posted】: 2020-12-03 00:53:50
【Question】:

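I would like to ask how to extract the event fee from this website using the Python library BeautifulSoup for web scraping.

However, the event fee shares the same class as other fields. Are there any suggestions for extracting only the fee? I have tried `find_next`, `find_next_sibling`, and looking up the next parent, but none of them worked. Below is the original HTML where the price class appears: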

<div class="eds-event-card-content__sub eds-text-bm eds-text-color--ui-600 eds-l-mar-top-1 eds-event-card-content__sub--cropped">Free</div>

Any help would be greatly appreciated.

Below is the code I have tried. I only get a list of tags in my array.

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.eventbrite.com/d/malaysia--kuala-lumpur--85675181/all-events/?page=1'

response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser")

#Finding common container for each event
containers = soup.find_all('article', class_ = 'eds-l-pad-all-4 eds-event-card-content eds-event-card-content--list eds-event-card-content--standard eds-event-card-content--fixed eds-l-pad-vert-3')

event_fees = []

for container in containers:
        fees = soup.select('div', class_ ='eds-event-card-content__sub eds-text-bm eds-text-color--ui-600 eds-l-mar-top-1 eds-event-card-content__sub--cropped')
        event_fees.append(fees.txt)

【Question comments】:

    Tags: python web beautifulsoup


    【Solution 1】:

    The price data is loaded from an external URL. You can use the requests/json modules to fetch it:

    import re
    import json
    import requests


    # Search results page, and the JSON endpoint the page calls for event details
    url = "https://www.eventbrite.com/d/malaysia--kuala-lumpur--85675181/all-events/?page=1"
    events_url = 'https://www.eventbrite.com/api/v3/destination/events/?event_ids={event_ids}&expand=event_sales_status,primary_venue,image,saves,my_collections,ticket_availability&page_size=99999'
    html_text = requests.get(url).text

    # The page embeds its search results as JSON assigned to window.__SERVER_DATA__
    data1 = json.loads( re.search(r'window\.__SERVER_DATA__ = ({.*});', html_text).group(1) )

    # uncomment this to print all data:
    # print(json.dumps(data1, indent=4))

    # Collect the event IDs and request their details (including ticket prices) in one call
    event_ids = ','.join(r['id'] for r in data1['search_data']['events']['results'])
    data2 = requests.get(events_url.format(event_ids=event_ids)).json()

    # uncomment this to print all data:
    # print(json.dumps(data2, indent=4))

    # Print each event's name and its minimum - maximum ticket price
    for e in data2['events']:
        print(e['name'])
        print(e['ticket_availability']['minimum_ticket_price']['display'],'-',e['ticket_availability']['maximum_ticket_price']['display'])
        print('-' * 80)
    

    Prints:

    Mega Career Fair & Post Graduate Education Fair 2020 - Mid Valley KL
    0.00 MYR - 0.00 MYR
    --------------------------------------------------------------------------------
    Post Graduate Education Fair 2020 - Mid Valley KL
    0.00 MYR - 0.00 MYR
    --------------------------------------------------------------------------------
    Traders Fair 2021 - Malaysia (Financial Education Event)
    0.00 USD - 199.00 USD
    --------------------------------------------------------------------------------
    THE FIT Malaysia
    0.00 MYR - 0.00 MYR
    --------------------------------------------------------------------------------
    Walk-In Interview with Career Partners of HRDF
    0.00 USD - 0.00 USD
    --------------------------------------------------------------------------------
    Entrepreneurship for Beginners - Startup | Entrepreneur Hackathon Webinar
    0.00 EUR - 0.00 EUR
    --------------------------------------------------------------------------------
    Good Shepherd Catholic Church  English Mass Registration- Scroll Down  pls
    0.00 USD - 0.00 USD
    --------------------------------------------------------------------------------
    CGH 10:00am Assumption Mass Registration
    0.00 USD - 0.00 USD
    --------------------------------------------------------------------------------
    Kuala Lumpu Video Speed Dating - Filter Off
    0.00 USD - 0.00 USD
    --------------------------------------------------------------------------------
    Wiki Finance EXPO Kuala Lumpur 2021
    0.00 USD - 0.00 USD
    --------------------------------------------------------------------------------
    English Sunday Service - 16 AUGUST
    0.00 USD - 0.00 USD
    --------------------------------------------------------------------------------
    Good Shepherd Catholic  Bahasa Malaysia Mass Registration. Pls scroll down
    0.00 USD - 0.00 USD
    --------------------------------------------------------------------------------
    How To Improve Your Focus and Limit Distractions - Kuala Lumpur
    0.00 USD - 0.00 USD
    --------------------------------------------------------------------------------
    ANNUAL GENERAL MEETING
    0.00 USD - 0.00 USD
    --------------------------------------------------------------------------------
    ITS ALL ABOUT PORTRAIT
    0.00 USD - 0.00 USD
    --------------------------------------------------------------------------------
    First service (English)
    0.00 USD - 0.00 USD
    --------------------------------------------------------------------------------
    KL International Flea Market 2020 / Bazaar Antarabangsa Kuala Lumpur
    0.00 USD - 0.00 USD
    --------------------------------------------------------------------------------
    Branding Strategies For Startups
    10.50 MYR - 31.50 MYR
    --------------------------------------------------------------------------------
    SHC 9.15am Sunday Mass Registration
    0.00 USD - 0.00 USD
    --------------------------------------------------------------------------------
    SHC 9.15am Sunday Mass (Tamil) திருஇருதய ஆண்டவர் ஆலயத்தில்  காலை  9.15க்கு
    0.00 USD - 0.00 USD
    --------------------------------------------------------------------------------
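
The extraction step in the code above (pulling the JSON out of `window.__SERVER_DATA__` and joining the event IDs) can be tried offline. The following is a minimal sketch, not the page's real payload: the `html_text` string is a hypothetical, heavily trimmed stand-in for the page source, and `extract_server_data` is a helper name introduced here for illustration only.

```python
import json
import re

# Hypothetical stand-in for the downloaded page source; the real page is far larger.
html_text = """
<html><head><script>
window.__SERVER_DATA__ = {"search_data": {"events": {"results": [{"id": "111"}, {"id": "222"}]}}};
</script></head><body></body></html>
"""


def extract_server_data(html):
    """Return the JSON object assigned to window.__SERVER_DATA__, or None if absent."""
    # Same pattern as in the answer, with the braces escaped for readability.
    match = re.search(r"window\.__SERVER_DATA__ = (\{.*\});", html)
    if match is None:
        return None
    return json.loads(match.group(1))


data = extract_server_data(html_text)

# Join the IDs into the comma-separated form used in the event_ids query parameter.
event_ids = ",".join(r["id"] for r in data["search_data"]["events"]["results"])
print(event_ids)  # 111,222
```

Returning `None` when the pattern is missing makes it easier to notice if the page layout changes, instead of failing with an `AttributeError` on `.group(1)`.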
    

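    【Comments】:

    • How do I find the external URL? Could you explain how you obtained data1, event_ids and data2? I really don't understand the code. Thanks.
    • @ShawnTeh When you run print(soup) in your code, you will see that it contains no prices, so the page must fetch them dynamically via JavaScript. The page loads the information as JSON from an external URL. I found that URL by opening Firefox Developer Tools -> Network tab and searching for the pricing information there.
    • How do I search for the price in Google Chrome? I see that Chrome's developer tools also have a Network tab.
    • @ShawnTeh Look at where the page makes JSON-type requests and search in the responses.
    • Could you please explain the logic and comment the code so I can try to understand it? Thanks.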
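
For reference, the loop in the question has a few separate problems apart from the missing data: it searches the whole `soup` instead of each `container`, `select` expects a CSS selector string rather than a `class_` keyword, and text is read from a single element with `.text` or `get_text()`, not `.txt`. The sketch below shows a per-container lookup on static markup. It is an illustration only: the `<article>` wrapper and the first `<div>` are invented for the demo, and only the second `<div>` is the element quoted in the question. As noted above, the live page does not contain the prices in the HTML that `requests` receives, so this alone would not fix the original script.

```python
from bs4 import BeautifulSoup

# Hypothetical static markup: the <article> wrapper and the first <div> are
# invented for this demo; the second <div> is the element quoted in the question.
html = """
<article class="eds-event-card-content">
  <div class="eds-event-card-content__primary-content">Some event</div>
  <div class="eds-event-card-content__sub eds-text-bm eds-text-color--ui-600 eds-l-mar-top-1 eds-event-card-content__sub--cropped">Free</div>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

event_fees = []
for container in soup.find_all("article", class_="eds-event-card-content"):
    # Search inside this container (not the whole soup) and match on one
    # distinctive class; in a CSS selector a class is written as ".name".
    fee = container.select_one("div.eds-event-card-content__sub--cropped")
    if fee is not None:
        event_fees.append(fee.get_text(strip=True))

print(event_fees)  # ['Free']
```

Matching on a single distinctive class is usually less brittle than passing the full space-separated class string, since the order and number of utility classes can change.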