【Question title】: Beautiful Soup Scraping
【Posted】: 2021-01-15 08:28:06
【Question】:

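Code of mine that used to work no longer runs correctly.

My Python code uses Beautiful Soup to scrape a website and extract event data (date, event, link).

The code extracts all events located in the tbody. Each event is stored in a <tr class="Box">. The problem is that my scraper seems to stop after this <tr style="box-shadow: none;">. Once it reaches that section (a block of 3 ads for events on the site, which I don't want to scrape), the code stops extracting event data from the <tr class="Box"> rows. Is there a way to skip this kind of styled tr and ignore similar cases in the future?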

import pandas as pd
import urllib.request
import warnings

from bs4 import BeautifulSoup

warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

source = urllib.request.urlopen('https://10times.com/losangeles-us/technology/conferences').read()
soup = BeautifulSoup(source, 'html.parser')

# ---Get Event Data---
test1 = []
table = soup.find('tbody')
table_rows = table.find_all('tr')  # find table rows (tr)
for tr in table_rows:
    data = tr.find_all('td')  # find table data (td)
    row = [td.text for td in data]
    if len(row) > 2:  # Excludes rows with only event name/link, but no data.
        test1.append(row)
test1

【Comments】:

    Tags: python-3.x python-2.7 web-scraping beautifulsoup


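    【Solution 1】:

    The data is loaded dynamically via JavaScript, so the HTML you download does not contain the remaining results. You can request further pages from the site's Ajax URL, as in this example: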

    import requests
    from bs4 import BeautifulSoup
    
    
    # Ajax URL that returns further result pages for the listing
    url = "https://10times.com/ajax?for=scroll&path=/losangeles-us/technology/conferences"
    params = {"page": 1, "ajax": 1}
    headers = {"X-Requested-With": "XMLHttpRequest"}
    
    for page in range(1, 4):  # <-- increase number of pages here
        params["page"] = page
        print("Page {}..".format(page))
        soup = BeautifulSoup(
            requests.get(url, headers=headers, params=params).content,
            "html.parser",
        )
        # Only rows whose class attribute is exactly "box" are event rows
        for tr in soup.select('tr[class="box"]'):
            tds = [td.get_text(strip=True, separator=" ") for td in tr.select("td")]
            print(tds)
    

    Prints:

    Page 1..
    ['Tue, 29 Sep - Thu, 01 Oct 2020', 'Lens Los Angeles', 'Intercontinental Los Angeles Downtown, Los Angeles', 'LENS brings together the entire Degreed community - our clients, invited prospective clients, thought leaders, partners, employees, executives, and industry experts for two days of discussion, workshops,...', 'Business Services IT & Technology', 'Interested']
    ['Wed, 30 Sep - Sat, 03 Oct 2020', 'FinCon', 'Long Beach Convention & Entertainment Center, Long Beach 20.1 Miles from Los Angeles', 'FinCon will be helping financial influencers and brands create better content, reach their audience, and make more money. Collaborate with other influencers who share your passion for making personal finance...', 'Banking & Finance IT & Technology', 'Interested 7 following']
    ['Mon, 05  - Wed, 07 Oct 2020', 'NetDiligence Cyber Risk Summit', 'Loews Santa Monica Beach Hotel, Santa Monica 14.6 Miles from Los Angeles', 'NetDiligence Cyber Risk Summit will conference are attended by hundreds of cyber risk insurance, legal/regulatory and security/privacy technology leaders from all over the world. Connect with leaders in...', 'IT & Technology', 'Interested']
    
    ... etc.
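
On the original question of skipping the ad rows: selecting rows by class, instead of iterating over every tr in the tbody, should leave out rows that only carry an inline style. The following is a minimal sketch on made-up markup, not the site's actual HTML. The sample HTML, the function name extract_event_rows, and the lowercase class name "box" (taken from the selector used above) are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup imitating the structure described in the question:
# event rows carry class="box", while the ad row only has an inline style.
SAMPLE_HTML = """
<table><tbody>
  <tr class="box"><td>Tue, 29 Sep 2020</td><td>Lens Los Angeles</td><td>Los Angeles</td></tr>
  <tr style="box-shadow: none;"><td colspan="3">Sponsored events</td></tr>
  <tr class="box"><td>Wed, 30 Sep 2020</td><td>FinCon</td><td>Long Beach</td></tr>
</tbody></table>
"""


def extract_event_rows(html):
    """Return the cell texts of every <tr> whose class list contains "box"."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # "tr.box" matches on the class only, so rows without that class
    # (such as the inline-styled ad row) are never visited.
    for tr in soup.select("tr.box"):
        cells = [td.get_text(strip=True, separator=" ") for td in tr.find_all("td")]
        rows.append(cells)
    return rows


events = extract_event_rows(SAMPLE_HTML)
for event in events:
    print(event)
```

If the real page uses a different class name or capitalisation, the selector would need to be adjusted to match.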
    
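
Instead of hard-coding the number of pages, one option is to keep requesting pages until one comes back without any event rows. This is only a sketch: whether the Ajax URL really returns a response without tr.box rows past the last page is an assumption that has not been checked against the site. The names scrape_all_pages, fetch_page, fake_fetch, and FAKE_PAGES are hypothetical, and the fake pages exist only so the loop can be run offline.

```python
from bs4 import BeautifulSoup


def scrape_all_pages(fetch_page, max_pages=50):
    """Collect rows from consecutive pages until a page has no tr.box rows.

    fetch_page is a hypothetical callable: page number -> HTML (str or bytes).
    max_pages is a safety limit so the loop always terminates.
    """
    all_rows = []
    for page in range(1, max_pages + 1):
        soup = BeautifulSoup(fetch_page(page), "html.parser")
        page_rows = [
            [td.get_text(strip=True, separator=" ") for td in tr.select("td")]
            for tr in soup.select("tr.box")
        ]
        if not page_rows:  # assumed end-of-results signal
            break
        all_rows.extend(page_rows)
    return all_rows


# Offline stand-in for the real endpoint, used only to demonstrate the loop.
FAKE_PAGES = {
    1: '<table><tr class="box"><td>Event A</td><td>Los Angeles</td></tr></table>',
    2: '<table><tr class="box"><td>Event B</td><td>Santa Monica</td></tr></table>',
}


def fake_fetch(page):
    """Return the fake HTML for a page, or an empty string past the last page."""
    return FAKE_PAGES.get(page, "")


rows = scrape_all_pages(fake_fetch)
print(rows)
```

To use it against the real site, fetch_page could wrap the requests.get call from the example above, passing the page number in params and returning the response's .content.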

    【Discussion】:

    • Thank you for the answer above. I wasn't aware of this, but your solution makes sense to me!