【Question Title】: beautifulsoup4 keeps printing None
【Posted】: 2026-02-22 13:20:15
【Question Description】:

HTML

<div class="secondary">
            <dl>
                <div><dt>Joined</dt><dd><span class="relative-date date" title="Nov 2, 2019 9:24 pm" data-time="1572701042645" data-format="medium">Nov 2, '19</span></dd></div>
                <div><dt>Last Post</dt><dd><span class="relative-date date" title="Nov 1, 2020 4:21 pm" data-time="1604218868661" data-format="medium">18 hours</span></dd></div>
                <div><dt>Seen</dt><dd><span class="relative-date date" title="Nov 2, 2020 10:38 am" data-time="1604284735243" data-format="medium">12 mins</span></dd></div>
                <div><dt>Views</dt><dd>546</dd></div>
<!---->                <div><dt class="trust-level">Trust Level</dt><dd class="trust-level">Member</dd></div>
<!---->                <div><dt class="groups">Groups</dt>
                <dd class="groups">
                    <span><a href="/g/Programmers" id="ember47" class="group-link ember-view">Programmers</a></span>
                    <span><a href="/g/Web_Developer" id="ember49" class="group-link ember-view">Web_Developer</a></span>

<a href="/g?username=OctaLua" id="ember50" class="ember-view">                    ...
</a>                </dd>
                </div>

<!---->            </dl>
            <span id="ember51" class="ember-view">  <div id="ember53" class="user-profile-secondary-outlet follow-statistics-user ember-view"><!----></div>
</span>
          </div>

So I'm trying to grab the "secondary" class with the Python BeautifulSoup4 library:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://devforum.roblox.com/u/octalua').content
soup = BeautifulSoup(page, 'html.parser')
content = soup.find('div', {'class': 'secondary'})

print(content)

However, whenever I print content it comes out as None, even though I've specified the class. The URL is in the Python code above if you want to check it yourself. Thanks.
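For what it's worth, `find()` returns `None` whenever no matching tag exists in the HTML that `requests` actually received; a minimal sketch of that behavior, using a stand-in byte string in place of the real response (the live page is rendered by JavaScript, so its server response is similarly empty):

```python
from bs4 import BeautifulSoup

# Stand-in for requests.get(...).content: the server response of an
# Ember app ships mostly empty, without the profile markup filled in.
page = b'<html><body><div id="ember-root"></div></body></html>'

soup = BeautifulSoup(page, 'html.parser')
content = soup.find('div', {'class': 'secondary'})

print(content)               # -> None: no matching tag in the HTML received
print(b'secondary' in page)  # -> False: the class never arrived
```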

【Question Discussion】:

    Tags: python html web beautifulsoup


    【Solution 1】:

    That part of the page is loaded dynamically, so you have to use Selenium to scrape it:

    from bs4 import BeautifulSoup
    from selenium import webdriver
    import time
    
    driver = webdriver.Chrome()
    driver.get('https://devforum.roblox.com/u/octalua')
    time.sleep(3)  # give the JavaScript-rendered content time to appear
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    
    content = soup.find('div', {'class': 'secondary'})
    
    print(content)
    
    driver.close()
    

    Output:

    <div class="secondary">
    <dl>
    <div><dt>Joined</dt><dd><span class="relative-date date" data-format="medium" data-time="1572701042645" title="Nov 2, 2019 6:54 pm">Nov 2, '19</span></dd></div>
    <div><dt>Last Post</dt><dd><span class="relative-date date" data-format="medium" data-time="1604218868661" title="Nov 1, 2020 1:51 pm">19 hours</span></dd></div>
    <div><dt>Seen</dt><dd><span class="relative-date date" data-format="medium" data-time="1604284735243" title="Nov 2, 2020 8:08 am">19 mins</span></dd></div>
    <div><dt>Views</dt><dd>550</dd></div>
    <!-- --> <div><dt class="trust-level">Trust Level</dt><dd class="trust-level">Member</dd></div>
    <!-- --> <div><dt class="groups">Groups</dt>
    <dd class="groups">
    <span><a class="group-link ember-view" href="/g/Programmers" id="ember47">Programmers</a></span>
    <span><a class="group-link ember-view" href="/g/Web_Developer" id="ember49">Web_Developer</a></span>
    <a class="ember-view" href="/g?username=OctaLua" id="ember50">                    ...
    </a> </dd>
    </div>
    <!-- --> </dl>
    <span class="ember-view" id="ember51"> <div class="user-profile-secondary-outlet follow-statistics-user ember-view" id="ember53"><!-- --></div>
    </span>
    </div>
    
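    Once the rendered div is in hand, its `dt`/`dd` rows can be collected into a dict; a sketch against a trimmed copy of the markup above:

```python
from bs4 import BeautifulSoup

# A trimmed copy of the <div class="secondary"> markup from the output above.
html = '''
<div class="secondary">
<dl>
<div><dt>Joined</dt><dd><span title="Nov 2, 2019 6:54 pm">Nov 2, '19</span></dd></div>
<div><dt>Views</dt><dd>550</dd></div>
<div><dt class="trust-level">Trust Level</dt><dd class="trust-level">Member</dd></div>
</dl>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
secondary = soup.find('div', {'class': 'secondary'})

# Each direct <div> child of the <dl> holds one <dt> label and one <dd> value.
stats = {row.dt.get_text(strip=True): row.dd.get_text(strip=True)
         for row in secondary.dl.find_all('div', recursive=False)}
print(stats)
# -> {'Joined': "Nov 2, '19", 'Views': '550', 'Trust Level': 'Member'}
```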

    Edit:

    You can also get the same information from the site's json endpoint. Code below:

    import requests
    import pandas as pd
    
    dictt = requests.get('https://devforum.roblox.com/u/octalua/summary.json').json()
    
    lst = dictt['topics']
    
    final = {}
    
    needed_keys = ["id","posts_count","reply_count","last_posted_at"]
    
    for dictionary in lst:
        for key in needed_keys:
            # append the value when present, NaN otherwise,
            # so every column stays the same length
            final.setdefault(key, []).append(dictionary.get(key, float("nan")))
    
    df = pd.DataFrame(final,index=final['id'])
    df = df.drop('id', axis = 1)
    print(df)
    

    Output:

            posts_count  reply_count            last_posted_at
    777375            5            1  2020-09-19T10:09:30.064Z
    571759            9            6  2020-05-14T12:15:38.374Z
    626599            9            4  2020-06-15T17:24:31.469Z
    610010            4            0  2020-06-04T07:24:15.153Z
    593138            2            1  2020-06-01T12:01:21.984Z
    548304            4            0  2020-04-29T14:11:44.803Z
    830091            2            0  2020-10-21T04:27:50.161Z
    606410           25           23  2020-08-14T22:22:59.322Z
    612874            7            4  2020-08-29T05:48:49.863Z
    841094           11            5  2020-10-28T12:55:10.337Z
    841110            7            4  2020-10-29T17:25:40.995Z
    419774         4813         1983  2020-11-02T04:31:40.577Z
    607078           10            6  2020-06-03T14:35:40.271Z
    831553           11            6  2020-10-22T16:07:17.877Z
    
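    As an aside, pandas can build the same table without the manual loop: `DataFrame` accepts a list of dicts directly and fills missing keys with NaN. A sketch on a two-row stand-in for `dictt['topics']`:

```python
import pandas as pd

# Stand-in for dictt['topics'] (the real list comes from summary.json);
# the extra "title" key shows that unrequested columns are simply dropped.
lst = [
    {"id": 777375, "posts_count": 5, "reply_count": 1,
     "last_posted_at": "2020-09-19T10:09:30.064Z", "title": "ignored"},
    {"id": 571759, "posts_count": 9, "reply_count": 6,
     "last_posted_at": "2020-05-14T12:15:38.374Z"},
]

# One line replaces the setdefault/append loop and the index handling.
df = pd.DataFrame(lst).set_index("id")[["posts_count", "reply_count", "last_posted_at"]]
print(df)
```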

    【Discussion】:

    • You have to download selenium and its webdriver*
    • Yes. I assumed the OP has selenium and chromedriver installed, since he has asked a lot of selenium-related questions
    • What does "loaded dynamically" mean?
    • Not sure if this is how it's supposed to work, but I checked the network tab and found devforum.roblox.com/u/octalua/summary.json, so I figured I could send a GET request to it and then convert it to html
    • Posts count, reply count, last posted at. I printed the output and it does contain those fields, but it's too long, so I can't post it.
    【Solution 2】:

    This should work, since that part of the page is loaded dynamically:

    driver = webdriver.Chrome()
    driver.get('https://devforum.roblox.com/u/octalua')
    elem = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "secondary")))
    print(elem.text)  # or elem.get_attribute('outerHTML') for the raw markup
    

    Imports

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait 
    from selenium.webdriver.support import expected_conditions as EC
    from selenium import webdriver
    

    Output

    Joined
    Nov 2, '19
    Last Post
    19 hours
    Seen
    26 mins
    Views
    556
    Trust Level
    Member
    Groups
    Programmers Web_Developer ...
    
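    The labels and values in `elem.text` alternate line by line, so they zip straight into a dict; a sketch using a stand-in string for the output above:

```python
# `text` stands in for the WebElement's .text shown above.
text = """Joined
Nov 2, '19
Last Post
19 hours
Views
556
Trust Level
Member"""

lines = text.splitlines()
# Even lines are labels, odd lines are values.
stats = dict(zip(lines[::2], lines[1::2]))
print(stats)
# -> {'Joined': "Nov 2, '19", 'Last Post': '19 hours', 'Views': '556', 'Trust Level': 'Member'}
```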

    【Discussion】: