Can't access HTML child tags with BeautifulSoup

【Question title】: Can't access HTML child tags with BeautifulSoup
【Posted】: 2021-08-02 05:31:20
【Question description】:

I'm trying to access article metadata from the CNN website. Their "top stories" section sits under a tag that begins like this:

<section class="zn zn-homepage1-zone-1....

Within that section, each article sits inside a tag like this:

<article class="cd cd--card cd--article....

On similar sites, I can access the "top stories" with the following approach:

cnnUrl = "https://www.cnn.com"
cnnSoup = BeautifulSoup(requests.get(cnnUrl, headers=headers).content, "html.parser")

homepageZone1 = '[class*="zn zn-homepage1-zone-1"]'

for item in cnnSoup.select(homepageZone1):

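...the for loop gives me access to the child tags, where I can collect the data I need. Once I have item, I can usually do something like this to get the text of CNN's top headline (the format changes from time to time):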

headline = item.find('h2').get_text()

Where the headline currently sits (as of now):

The nation's petri dish

However, in this case I get None for the homepageZone1 tag. I tried falling back to the parent div of homepageZone1:

cnnEverything = '[class*="pg-no-rail pg-wrapper"]'

for item in cnnSoup.select(cnnEverything):

Here item gives me the following child tags, but none of them actually contain child tags I can access:

<div class="pg-no-rail pg-wrapper"><div class="pg__background__image_wrapper"></div><div class="l-container"></div><section class="zn--idx-0 zn-empty"> </section><section class="zn--idx-1 zn-empty"> </section><section class="zn--idx-2 zn-empty"> </section><section class="zn--idx-3 zn-empty"> </section><section class="zn--idx-4 zn-empty"> </section><section class="zn--idx-5 zn-empty"> </section><section class="zn--idx-6 zn-empty"> </section><section class="zn--idx-7 zn-empty"> </section><section class="zn--idx-8 zn-empty"> </section><section class="zn--idx-9 zn-empty"> </section><section class="zn--idx-10 zn-empty"> </section><div class="ad ad--epic ad--all t-dark"><div class="ad-ad_bnr_btf_02 ad-refresh-adbody" data-ad-id="ad_bnr_btf_02" id="ad_bnr_btf_02"></div></div></div>

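What am I missing?

【Question discussion】:

What is the expected output? Can you include it in your post?
@sushanth Updated.
Are you sure an element with the specified class exists in the returned HTML?
@AndyKnight Interestingly, it doesn't. But if I inspect the headline with my browser's web inspector, I do see that class. Why is that?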

Tags: python html web-scraping beautifulsoup

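【Solution 1】:

I think the HTML you need is fetched in a separate request and then added to the main HTML with JavaScript, which is why you can't see it: requests only returns the initial HTML and does not run any JavaScript, whereas the browser's web inspector shows the page after the scripts have run.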
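One way to confirm this diagnosis is to count selector matches in the raw HTML that requests returns. The sketch below is only an illustration: STATIC_HTML is a shortened sample modeled on the empty placeholder sections quoted in the question, and count_matches is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Shortened sample modeled on the empty placeholders quoted in the question.
# In practice this would be requests.get(cnnUrl, headers=headers).content.
STATIC_HTML = (
    '<div class="pg-no-rail pg-wrapper">'
    '<section class="zn--idx-0 zn-empty"> </section>'
    '<section class="zn--idx-1 zn-empty"> </section>'
    '</div>'
)


def count_matches(html, selector):
    """Return how many elements in the given HTML match the CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select(selector))


# The zone class is absent from the static HTML; the wrapper class is present.
zone_hits = count_matches(STATIC_HTML, '[class*="zn-homepage1-zone-1"]')
wrapper_hits = count_matches(STATIC_HTML, '[class*="pg-no-rail pg-wrapper"]')
print(zone_hits, wrapper_hits)
```

If the selector matches nothing in the raw response but the class is visible in the browser's inspector, the content is most likely injected after page load.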

The following shows how to request the international edition, whose HTML is embedded in the returned JSON:

from bs4 import BeautifulSoup
import requests

# International version
r = requests.get("https://edition.cnn.com/data/ocs/section/index.html:intl_homepage1-zone-1/views/zones/common/zone-manager.izl")
json_data = r.json()
html = json_data['html'].replace(r'\"', '"')
cnnSoup = BeautifulSoup(html, 'html.parser')

for heading in cnnSoup.find_all(['h2', 'h3']):
    print(heading.text)

This gives you the following headlines:

Kandahar falls to Taliban
Militants take control of Afghanistan's second-largest city during an unrelenting sweep of the country, weeks before US troops are due to complete withdrawal
LIVE: UK defense chief worried about potential return of al Qaeda
Video allegedly shows Taliban celebrating after Kandahar gain
Afghanistan's quick unraveling threatens to stain Biden's legacy
...

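The URL was found by looking at the requests the browser makes when the page loads (for example, in the Network tab of the browser's developer tools).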
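Since the endpoint may change or disappear over time, it can help to keep the parsing step separate from the network call so it can be checked offline. The following is a minimal sketch under the assumption that the response is a JSON object with an 'html' key, as in the code above; headings_from_payload and sample_payload are hypothetical names and data used only for illustration.

```python
from bs4 import BeautifulSoup


def headings_from_payload(payload, tags=("h2", "h3")):
    """Parse the HTML fragment under payload['html'] and return heading texts."""
    soup = BeautifulSoup(payload["html"], "html.parser")
    return [h.get_text(strip=True) for h in soup.find_all(list(tags))]


# Hypothetical payload standing in for r.json(); the real response may differ.
sample_payload = {
    "html": (
        "<article><h2>Example headline</h2>"
        '<h3 class="cd__headline">Second story</h3></article>'
    )
}

print(headings_from_payload(sample_payload))
```

With a live response, the same function could be called as headings_from_payload(r.json()).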

【Discussion】: