Accessing nested elements with BeautifulSoup

【Question title】: Accessing nested elements with beautifulsoup
【Posted】: 2017-10-15 20:16:25
【Question】:

I have the following HTML:

<div id="contentDiv">
    <!-- START FILER DIV -->
    <div style="margin: 15px 0 10px 0; padding: 3px; overflow: hidden; background-color: #BCD6F8;">
    <div class="mailer">Mailing Address
        <span class="mailerAddress">500 ORACLE PARKWAY</span>
        <span class="mailerAddress">MAIL STOP 5 OP 7</span>
        <span class="mailerAddress">REDWOOD CITY CA 94065</span>
     </div>

I am trying to access "500 ORACLE PARKWAY" and "MAIL STOP 5 OP 7", but I can't find a way to do it. My attempt looks like this:

for item in soup.findAll("span", {"class" : "mailerAddress"}):
    if item.parent.name == 'div':
        return_list.append(item.contents)

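Edit: I forgot to mention that elements later in the HTML use the same tags and class, so my code captures all of them when I only want the first two.

Edit: link: https://www.sec.gov/cgi-bin/browse-edgar?CIK=orcl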

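【Question discussion】:

What kind of error are you getting? I tried your code and it does retrieve the text in each span element.
Can you post a link to the HTML?
Why are you trying to parse HTML when there is a perfectly good XML document on the page you linked: sec.gov/cgi-bin/…. Beautiful Soup should always be a last resort.
Unfortunately, I can only use this HTML and not the XML, haha.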

Tags: python html beautifulsoup html-parsing

【Solution 1】:

Try this:

from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.sec.gov/cgi-bin/browse-edgar?CIK=orcl").text
soup = BeautifulSoup(res,'lxml')
for item in soup.find_all(class_="mailerAddress")[:2]:
    print(item.text)

Result:

500 ORACLE PARKWAY
MAIL STOP 5 OP 7
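
Since the question mentions that later elements reuse the same class, one alternative is to scope the search to the first "mailer" block rather than slicing a page-wide result. The following is only a sketch: SAMPLE_HTML is a hypothetical stand-in modeled on the question's snippet (the second block is invented for illustration), first_mailer_lines is a made-up helper name, and html.parser is used instead of lxml so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modeled on the question's snippet; the second
# "mailer" block is invented to show a later element reusing the class.
SAMPLE_HTML = """
<div id="contentDiv">
  <div class="mailer">Mailing Address
    <span class="mailerAddress">500 ORACLE PARKWAY</span>
    <span class="mailerAddress">MAIL STOP 5 OP 7</span>
    <span class="mailerAddress">REDWOOD CITY CA 94065</span>
  </div>
  <div class="mailer">Business Address
    <span class="mailerAddress">500 ORACLE PARKWAY</span>
  </div>
</div>
"""


def first_mailer_lines(html, count=2):
    """Return the text of the first `count` address spans in the first mailer div."""
    soup = BeautifulSoup(html, "html.parser")
    # find() returns only the first matching element, or None if there is none.
    first_block = soup.find("div", class_="mailer")
    if first_block is None:
        return []
    # Search only inside that block and stop after `count` matches.
    spans = first_block.find_all("span", class_="mailerAddress", limit=count)
    return [span.get_text(strip=True) for span in spans]


print(first_mailer_lines(SAMPLE_HTML))
```

Scoping to the parent block means the result does not depend on how many similar spans appear further down the page. Whether the real page nests its elements this way is an assumption based on the snippet in the question.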

【Discussion】:

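【Solution 2】:

I will try to answer this with the little information we have. If you only want the first two elements of a given class on the page, you can use slicing.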

soup.findAll("span", {"class" : "mailerAddress"})[0:2]
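
To show the slice in context, here is a small self-contained sketch. SAMPLE_HTML is a hypothetical stand-in modeled on the question's snippet, and html.parser is used so no extra parser is required. It also shows find_all's limit argument, which should give the same two elements without collecting every match first.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modeled on the question's snippet.
SAMPLE_HTML = """
<div class="mailer">Mailing Address
  <span class="mailerAddress">500 ORACLE PARKWAY</span>
  <span class="mailerAddress">MAIL STOP 5 OP 7</span>
  <span class="mailerAddress">REDWOOD CITY CA 94065</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Slicing: collect every match, then keep the first two.
sliced = soup.find_all("span", {"class": "mailerAddress"})[0:2]

# limit=2: stop searching once two matches have been found.
limited = soup.find_all("span", {"class": "mailerAddress"}, limit=2)

sliced_text = [span.get_text(strip=True) for span in sliced]
limited_text = [span.get_text(strip=True) for span in limited]
print(sliced_text)
print(limited_text)
```

findAll is the older spelling of find_all; both take the same arguments.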

【Discussion】: