How can I iterate through tags using Python? Answers

【Question title】: How can I iterate through tags using Python?
【Posted】: 2026-01-03 15:00:01
【Question description】:

I want to iterate through some HTML and store the data in a dictionary. Each iteration starts with:

<h1 class="docDisplay" id="docTitle">

I have the following code:

from BeautifulSoup import BeautifulSoup

html = '<html><body><h1 class="docDisplay" id="docTitle">Data1</h1><p>other data<\p><h1 class="docDisplay" id="docTitle">Data2</h1><p>other data2<\p></html>'

soup = BeautifulSoup(html)
newdoc = soup.find('h1', id="docTitle")
title = newdoc.findNext(text=True)
data = title.findAllNext('p',text=True)
data_dict = {}
data_dict[title] = {'data': data}
print data_dict

Right now, the output is:

{u'Data1': {'data': [u'other data<\\p>', u'Data2', u'other data2<\\p>']}}

I would like the output to be:

{u'Data1': {'data': [u'other data<\\p>']}, u'Data2': {'data': [u'other data2<\\p>']}}

I don't know how to start over once I reach a new h1 tag. Any ideas?

【Question discussion】:

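You are not closing your <p> tags correctly. You also never close your <body> tag.
I think that is why he is using BeautifulSoup.
Are you trying to build a dictionary that maps heading names to the paragraphs under each heading? If so, you will run into the problem that all of the <p> tags are siblings, so there is no good way to write something like "find the <p> tags that follow the <h1> tag inside this <div>". You will probably have to iterate over the children of the <body> tag, keep track of when you encounter headings and paragraphs, and update your dictionary accordingly.
@Josh Rosen: I see what you are saying. When I use the findAllNext command, I get the contents of all the p tags, because they are siblings. I am not sure what you mean by "iterate over the children of the <body> tag, keep track of when you encounter headings and paragraphs, and update your dictionary accordingly". Could you elaborate?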
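
The bookkeeping described in the comment above can be sketched without any parser at all. This is only a minimal illustration, assuming the sibling tags have already been reduced to (tag name, text) pairs in document order. The function name `group_under_headings` and the sample list `siblings` are hypothetical and do not come from the thread.

```python
def group_under_headings(nodes):
    """Group paragraph text under the most recently seen heading.

    `nodes` is an iterable of (tag_name, text) pairs in document order.
    This input shape is an assumption made for the illustration.
    """
    grouped = {}
    current = None  # list that collects paragraphs for the current heading
    for tag_name, text in nodes:
        if tag_name == "h1":
            # A new heading starts a fresh list; the dict holds a reference
            # to it, so later appends show up in the dict as well.
            current = []
            grouped[text] = {"data": current}
        elif tag_name == "p" and current is not None:
            current.append(text)
    return grouped


siblings = [
    ("h1", "Data1"),
    ("p", "other data"),
    ("h1", "Data2"),
    ("p", "other data2"),
]
print(group_under_headings(siblings))
# {'Data1': {'data': ['other data']}, 'Data2': {'data': ['other data2']}}
```

Paragraphs that appear before any heading are skipped here, which is a design choice of this sketch rather than anything the thread prescribes.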

Tags: python tags beautifulsoup loops

【Solution 1】:

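To match up the paragraph text under each heading, I would try something like this (you may need to modify it depending on the exact output format you want):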

    from BeautifulSoup import BeautifulSoup

    html = """ 
    <html>
    <head>
    </head>

    <body>
      <h1 class="docDisplay" id="docTitle">Data1</h1>
      <p>other data</p>
      <p>Another paragraph under the first heading.</p>
      <h1 class="docDisplay" id="docTitle">Data2</h1>
      <p>other data2</p>
      <div><p>This paragraph is NOT a sibling of the header</p></div>
    </body>
    </html>
    """

    soup = BeautifulSoup(html)

    data_dict = {}
    stuff_under_current_heading = []

    firstHeader = soup.find('h1', id="docTitle")
    for tag in [firstHeader] + firstHeader.findNextSiblings():
        if tag.name == 'h1':
            stuff_under_current_heading = []
            # I chose to strip excess whitespace from the header name:
            data_dict[tag.string.strip()] = {'data': stuff_under_current_heading}
            # Modifying the list modifies the value in the dictionary.
        # Take every <p> tag encountered between here and the next heading
        # and associate it with the most recently-seen <h1> tag.
        elif tag.name == 'p':
            stuff_under_current_heading.append(tag.string)
        # Include <p> tags that are not siblings of the <h1> tag but
        # are still part of the content under the header.
        else:
            stuff_under_current_heading.extend(tag.findAll('p', text=True))

    print data_dict

This outputs:

{u'Data1': {'data': [u'other data', u'Another paragraph under the first heading.']},   
 u'Data2': {'data': [u'other data2', u'This paragraph is NOT a sibling of the header']}}
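
For readers on Python 3 who do not have BeautifulSoup installed, the same grouping idea can be sketched with the standard library's `html.parser`. This is not the answer's method: it swaps the sibling walk for an event-driven parser. The class `HeadingGrouper` and the helper `group_by_heading` are hypothetical names introduced here. The sketch assumes that an unclosed `<p>` ends where the next `<h1>` or `<p>` begins, and it is run on the malformed HTML from the question.

```python
from html.parser import HTMLParser


class HeadingGrouper(HTMLParser):
    """Collect the text of each <p> under the most recent <h1>."""

    def __init__(self):
        super().__init__()
        self.data_dict = {}
        self._mode = None        # "h1", "p", or None when outside both
        self._buffer = []        # text chunks of the element being read
        self._paragraphs = None  # list owned by the current heading

    def flush(self):
        """Store whatever has been buffered for the open <h1> or <p>."""
        text = "".join(self._buffer).strip()
        self._buffer = []
        if self._mode == "h1":
            self._paragraphs = []
            self.data_dict[text] = {"data": self._paragraphs}
        elif self._mode == "p" and self._paragraphs is not None and text:
            self._paragraphs.append(text)
        self._mode = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self.flush()  # assumption: a new block ends an unclosed <p>
            self._mode = tag

    def handle_endtag(self, tag):
        if tag == self._mode:
            self.flush()

    def handle_data(self, data):
        if self._mode is not None:
            self._buffer.append(data)


def group_by_heading(markup):
    parser = HeadingGrouper()
    parser.feed(markup)
    parser.close()
    parser.flush()  # store a paragraph still open at the end of input
    return parser.data_dict


# Raw string so that the stray "<\p>" from the question is kept verbatim.
html = (r'<html><body><h1 class="docDisplay" id="docTitle">Data1</h1>'
        r'<p>other data<\p><h1 class="docDisplay" id="docTitle">Data2</h1>'
        r'<p>other data2<\p></html>')
print(group_by_heading(html))
# {'Data1': {'data': ['other data<\\p>']}, 'Data2': {'data': ['other data2<\\p>']}}
```

Because the sketch reacts to start tags wherever they occur, a `<p>` nested inside a `<div>` is also attributed to the most recent heading, which corresponds to the `else` branch in the answer's code.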

【Discussion】:

Thanks, Josh. This is perfect.

【Solution 2】:

@samplebias: @Lynch is right. If the OP does not close his/her tags properly, they simply cannot expect the parser to read their mind.

Try fixing your HTML and it will probably work. =)

【Discussion】:

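Actually, <p> tags do not need to be closed in HTML, and BeautifulSoup will close them automatically.
I know. My point is that if your HTML markup is not structured properly, you will not get the results you want! The program cannot read your mind!
See the answers to this question: *.com/questions/1261104/…. Many pages do not close their paragraph tags properly. The OP may have no control over the HTML of the pages they are parsing; it may come from a third party.
I know that a lot of pages do not close their tags, and I realize I may have been a bit unclear. What I meant is that the OP should not rely on the program to do the guessing, not that the program will not guess for him/her. I do agree that you are right about the lack of control, though. I had not considered that. =)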
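
To see why `<\p>` ended up inside the extracted text in the question, it can help to look at how a tokenizer treats that sequence. The sketch below uses Python 3's standard `html.parser` rather than BeautifulSoup, whose behavior may differ in detail, and simply logs the events it receives. `EventLogger` is a hypothetical helper name. With this parser, `<\p>` is reported as plain text rather than as an end tag, so the first paragraph is never explicitly closed.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record start tags, end tags and text as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.starts = []
        self.ends = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.starts.append(tag)

    def handle_endtag(self, tag):
        self.ends.append(tag)

    def handle_data(self, data):
        self.text.append(data)


logger = EventLogger()
# Raw string: the first paragraph is "closed" with <\p>, the second with </p>.
logger.feed(r'<p>a<\p><p>b</p>')
logger.close()

print(logger.starts)         # ['p', 'p']
print(logger.ends)           # ['p']
print("".join(logger.text))  # a<\p>b
```

Only one end tag is seen for two start tags, which is the situation the comments above are arguing about: the parser has to decide on its own where the first paragraph stops.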