Beautiful Soup: Extract everything between two tags (answers)

【Question title】: Beautiful soup: Extract everything between two tags
【Posted】: 2020-09-15 14:05:35
【Question description】:

I am using BeautifulSoup to extract data from HTML files. I want to get all of the information between two tags. That is, if I have an HTML section like this:

<h1></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1></h1>

and I want everything between the first h1 and the second h1, the output should look like this:

Text <i>here</i> has no tag
<div>This is in a div</div>

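I tried a next_sibling loop, but there always seems to be a problem. Is there a command in BeautifulSoup that simply extracts everything (text, line breaks, divs, special characters) between element "A" and element "B"?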

【Question discussion】:

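Needs more snippets! Seriously though, when you ask a question you need to post code so that we can offer guidance.
You're right. I'm on my phone and have no internet access on my computer. I'm near the fires in Oregon, so everything is a mess. I just wanted to know whether BeautifulSoup has a command for this, or whether I should stick with bash and pcregrep.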

Tags: python html beautifulsoup

【Solution 1】:

One solution is to .extract() everything before the first <h1> and after the second <h1> tag:

from bs4 import BeautifulSoup


html_doc = '''
This I <b>don't</b> want
<h1></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1></h1>
This I <b>don't</b> want too
'''

soup = BeautifulSoup(html_doc, 'html.parser')

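# Iterate over a copy of the top-level nodes, because extract() mutates soup.contents.
for c in list(soup.contents):
    # Keep the first <h1> and every node whose nearest preceding <h1> is the first one.
    if c is soup.h1 or c.find_previous('h1') is soup.h1:
        continue
    c.extract()

# Finally, remove the <h1> markers themselves.
for h1 in soup.select('h1'):
    h1.extract()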

print(soup)

Prints:

Text <i>here</i> has no tag
<div>This is in a div</div>
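
A non-destructive variant, offered only as a sketch and not part of the original answer: instead of removing nodes from the soup, walk the siblings that follow the first <h1> and stop at the second. The helper name `between` is hypothetical, and the sketch assumes both markers are siblings under the same parent. Note that str() on a text node returns its raw text, so special characters in text are not re-escaped.

```python
from bs4 import BeautifulSoup

html_doc = '''
This I <b>don't</b> want
<h1></h1>
Text <i>here</i> has no tag
<div>This is in a div</div>
<h1></h1>
This I <b>don't</b> want too
'''


def between(start, end):
    """Join the string form of every sibling strictly between start and end.

    Hypothetical helper: assumes start and end share the same parent.
    """
    parts = []
    for node in start.next_siblings:
        if node is end:  # identity check: the two empty <h1> tags compare equal with ==
            break
        parts.append(str(node))  # tags serialize as markup, text nodes as their text
    return ''.join(parts)


soup = BeautifulSoup(html_doc, 'html.parser')
first, second = soup.find_all('h1')[:2]
result = between(first, second)
print(result)
```

Because nothing is extracted, the soup stays intact and the same helper can be called again with a different pair of markers.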

【Discussion】:

Thank you very much. Have a nice day.

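【Solution 2】:

Here is how: you can simply target their parent, or wrap them in a container, and then extract all children of the parent you are targeting, like this: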

from bs4 import BeautifulSoup
content = """
<div class="container">
    <h1></h1>
        Text <i>here</i> has no tag
        <div>This is in a div</div>
    <h1></h1>
</div>
"""
soup = BeautifulSoup(content, 'html.parser')
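# find_all() (the current name for findChildren()) returns every descendant tag
# of the container; bare text nodes are not included.
results = soup.find('div').find_all()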
print(results)

Or

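# find_all_next() (the current name for findAllNext()) returns every tag that
# follows the first <h1> in document order; text nodes are excluded.
print(soup.find('h1').find_all_next())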
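
Both calls above return only tags, so the bare text "Text ... has no tag" is lost and the <h1> markers are still in the result. A possible adjustment to the container approach, sketched here under the assumption that the container holds nothing but the two <h1> markers and the wanted content: iterate over `.children`, which also yields text nodes, and skip the <h1> tags. The variable names `container` and `inner` are illustrative only.

```python
from bs4 import BeautifulSoup

content = """
<div class="container">
    <h1></h1>
        Text <i>here</i> has no tag
        <div>This is in a div</div>
    <h1></h1>
</div>
"""

soup = BeautifulSoup(content, 'html.parser')
container = soup.find('div', class_='container')

# .children yields direct children only, text nodes included.
# getattr() with a default covers text nodes, which are not named 'h1'.
inner = ''.join(
    str(child)
    for child in container.children
    if getattr(child, 'name', None) != 'h1'
)
print(inner.strip())
```

The original indentation and line breaks inside the container are preserved in `inner`; only the leading and trailing whitespace is stripped when printing.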

【Discussion】:

Thanks a lot for your help. As great as Beautifulsoup sounds, I think I'm more comfortable with bash scripts for now.
Glad to help, @MrNo