Replacing line breaks with <br> inside a tag using BeautifulSoup

【Question title】: Replacing line breaks with <br> inside a tag using BeautifulSoup
【Posted】: 2022-07-07 19:53:58
【Question】:

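I want to use BeautifulSoup to parse some HTML and replace any line breaks (\n) inside <blockquote> tags with <br> tags. This is made harder by the fact that the <blockquote> may contain other HTML tags.

My current attempt: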

from bs4 import BeautifulSoup

html = """
<p>Hello
there</p>
<blockquote>Line 1
Line 2
<strong>Line 3</strong>
Line 4</blockquote>
"""

soup = BeautifulSoup(html, "html.parser")

for element in soup.findAll():
    if element.name == "blockquote":
        new_content = BeautifulSoup(
            "<br>".join(element.get_text(strip=True).split("\n")).strip("<br>"),
            "html.parser",
        )
        element.string.replace_with(new_content)

print(str(soup))

The output should be:

<p>Hello
there</p>
<blockquote>Line 1<br/>Line 2<br/><strong>Line 3</strong><br/>Line 4</blockquote>

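However, this code, adapted from this answer, only works when there are no HTML tags inside the <blockquote>. If there are (as with Line 3), element.string is None and the code above fails.

Is there an alternative that can handle HTML tags?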
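
The failure described above can be reproduced in isolation. The following is a minimal sketch (the variable names `plain` and `mixed` are illustrative and not part of the original post) showing that `.string` is only available when a tag has a single child, and is `None` once the tag mixes text with other tags:

```python
from bs4 import BeautifulSoup

# A <blockquote> whose only child is text: .string returns that text.
plain = BeautifulSoup("<blockquote>Line 1\nLine 2</blockquote>", "html.parser")
plain_string = plain.blockquote.string

# A <blockquote> mixing text with a <strong> child: .string is None,
# so calling .replace_with() on it raises AttributeError.
mixed = BeautifulSoup(
    "<blockquote>Line 1\n<strong>Line 3</strong></blockquote>", "html.parser"
)
mixed_string = mixed.blockquote.string

print(plain_string is None)  # False
print(mixed_string is None)  # True
```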

Tags: python html web-scraping beautifulsoup

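【Solution 1】:

It is much simpler to select the elements more specifically and process each element itself as a string, using replace().

That way you do not have to worry about the other tags, which are present as objects and are not represented as strings in the result of get_text().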

new_content = BeautifulSoup(
    str(element).replace('\n','<br>'),
    "html.parser",
)
element.replace_with(new_content)

Example

from bs4 import BeautifulSoup

html = """
<p>Hello
there</p>
<blockquote>Line 1
Line 2
<strong>Line 3</strong>
Line 4</blockquote>
"""

soup = BeautifulSoup(html, "html.parser")

for element in soup.find_all('blockquote'):
    new_content = BeautifulSoup(
        str(element).replace('\n','<br>'),
        "html.parser",
    )
    element.replace_with(new_content)

print(str(soup))

Output

<p>Hello
there</p>
<blockquote>Line 1<br/>Line 2<br/><strong>Line 3</strong><br/>Line 4</blockquote>
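
If this is needed in more than one place, the same idea could be wrapped in a small helper. This is only a sketch: the function name `br_in_tags` and its `tag_name` parameter are assumptions made for illustration, not part of the answer. Note that the replacement runs on the serialized markup, so every newline inside the selected tag is replaced, including any inside attribute values.

```python
from bs4 import BeautifulSoup


def br_in_tags(html: str, tag_name: str = "blockquote") -> str:
    """Return html with newlines inside every <tag_name> replaced by <br>.

    Hypothetical helper; it applies the serialize-and-replace approach
    from this answer to each matching tag.
    """
    soup = BeautifulSoup(html, "html.parser")
    for element in soup.find_all(tag_name):
        # Serialize the tag, swap the newlines, and parse the result again.
        replaced = BeautifulSoup(
            str(element).replace("\n", "<br>"),
            "html.parser",
        )
        element.replace_with(replaced)
    return str(soup)


sample = "<p>a\nb</p><blockquote>x\n<b>y</b>\nz</blockquote>"
converted = br_in_tags(sample)
print(converted)
```

Text outside the selected tag, such as the newline inside the `<p>`, is left as it is.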

【Solution 2】:

Another approach is to use descendants to find the NavigableStrings and replace only those, leaving the other elements alone:

from bs4 import BeautifulSoup, NavigableString

html = """
<p>Hello
there</p>
<blockquote>Line 1
Line 2
<strong>Line 3</strong>
Line 4</blockquote>
"""

soup = BeautifulSoup(html, "html.parser")

for quote in soup.find_all("blockquote"):
    for element in list(quote.descendants):
        if isinstance(element, NavigableString):
            markup = element.string.replace("\n", "<br>")
            element.string.replace_with(BeautifulSoup(markup, "html.parser"))

print(str(soup))

Output:

<p>Hello
there</p>
<blockquote>Line 1<br/>Line 2<br/><strong>Line 3</strong><br/>Line 4</blockquote>
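
A possible variation on the approach above avoids parsing the text a second time. A NavigableString holds unescaped text, so passing it back through a parser would treat characters such as `<` as markup. The sketch below is an assumption-laden alternative, not the answer's own method: it splits each text node on "\n" and inserts real `<br>` tags created with `soup.new_tag()`. The names `text_nodes`, `parts`, and `result` are illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

html = "<blockquote>Line 1\nLine 2\n<strong>Line 3</strong>\nLine 4</blockquote>"
soup = BeautifulSoup(html, "html.parser")

for quote in soup.find_all("blockquote"):
    # Snapshot the text nodes first, because the tree is modified below.
    text_nodes = [n for n in quote.descendants if isinstance(n, NavigableString)]
    for node in text_nodes:
        parts = node.split("\n")
        if len(parts) == 1:
            continue  # this text node contains no newline
        for index, part in enumerate(parts):
            if index > 0:
                # Each newline becomes a freshly created <br> tag.
                node.insert_before(soup.new_tag("br"))
            if part:
                node.insert_before(NavigableString(part))
        # Remove the original text node now that its pieces are in place.
        node.extract()

result = str(soup)
print(result)
```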
