Remove all <u> and <a> tags from within all <div> tags of a class using BeautifulSoup or re: answers

【Question title】: Remove all the <u> and <a> tags from within all <div> tags of a class using BeautifulSoup or re
【Posted】: 2023-03-29 11:47:01
【Question description】:

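I am trying to remove the <u> and <a> tags from all DIV tags with the class "sf-item" in an HTML source, because they break up the text when I scrape it from a web URL.

(For this demo I have passed a sample HTML string to BeautifulSoup, but ideally the source would be a web URL.)

So far I have tried the re line below, but I am not sure how to add a condition in re so that it removes the substrings between <u> and </u> only inside DIV tags of class sf-item.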

data = re.sub('<u.*?u>', '', data)

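I also tried the lines below to remove all <u> and <a> tags from the whole source, but somehow that does not work. I am not sure how to target only the <u> and <a> tags inside DIV tags with class sf-item.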

for tag in soup.find_all('u'):
    tag.replaceWith('')

I would appreciate any help with this.

Below is a runnable sample Python script -

from re import sub
from bs4 import BeautifulSoup
import re

data = """
<div class="sf-item"> The rabbit got to the halfway point at   
<u><a href="https://DummyLocationURL/"> here </a></u> However, it couldn't see the turtle. 
</div>
<div class="sf">
<div class="sf-item sf-icon">
<span class="supporticon is"></span>
</div>
<div class="sf-item"> He was hot and tired and decided to stop and take a short nap. 
</div>
<div class="sf-item"> Even if the turtle passed him at 
<u><a href="https://DummyLocationURL/">Link</a></u>. he would be able to race to the finish line ahead of 
<u><a href="https://DummyLocationURL/">place</a></u>, he just kept going.
</div>
"""

# data = re.sub('<u.*?u>', '', data)  ## This works for this particular string but I cannot use on a web url
# It would solve if I can somehow specify to remove <u> and <a> only within DIV of class sf-item

soup = BeautifulSoup(data, "html.parser")

for tag in soup.find_all('u'):
    tag.replaceWith('')

fResult = []
rMessage=soup.findAll("div",{'class':"sf-item"})

for result in rMessage:
    fResult.append(sub("&ldquo;|.&rdquo;","","".join(result.contents[0:1]).strip()))

fResult = list(filter(None, fResult))
print(fResult)

The output I get from the code above is

['The rabbit got to the halfway point at', 'He was hot and tired and decided to stop and take a short nap.', 'Even if the turtle passed him at']

But I need the output to be as follows -

['The rabbit got to the halfway point at here However, it couldnt see the turtle.', 'He was hot and tired and decided to stop and take a short nap.', 'Even if the turtle passed him at Link. he would be able to race to the finish line ahead of place, he just kept going.']

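【Discussion】:

Tags: python python-3.x beautifulsoup re

【Solution 1】:

BeautifulSoup has a built-in way to get the text content of a tag, with all nested markup stripped (roughly the text a browser would display): the .text property, which is equivalent to calling get_text(). Running the following code, I got the expected output: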

import re
from bs4 import BeautifulSoup

data = """
<div class="sf-item"> The rabbit got to the halfway point at   
<u><a href="https://DummyLocationURL/"> here </a></u> However, it couldn't see the turtle. 
</div>
<div class="sf">
<div class="sf-item sf-icon">
<span class="supporticon is"></span>
</div>
<div class="sf-item"> He was hot and tired and decided to stop and take a short nap. 
</div>
<div class="sf-item"> Even if the turtle passed him at 
<u><a href="https://DummyLocationURL/">Link</a></u>. he would be able to race to the finish line ahead of 
<u><a href="https://DummyLocationURL/">place</a></u>, he just kept going.
</div>
"""

soup = BeautifulSoup(data, "html.parser")

rMessage = soup.find_all("div", {'class': "sf-item"})

fResult = []

for result in rMessage:
    fResult.append(result.text.replace('\n', ''))

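This gives you the right text, but with some extra spaces. If you want to collapse each run of spaces into a single space, you can post-process fResult like this: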

fResult = [re.sub(' +', ' ', result) for result in fResult]
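As a possible variation on the above, here is a minimal sketch that also unwraps the <u> and <a> tags inside the sf-item divs (which is what the question title literally asks for), strips leading and trailing whitespace, and skips divs that end up with no text, such as the "sf-item sf-icon" one. The function name sf_item_texts and the shortened sample string are assumptions made for illustration; they are not part of the original question or answer.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample modelled on the HTML in the question.
sample = """
<div class="sf-item"> Even if the turtle passed him at 
<u><a href="https://DummyLocationURL/">Link</a></u>. he would be able to race ahead of 
<u><a href="https://DummyLocationURL/">place</a></u>, he just kept going.
</div>
<div class="sf-item sf-icon"><span class="supporticon is"></span></div>
"""


def sf_item_texts(html):
    """Return the cleaned text of every <div> whose classes include sf-item."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    # class_ matches any div that has sf-item among its classes.
    for div in soup.find_all("div", class_="sf-item"):
        # unwrap() removes the tag itself but keeps its children in place,
        # so the link text stays part of the surrounding sentence.
        for tag in div.find_all(["u", "a"]):
            tag.unwrap()
        # split() + join collapses all whitespace runs (spaces and newlines)
        # and drops leading/trailing whitespace.
        text = " ".join(div.get_text().split())
        if text:  # skip divs with no text, e.g. the icon-only div
            texts.append(text)
    return texts


print(sf_item_texts(sample))
```

The unwrap() step is not strictly needed when only the text is wanted, since get_text() already ignores markup; it matters if the cleaned HTML itself is used afterwards.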

【Discussion】: