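[Posted]: 2023-03-29 11:47:01
[Question]:
I am trying to remove the <u> and <a> tags from every DIV tag with the class "sf-item" in an HTML source, because they break up the text when scraping from a web URL.
(For this demo I have passed a sample HTML string to BeautifulSoup, but ideally the source would be a web URL.)
So far I have tried using re with the line below, but I am not sure how to express a condition in re so that the substring between <u and /u> is removed only inside DIV tags with the class sf-item: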
data = re.sub('<u.*?u>', '', data)
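A rough sketch of how the regex could be scoped to the sf-item DIVs, using only the standard library. Note that the desired output keeps the link text, so this strips the tags themselves rather than everything between them. The helper name `strip_links_in_sf_items` and the sample string are illustrative, and the sketch assumes the class attribute is exactly "sf-item" and that these DIVs contain no nested DIV. Regular expressions are fragile on real-world HTML, so this is probably only suitable for very regular markup.

```python
import re

# Assumption: each target block looks exactly like <div class="sf-item">...</div>
# and contains no nested <div>; otherwise the non-greedy match ends too early.
DIV_RE = re.compile(r'(<div class="sf-item">)(.*?)(</div>)', re.DOTALL)
# Matches opening and closing <u> / <a> tags, but not <ul>, <abbr>, etc.
TAG_RE = re.compile(r'</?(?:u|a)\b[^>]*>')

def strip_links_in_sf_items(html):
    """Hypothetical helper: drop <u>/<a> tags (keeping their text) inside sf-item divs only."""
    def clean(match):
        opening, body, closing = match.groups()
        return opening + TAG_RE.sub('', body) + closing
    return DIV_RE.sub(clean, html)

sample = (
    '<div class="sf-item">at <u><a href="https://DummyLocationURL/">here</a></u> now</div>'
    '<p><u>untouched</u></p>'
)
cleaned = strip_links_in_sf_items(sample)
print(cleaned)
```

The <u> outside the sf-item DIV is left alone, which is the condition the plain `re.sub('<u.*?u>', '', data)` call cannot express.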
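I also tried the lines below to remove all <u> and <a> tags from the entire source, but somehow it does not work. I am unsure how to target only the <u> and <a> tags inside DIV tags with the class sf-item.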
for tag in soup.find_all('u'):
    tag.replaceWith('')
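One possible way to limit the change to the sf-item DIVs is sketched below. It uses the CSS selector `div.sf-item` with `select()` and then calls `unwrap()`, which removes a tag but leaves its children in place, whereas `replaceWith('')` discards the tag together with its text. The short `html` string is an illustrative stand-in for the real page.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="sf-item">at <u><a href="https://DummyLocationURL/">here</a></u> now</div>'
    '<p><u>outside</u></p>'
)
soup = BeautifulSoup(html, "html.parser")

# "div.sf-item" matches any <div> whose class list contains "sf-item"
for div in soup.select("div.sf-item"):
    # find_all returns a materialized list, so modifying the tree in the loop is safe
    for tag in div.find_all(["u", "a"]):
        tag.unwrap()  # drop the tag itself, keep its contents where they were

result = str(soup)
print(result)
```

The <u> inside the <p> is untouched because the inner loop only searches within each matched DIV.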
Any help achieving this would be appreciated.
Below is runnable sample Python code:
from re import sub
from bs4 import BeautifulSoup
import re
data = """
<div class="sf-item"> The rabbit got to the halfway point at
<u><a href="https://DummyLocationURL/"> here </a></u> However, it couldn't see the turtle.
</div>
<div class="sf">
<div class="sf-item sf-icon">
<span class="supporticon is"></span>
</div>
<div class="sf-item"> He was hot and tired and decided to stop and take a short nap.
</div>
<div class="sf-item"> Even if the turtle passed him at
<u><a href="https://DummyLocationURL/">Link</a></u>. he would be able to race to the finish line ahead of
<u><a href="https://DummyLocationURL/">place</a></u>, he just kept going.
</div>
"""
# data = re.sub('<u.*?u>', '', data) ## This works for this particular string but I cannot use on a web url
# It would solve if I can somehow specify to remove <u> and <a> only within DIV of class sf-item
soup = BeautifulSoup(data, "html.parser")
for tag in soup.find_all('u'):
    tag.replaceWith('')
fResult = []
rMessage=soup.findAll("div",{'class':"sf-item"})
for result in rMessage:
    fResult.append(sub("“|.”","","".join(result.contents[0:1]).strip()))
fResult = list(filter(None, fResult))
print(fResult)
The output I get from the code above is:
['The rabbit got to the halfway point at', 'He was hot and tired and decided to stop and take a short nap.', 'Even if the turtle passed him at']
But I need the following output:
['The rabbit got to the halfway point at here However, it couldnt see the turtle.', 'He was hot and tired and decided to stop and take a short nap.', 'Even if the turtle passed him at Link. he would be able to race to the finish line ahead of place, he just kept going.']
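A minimal sketch of one way to get close to this output, assuming only the text of each sf-item DIV is needed. In the sample code, `result.contents[0:1]` reads just the first child of each DIV, so everything after the first nested tag is dropped. `get_text()` instead concatenates every descendant string, including the text inside <u> and <a>, so the tags may not need to be removed at all. The variable name `texts` and the whitespace normalization with `\s+` are illustrative choices, not part of the original code.

```python
import re
from bs4 import BeautifulSoup

data = """
<div class="sf-item"> The rabbit got to the halfway point at
<u><a href="https://DummyLocationURL/"> here </a></u> However, it couldn't see the turtle.
</div>
<div class="sf">
<div class="sf-item sf-icon">
<span class="supporticon is"></span>
</div>
<div class="sf-item"> He was hot and tired and decided to stop and take a short nap.
</div>
<div class="sf-item"> Even if the turtle passed him at
<u><a href="https://DummyLocationURL/">Link</a></u>. he would be able to race to the finish line ahead of
<u><a href="https://DummyLocationURL/">place</a></u>, he just kept going.
</div>
"""

soup = BeautifulSoup(data, "html.parser")
texts = []
for div in soup.select("div.sf-item"):
    # Collapse newlines and repeated spaces left over from the markup
    text = re.sub(r"\s+", " ", div.get_text()).strip()
    if text:  # skip DIVs with no text, such as the sf-icon one
        texts.append(text)
print(texts)
```

For a live page, the same loop should work on HTML fetched from the URL, provided the page uses the same sf-item class.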
[Discussion]:
Tags: python python-3.x beautifulsoup re