【Question title】: Remove all the <u> and <a> tags from within all <div> tags of a class using BeautifulSoup or re
【Posted】: 2023-03-29 11:47:01
【Question】:

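I am trying to remove the <u> and <a> tags from every DIV tag with the class "sf-item" in an HTML source, because they break up the text when I scrape it from a web URL.

(For this demo I pass a sample HTML string to BeautifulSoup, but ideally the source would be a web URL.)

So far I have tried re with the line below, but I am not sure how to express the condition in re: remove every substring between <u and /u>, but only inside DIV tags of class sf-item.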

data = re.sub('<u.*?u>', '', data)

I also tried the lines below to remove all <u> and <a> tags from the entire source, but somehow that does not work. I am unsure how to target only the <u> and <a> tags inside DIV tags with the class sf-item.

for tag in soup.find_all('u'):
    tag.replaceWith('')

I would appreciate any help in achieving this.

Below is a runnable sample of my Python code -

from re import sub
from bs4 import BeautifulSoup
import re

data = """
<div class="sf-item"> The rabbit got to the halfway point at   
<u><a href="https://DummyLocationURL/"> here </a></u> However, it couldn't see the turtle. 
</div>
<div class="sf">
<div class="sf-item sf-icon">
<span class="supporticon is"></span>
</div>
<div class="sf-item"> He was hot and tired and decided to stop and take a short nap. 
</div>
<div class="sf-item"> Even if the turtle passed him at 
<u><a href="https://DummyLocationURL/">Link</a></u>. he would be able to race to the finish line ahead of 
<u><a href="https://DummyLocationURL/">place</a></u>, he just kept going.
</div>
"""

# data = re.sub('<u.*?u>', '', data)  ## This works for this particular string but I cannot use on a web url
# It would solve if I can somehow specify to remove <u> and <a> only within DIV of class sf-item

soup = BeautifulSoup(data, "html.parser")

for tag in soup.find_all('u'):
    tag.replaceWith('')

fResult = []
rMessage=soup.findAll("div",{'class':"sf-item"})

for result in rMessage:
    fResult.append(sub("&ldquo;|.&rdquo;","","".join(result.contents[0:1]).strip()))

fResult = list(filter(None, fResult))
print(fResult)

The output I get from the code above is

['The rabbit got to the halfway point at', 'He was hot and tired and decided to stop and take a short nap.', 'Even if the turtle passed him at']

But I need the following output -

['The rabbit got to the halfway point at here However, it couldnt see the turtle.', 'He was hot and tired and decided to stop and take a short nap.', 'Even if the turtle passed him at Link. he would be able to race to the finish line ahead of place, he just kept going.']

【Comments】:

    Tags: python python-3.x beautifulsoup re


    【Solution 1】:

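    BeautifulSoup has a built-in way to get the text of a tag: the .text property (equivalent to get_text()) returns the text of the tag and all of its descendants, which is roughly what a browser would display. You do not need to remove the <u> and <a> tags first. Running the following code, I get the expected output: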

    import re
    from bs4 import BeautifulSoup
    
    data = """
    <div class="sf-item"> The rabbit got to the halfway point at   
    <u><a href="https://DummyLocationURL/"> here </a></u> However, it couldn't see the turtle. 
    </div>
    <div class="sf">
    <div class="sf-item sf-icon">
    <span class="supporticon is"></span>
    </div>
    <div class="sf-item"> He was hot and tired and decided to stop and take a short nap. 
    </div>
    <div class="sf-item"> Even if the turtle passed him at 
    <u><a href="https://DummyLocationURL/">Link</a></u>. he would be able to race to the finish line ahead of 
    <u><a href="https://DummyLocationURL/">place</a></u>, he just kept going.
    </div>
    """
    
    soup = BeautifulSoup(data, "html.parser")
    
    rMessage = soup.find_all("div", class_="sf-item")  # find_all is the current name for findAll
    
    fResult = []
    
    for result in rMessage:
        fResult.append(result.text.replace('\n', ''))
    

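    This gives you the right text, but with some extra spaces (and an empty string for the icon-only div, which filter(None, ...) from the question removes). If you want to collapse each run of spaces into a single space, you can run fResult through the following: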

    fResult = [re.sub(' +', ' ', result) for result in fResult]
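    If you also want the <u> and <a> tags gone from the parsed tree itself, and only inside the sf-item divs, the following is a minimal sketch of one way to do it. It assumes the same sample markup as the question; the names html and extract_sf_item_text are made up for this example. It uses unwrap(), which removes a tag but keeps its contents in place, and then collapses all whitespace (not just spaces) and drops empty results. For extracting text alone the unwrap step is not required, since get_text() already ignores the tags.

```python
import re
from bs4 import BeautifulSoup

# Sample markup from the question (assumption: real input would come from a web URL).
html = """
<div class="sf-item"> The rabbit got to the halfway point at
<u><a href="https://DummyLocationURL/"> here </a></u> However, it couldn't see the turtle.
</div>
<div class="sf">
<div class="sf-item sf-icon">
<span class="supporticon is"></span>
</div>
<div class="sf-item"> He was hot and tired and decided to stop and take a short nap.
</div>
<div class="sf-item"> Even if the turtle passed him at
<u><a href="https://DummyLocationURL/">Link</a></u>. he would be able to race to the finish line ahead of
<u><a href="https://DummyLocationURL/">place</a></u>, he just kept going.
</div>
"""


def extract_sf_item_text(markup):
    """Return the cleaned text of every div whose class list includes sf-item."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for div in soup.find_all("div", class_="sf-item"):
        # unwrap() drops the <u>/<a> tag itself but leaves its children in the tree,
        # and it only touches tags that sit inside this particular div.
        for tag in div.find_all(["u", "a"]):
            tag.unwrap()
        # Collapse newlines and repeated spaces, then trim both ends.
        text = re.sub(r"\s+", " ", div.get_text()).strip()
        if text:  # skip divs with no text, e.g. the icon-only one
            results.append(text)
    return results


print(extract_sf_item_text(html))
```

    Unlike replaceWith(''), which throws away the tag together with the text inside it, unwrap() keeps the link text, so "here", "Link" and "place" stay in the sentences.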

    【Comments】:
