【Question Title】: Extracting text in between HTML tags [duplicate]
【Posted】: 2026-02-18 05:20:08
【Question Description】:

I have this string:

In December 2011, Norway's largest online sex shop hemmelig.com was <a href="http://www.dazzlepod.com/hemmelig/?page=93" target="_blank" rel="noopener">hacked by a collective calling themselves &quot;Team Appunity&quot;</a>. The attack exposed over 28,000 usernames and email addresses along with nicknames, gender, year of birth and unsalted MD5 password hashes.

(Don't ask)

Inside the string is an HREF link to the site itself. What I need to do is extract the text between the <a href=""></a> tags, so the end result should look like this:

In December 2011, Norway's largest online sex shop hemmelig.com was hacked by a collective calling themselves &quot;Team Appunity&quot;. The attack exposed over 28,000 usernames and email addresses along with nicknames, gender, year of birth and unsalted MD5 password hashes.

So far, what I've been able to do is match the whole tag with a regular expression and replace it with nothing:

import re

def get_unlinked_description(descrip):
    # Match anything that looks like a tag and replace it with nothing.
    html_tag_regex = re.compile(r"<.+>", re.I)
    return html_tag_regex.sub("", descrip)

However, as you'd expect, this deletes the entire link, including the text inside it:

In December 2011, Norway's largest online sex shop hemmelig.com was . The attack exposed over 28,000 usernames and email addresses along with nicknames, gender, year of birth and unsalted MD5 password hashes

How can I extract the text between the tags and remove the tags themselves, without deleting the text along with them?
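For reference, the deletion comes from `.+` being greedy: it runs from the first `<` to the last `>` in the string, so the link text between the opening and closing tags is swallowed too. A minimal sketch of a narrower pattern follows, assuming no attribute value contains a literal `>`. The sample string and the names `GREEDY_TAG` and `TAG` are made up for illustration, and a real HTML parser is the more robust choice.

```python
import re

GREEDY_TAG = re.compile(r"<.+>")   # pattern from the question: spans first '<' to last '>'
TAG = re.compile(r"<[^>]+>")       # stops at the first '>' after each '<'


def get_unlinked_description(descrip):
    """Remove tags but keep the text between them (regex sketch, not a parser)."""
    return TAG.sub("", descrip)


# Hypothetical sample input, shortened from the string in the question.
sample = 'site was <a href="http://example.com/" target="_blank">hacked by a group</a>. The end.'

print(GREEDY_TAG.sub("", sample))        # link text is lost
print(get_unlinked_description(sample))  # link text is kept
```

The character class `[^>]+` cannot cross a `>`, so each tag is matched on its own and the text between the tags is left alone.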

【Comments】:

    Tags: python python-2.7


    【Solution 1】:

    You are probably looking for Beautiful Soup.

    For your case, the code would be:

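    from bs4 import BeautifulSoup

    # Parse the fragment with Python's built-in HTML parser.
    soup = BeautifulSoup(html_doc, 'html.parser')

    # get_text() returns all text in the document with every tag removed.
    soup.get_text()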
    

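    html_doc is the string or document you want to parse, and 'html.parser' names the parser Beautiful Soup should use (here, the one built into Python's standard library).

    This should return the full text with the link markup removed and entities such as &quot; decoded: In December 2011, Norway's largest online sex shop hemmelig.com was hacked by a collective calling themselves "Team Appunity". The attack exposed over 28,000 usernames and email addresses along with nicknames, gender, year of birth and unsalted MD5 password hashes.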
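If installing Beautiful Soup is not an option, roughly the same result can be sketched with the standard library's `html.parser` module, which is the parser that `'html.parser'` selects. This is only an illustration, written for Python 3 (the question is tagged python-2.7, where the module is named `HTMLParser`). The class name `TextExtractor`, the function name `strip_tags`, and the sample string are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &quot; in text nodes.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.chunks.append(data)


def strip_tags(markup):
    """Return the text of `markup` with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


# Hypothetical sample input, shortened from the string in the question.
sample = 'was <a href="http://example.com/" target="_blank">hacked by &quot;Team&quot;</a>. The end.'
print(strip_tags(sample))
```

Because only `handle_data` is overridden, start and end tags are parsed and then dropped, which leaves just the text between them.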

    【Comments】: