Extracting Text Between HTML Comments with BeautifulSoup: Answers

【Question title】: Extracting Text Between HTML Comments with BeautifulSoup
【Posted】: 2016-04-12 22:44:08
【Question】:

Using Python 3 and BeautifulSoup 4, I would like to extract text from an HTML page that is identified only by the comment above it. An example:

<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text

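I have found various ways to extract a page's text or its comments, but no way to do what I am looking for. Any help would be greatly appreciated.

【Comments on the question】:

Tags: python python-3.x web-scraping beautifulsoup

【Answer 1】:

You just need to iterate over all of the comments, check whether each one is among the entries you need, and then display the text of the element that follows it, like this: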

from bs4 import BeautifulSoup, Comment

html = """
<html>
<body>
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')

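# Find every comment node (the `string` argument was called `text` in older bs4 releases)
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if comment in ['UNIQUE COMMENT', 'SECOND UNIQUE COMMENT']:
        # next_element is the node parsed right after the comment, here the text
        print(comment.next_element.strip())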

This displays the following:

I would like to get this text
I would also like to find this text
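
The loop above reads only the single node right after each comment. If the text between two comments may also contain inline tags, one possible variation is to walk the comment's following siblings until the next comment. This is a minimal sketch, not part of the original answer: the helper name `text_after_comment`, the modified sample HTML with a `<b>` tag, and the choice of the built-in `html.parser` are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup, Comment, Tag

# Sample markup (assumed for illustration): the first text run contains a <b> tag
html = """
<html><body>
<!--UNIQUE COMMENT-->
I would like <b>to get</b> this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body></html>
"""


def text_after_comment(comment):
    """Hypothetical helper: join the text of all siblings up to the next comment."""
    parts = []
    for node in comment.next_siblings:
        if isinstance(node, Comment):
            break  # the next comment marks the end of this section
        # Tags need get_text(); plain strings can be used directly
        parts.append(node.get_text() if isinstance(node, Tag) else str(node))
    # Collapse newlines and repeated spaces into single spaces
    return " ".join("".join(parts).split())


soup = BeautifulSoup(html, "html.parser")
results = {
    str(comment): text_after_comment(comment)
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment))
}

for name, text in results.items():
    print(name, "->", text)
```

Note that this only looks at siblings, so it assumes the comments and the text they describe share the same parent element.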

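【Comments】:

I was just about to post this. +1
Exactly what I needed. Thank you very much.

【Answer 2】:

An improvement on Martin's answer: you can search for the specific comments directly. There is no need to iterate over all of the comments and then check their values; it can be done in one pass: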

comments_to_search_for = {'UNIQUE COMMENT', 'SECOND UNIQUE COMMENT'}
for comment in soup.find_all(string=lambda text: isinstance(text, Comment) and text in comments_to_search_for):
    print(comment.next_element.strip())

Prints:

I would like to get this text
I would also like to find this text
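
If the results are needed as data rather than printed output, the same filtered search could be wrapped in a small function that returns a mapping from comment to text. The sketch below is an assumption-laden illustration: `text_following` is a hypothetical name, the extra `IGNORED COMMENT` entry is added to the sample only to show the filter at work, and the `isinstance` guard is there because `next_element` can also be a tag, another comment, or `None`.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Sample markup (assumed): one extra comment that is not in the wanted set
html = """
<html>
<body>
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to get this text
<!--IGNORED COMMENT-->
Text that should be skipped
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
</html>
"""


def text_following(soup, wanted):
    """Hypothetical helper: map each wanted comment to the text node right after it."""
    found = {}
    matches = soup.find_all(
        string=lambda s: isinstance(s, Comment) and str(s) in wanted
    )
    for comment in matches:
        nxt = comment.next_element
        # Keep only plain text nodes; skip tags, comments, and a missing node
        if isinstance(nxt, NavigableString) and not isinstance(nxt, Comment):
            found[str(comment)] = nxt.strip()
    return found


soup = BeautifulSoup(html, "html.parser")
found = text_following(soup, {"UNIQUE COMMENT", "SECOND UNIQUE COMMENT"})
print(found)
```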

【Comments】:

【Answer 3】:

Python's bs4 module has a Comment class. You can use it to extract the comments.

from bs4 import BeautifulSoup, Comment

html = """
<html>
<body>
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
# Collect every comment node (the `string` argument was called `text` in older bs4 releases)
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

This gives you the comment elements:

[u'UNIQUE COMMENT', u'SECOND UNIQUE COMMENT']
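
To show why the `isinstance` check is needed, here is a small sketch of how `Comment` relates to ordinary text nodes. `Comment` is a subclass of `NavigableString`, so a comment and visible text look alike until their types are checked. The one-line sample markup and the use of `html.parser` are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

# Sample markup (assumed): one visible string and one comment inside a <p>
soup = BeautifulSoup("<p>visible<!--hidden note--></p>", "html.parser")

# Record the class name and content of each direct child of <p>
kinds = [(type(node).__name__, str(node)) for node in soup.p.children]
print(kinds)

# Comment is a NavigableString subclass, so test for Comment specifically
only_comments = [str(node) for node in soup.p.children if isinstance(node, Comment)]
print(only_comments)
```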

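【Comments】:

I think the OP is trying to extract the text between the comments, not the comments themselves.
"I would like to get this text" - this one?
Yes, that one. I was able to extract the comments themselves just fine.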