How to find the comment tag <!--...--> with BeautifulSoup?

【Question title】: How to find the comment tag <!--...--> with BeautifulSoup?
【Posted】: 2011-08-29 01:34:39
【Question】:

I tried soup.find('!--') but it doesn't seem to work. Thanks in advance.

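Edit: Thanks for the tips on how to find all comments. I have a follow-up question: how do I search inside one specific comment?

For example, I have the following comment tag:

All I really want is the text Wednesday 110518. "110518" is a date in YYMMDD format, and it is what I intend to use as the search target. However, I don't know how to find something inside a specific comment tag.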

【Comments】:

Tags: python html tags beautifulsoup

【Answer 1】:

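You can find all the comments in a document with the findAll method. See this example, "Removing elements", which shows how to do exactly what you are trying to do.

In short, you want this: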

from bs4 import Comment  # with BeautifulSoup 3: from BeautifulSoup import Comment
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
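For a self-contained illustration, here is a minimal sketch that assumes BeautifulSoup 4 (the `bs4` package) on Python 3. The sample markup and the names `html` and `soup` are made up for this example. In bs4, `find_all` is the current spelling of `findAll`, and `string=` is the newer name of the `text=` argument.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup, modeled on the comment discussed in the question.
html = """
<p>visible text</p>
<!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> -->
<!-- an unrelated comment -->
"""

soup = BeautifulSoup(html, "html.parser")

# Comment is a NavigableString subclass, so the isinstance check
# keeps comment nodes and skips ordinary text nodes.
comments = soup.find_all(string=lambda node: isinstance(node, Comment))

for comment in comments:
    # Each comment behaves like a string holding the text between <!-- and -->.
    print(repr(str(comment)))
```

This also shows why soup.find('!--') returns nothing: a comment is stored as a special kind of string node, not as a tag with a name.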

Edit: If you are trying to search within each comment, you can try:

import re
from bs4 import Comment

comments = soup.findAll(text=lambda text: isinstance(text, Comment))
for comment in comments:
    # re.search scans the whole comment; re.match only matches at its start
    m = re.search(r'<i>([^<]*)</i>', comment)
    if m:
        print(m.group(1))
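To target one particular date, as the edited question asks, the same loop can be wrapped in a small helper. This is only a sketch under assumptions: it uses BeautifulSoup 4 on Python 3, the helper name `find_italic_text_for_date` and the sample markup are hypothetical, and it relies on the date appearing inside an `<i>` element within the comment.

```python
import re
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup with two commented-out entries.
html = """
<!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> -->
<!-- <span class="titlefont"> <i>Thursday 110521</i>(05:00PM)<br /></span> -->
"""

def find_italic_text_for_date(soup, yymmdd):
    """Return the <i>...</i> text of the first comment mentioning yymmdd, or None."""
    # re.escape guards against special characters in the search target.
    pattern = re.compile(r"<i>([^<]*\b%s\b[^<]*)</i>" % re.escape(yymmdd))
    for comment in soup.find_all(string=lambda node: isinstance(node, Comment)):
        match = pattern.search(comment)
        if match:
            return match.group(1)
    return None

soup = BeautifulSoup(html, "html.parser")
print(find_italic_text_for_date(soup, "110518"))
print(find_italic_text_for_date(soup, "999999"))
```

Returning None for a date that is not present keeps the caller from hitting an AttributeError on a failed match.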

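【Comments】:

What about searching for one specific comment? I am trying to find a comment in an HTML file that contains 110518, which is just a date in yymmdd form. How can I search only the information in that comment tag, and specifically only the text inside the <i> element?
@1stsage Maybe you would like to add that requirement to your question.
1stsage, I updated my post for your specific case. Next time, please make sure your question includes what you want to do.
@1stsage As for searching the contents of a comment: if it is valid HTML, you can parse it as well. Or you can use string methods or even regular expressions. With such a small block of text and such a simple requirement, I would go with a regex (something like r'<i>(.*?)</i>').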
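The "parse it as well" option from the last comment could look roughly like the sketch below. It assumes BeautifulSoup 4 on Python 3 and that the comment body is well-formed HTML. The sample markup and the names `outer` and `inner` are made up for illustration.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical sample markup: the interesting HTML is hidden inside a comment.
html = (
    '<div><!-- <span class="titlefont"> <i>Wednesday 110518</i>'
    '(05:00PM)<br /></span> --></div>'
)

outer = BeautifulSoup(html, "html.parser")
comment = outer.find(string=lambda node: isinstance(node, Comment))

# Parse the comment's text as a second, independent document.
inner = BeautifulSoup(str(comment), "html.parser")
italic = inner.find("span", class_="titlefont").find("i")

# The <i> element holds "<weekday> <yymmdd>", separated by whitespace.
day, date = italic.get_text().split()
print(day, date)
```

Compared with the regex, this costs a second parse but lets you filter on the span's class and navigate the markup with the usual BeautifulSoup calls.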

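【Answer 2】:

Pyparsing lets you search for HTML comments using its built-in htmlComment expression, and attach parse-time callbacks to validate and extract the various data fields inside the comment: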

from pyparsing import makeHTMLTags, oneOf, withAttribute, Word, nums, Group, htmlComment
import calendar

# have pyparsing define tag start/end expressions for the 
# tags we want to look for inside the comments
span,spanEnd = makeHTMLTags("span")
i,iEnd = makeHTMLTags("i")

# only want spans with class=titlefont
span.addParseAction(withAttribute(**{'class':'titlefont'}))

# define what specifically we are looking for in this comment
weekdayname = oneOf(list(calendar.day_name))
integer = Word(nums)
dateExpr = Group(weekdayname("day") + integer("daynum"))
commentBody = '<!--' + span + i + dateExpr("date") + iEnd

# define a parse action to attach to the standard htmlComment expression,
# to extract only what we want (or raise a ParseException in case 
# this is not one of the comments we're looking for)
def grabCommentContents(tokens):
    return commentBody.parseString(tokens[0])
htmlComment.addParseAction(grabCommentContents)


# let's try it
htmlsource = """
want to match this one
<!-- <span class="titlefont"> <i>Wednesday 110518</i>(05:00PM)<br /></span> -->

don't want the next one, wrong span class
<!-- <span class="bodyfont"> <i>Wednesday 110519</i>(05:00PM)<br /></span> -->

not even a span tag!
<!-- some other text with a date in italics <i>Wednesday 110520</i>(05:00PM)<br /></span> -->

another matching comment, on a different day
<!-- <span class="titlefont"> <i>Thursday 110521</i>(05:00PM)<br /></span> -->
"""

for comment in htmlComment.searchString(htmlsource):
    parsedDate = comment.date
    # date info can be accessed like elements in a list
    print(parsedDate[0], parsedDate[1])
    # because we named the expressions within the dateExpr Group
    # we can also get at them by name (this is much more robust, and 
    # easier to maintain/update later)
    print(parsedDate.day)
    print(parsedDate.daynum)
    print()

Prints:

Wednesday 110518
Wednesday
110518

Thursday 110521
Thursday
110521

【Comments】:

Recent versions of pyparsing now include withClass, which simplifies the withAttribute ugliness.
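As a rough sketch of what that looks like, the snippet below swaps withAttribute(**{'class':'titlefont'}) for withClass("titlefont"). It assumes a pyparsing version that still exports the camelCase names used in the answer above. The sample `text` is made up for this example.

```python
from pyparsing import makeHTMLTags, withClass

span, spanEnd = makeHTMLTags("span")

# Same filter as withAttribute(**{'class': 'titlefont'}), without the
# dict-unpacking workaround needed because "class" is a Python keyword.
span.addParseAction(withClass("titlefont"))

# Hypothetical input: only the second opening tag has the wanted class.
text = '<span class="bodyfont">no</span> <span class="titlefont">yes</span>'

matches = span.searchString(text)
print(len(matches))
```

The parse action rejects any opening span tag whose class does not match, so searchString skips the bodyfont span and reports only the titlefont one.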