Python Regex extract html file content within tags: answers

【Question title】: Python Regex extract html file content within tags
【Posted】: 2017-05-28 00:43:20
【Question】:

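I have many HTML files in a folder. I need to check whether each of them contains this tag:

<strong>QQ</strong>

and extract only "QQ" and its content. As a test I first read a single file, but my regular expression does not seem to match. If I replace fo_read with the literal tag

<strong>QQ</strong>

it does match.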

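import re

# Read the whole file; the with-block closes it automatically.
with open('4251-fu.html', 'r') as fo:
    fo_read = fo.read()

# Look for the exact tag, with no whitespace between the tag and its text.
m = re.search(r'<strong>(QQ)</strong>', fo_read)
if m:
    print('Match found: ' + m.group(1))
else:
    print('No match')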
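
One possible explanation, assuming the text inside the tag is surrounded by whitespace or line breaks in the real file (the output quoted in the discussion of Solution 1 points that way): the pattern requires the text to touch the tags directly. A minimal sketch on hypothetical sample content, allowing optional whitespace with \s*:

```python
import re

# Hypothetical file content: the label sits on its own line inside the tag.
sample = "<p><strong>\n  QQ\n </strong> some text</p>"

# The original pattern: no whitespace allowed around the label.
strict = re.search(r"<strong>(QQ)</strong>", sample)

# Same pattern, but tolerating whitespace (including newlines) around the label.
tolerant = re.search(r"<strong>\s*(QQ)\s*</strong>", sample)

print(strict)             # None
print(tolerant.group(1))  # QQ
```

This only covers whitespace; attributes on the tag or nested markup would still defeat the pattern, which is the usual argument for a parser.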

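【Comments】:

Have you considered using an HTML parser? Using regex to parse HTML is scary.
I have BeautifulSoup, but there are several strong tags in the HTML. How would it work?
If you have multiple tags, that is one more reason to use an HTML parser. I am not familiar with the topic, but the BS4 docs or the standard html module docs (oops: python2 for you) and some targeted Googling should help.
Do you need to extract the Question-and-Answer Session or the text below it? If the former, one paragraph, several, a section.. until the next closing div, etc...?
I need the text after it. But I thought I might first need to confirm whether the file has this tag, because not all of them do.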
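
The "does this file have the tag at all" check can also be sketched with the standard library parser mentioned above. This assumes Python 3 (where the module is html.parser); the class StrongTextCollector and the function has_strong_label are hypothetical names invented for this illustration, and the sample string stands in for a real file.

```python
from html.parser import HTMLParser


class StrongTextCollector(HTMLParser):
    """Collect the text of every <strong> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # > 0 while inside a <strong> element
        self._buffer = []    # text pieces of the current <strong>
        self.labels = []     # whitespace-normalized text of each <strong>

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._depth += 1

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "strong" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Collapse whitespace so "\n QQ\n " compares equal to "QQ".
                self.labels.append(" ".join("".join(self._buffer).split()))
                self._buffer = []


def has_strong_label(html_text, label):
    """Return True if some <strong> element's text equals label."""
    parser = StrongTextCollector()
    parser.feed(html_text)
    parser.close()
    return label in parser.labels


# Hypothetical content standing in for one of the files.
sample = "<p><strong>Intro</strong></p><p><strong>\n QQ\n </strong> text</p>"
print(has_strong_label(sample, "QQ"))       # True
print(has_strong_label(sample, "Missing"))  # False
```

To cover a whole folder, the same function could be called on the contents of each file in turn and the files that return False skipped.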

Tags: python html regex

【Solution 1】:

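import re
from bs4 import BeautifulSoup

# Parse the file; "lxml" also works as the parser if it is installed.
with open('4251-fu.html', 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Find the first <strong> tag whose text contains the heading.
result = soup.find("strong", string=re.compile("Question-and-Answer Session"))
if result:
    print("Question-and-Answer Session")
    # for the rest of text in the parent
    rest = result.parent.text.split("Question-and-Answer Session")[-1].strip()
    print(rest)
else:
    print("no match")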

【Discussion】:

This returns [u'\n Question-and-Answer Session\n ']. How can I get just Question-and-Answer Session?
You can add a .strip() at the end of result.parent.text.split(...)[-1].
Splitting is a bit fiddly; for any serious project, try next_sibling....crummy.com/software/BeautifulSoup/bs4/doc/#going-sideways
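
The next_sibling idea could look roughly like the sketch below. It assumes the wanted text sits in the same parent element as the strong tag, which the thread does not confirm; the markup and the helper text_after are made up for illustration.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup; the real files may be structured differently.
html = """
<div>
  <p><strong>
    Question-and-Answer Session
  </strong> Operator: first question. <em>Analyst:</em> thanks.</p>
  <p><strong>Other</strong> unrelated</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
heading = soup.find("strong", string=re.compile("Question-and-Answer Session"))


def text_after(tag):
    """Join the text of everything that follows tag inside its parent."""
    parts = []
    for sibling in tag.next_siblings:
        # Siblings are either nested tags or plain strings.
        parts.append(sibling.get_text() if isinstance(sibling, Tag) else str(sibling))
    # Collapse runs of whitespace and trim both ends.
    return " ".join("".join(parts).split())


label = heading.get_text(strip=True)
rest = text_after(heading)
print(label)  # Question-and-Answer Session
print(rest)   # Operator: first question. Analyst: thanks.
```

Compared with splitting the parent's text on the heading string, this does not depend on the heading text being unique within the parent.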

【Solution 2】:

You can try BeautifulSoup:

from bs4 import BeautifulSoup

# Parse the file with the lxml parser (requires the lxml package).
with open('4251-fu.html', mode='r') as f:
    soup = BeautifulSoup(f, 'lxml')

# Serialize every <strong> tag in the document back to a string.
search_result = [str(e) for e in soup.find_all('strong')]
print(search_result)
if '<strong>Question-and-Answer Session</strong>' in search_result:
    print('Match found')
else:
    print('No match')

Output:

['<strong>Question-and-Answer Session1</strong>', '<strong>Question-and-Answer Session</strong>', '<strong>Question-and-Answer Session3</strong>']
Match found

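【Discussion】:

There are several strong tags, but I only want the one with Question-and-Answer Session.
But the strong tag is in different places; it is not always at the beginning.
It finds every strong tag in the HTML file, wherever it is.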
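
To keep only the one tag of interest among several, a possible sketch is to compare each tag's stripped text instead of its serialized form. The markup below is invented for the example, and the exact-match comparison is an assumption about what "the one with Question-and-Answer Session" means.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with several <strong> tags in different places.
html = """
<div><strong>Question-and-Answer Session1</strong></div>
<p>intro <strong>
  Question-and-Answer Session
</strong> answers follow</p>
<p><strong>Question-and-Answer Session3</strong></p>
"""

soup = BeautifulSoup(html, "html.parser")
target = "Question-and-Answer Session"

# get_text(strip=True) drops surrounding whitespace, so the position of the
# tag and any line breaks inside it do not affect the comparison.
matches = [tag for tag in soup.find_all("strong")
           if tag.get_text(strip=True) == target]

print(len(matches))            # 1
print(matches[0].parent.name)  # p
```

Comparing text rather than str(tag) also avoids false negatives when the tag carries attributes or inner whitespace.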