Regular expressions: extract text between two markers

【Question title】: Regular expressions: extract text between two markers
【Posted】: 2014-08-17 21:11:28
【Question】:

I am trying to write a Python parser to extract some information from an HTML page.

It should extract the text between <p itemprop="xxx"> and </p>.

I use this regular expression:

m = re.search(ur'p>(?P<text>[^<]*)</p>', html)

But if there is another tag between them, the match fails. For example:

<p itemprop="xxx"> some text <br/> another text </p>

As I understand it, [^<] only excludes a single character. How do I write "anything except </p>"?

【Comments】:

Use an HTML parser such as Beautiful Soup. Regular expressions are not suited to this kind of parsing.
*.com/a/1732454/699864
See *.com/questions/1732348/…

Tags: python regex

【Solution 1】:

You can use:

m = re.search(ur'p>(?P<text>.*?)</p>', html)

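This is a lazy (non-greedy) match: it matches everything up to the first </p>. You should also consider using an HTML parser such as BeautifulSoup, which, once installed, can be used with CSS selectors like this: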

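from bs4 import BeautifulSoup
# Name the parser explicitly; otherwise bs4 picks one and emits a warning.
soup = BeautifulSoup(html, 'html.parser')
# select() returns a list of matching tags; call .get_text() on one for its text.
m = soup.select('p[itemprop="xxx"]')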
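If installing BeautifulSoup is not an option, the same idea can be roughly sketched with the standard library's html.parser. This is only an illustration written for Python 3, not part of the answer above: the class name ItempropText and its attributes are hypothetical, and it assumes matching <p> elements are not nested inside each other.

```python
from html.parser import HTMLParser


class ItempropText(HTMLParser):
    """Collect the text inside <p itemprop="..."> elements (hypothetical helper)."""

    def __init__(self, itemprop):
        super().__init__()
        self.itemprop = itemprop
        self.inside = False   # True while we are inside a matching <p>
        self.chunks = []      # text pieces of the paragraph being read
        self.results = []     # one string per matching paragraph

    def handle_starttag(self, tag, attrs):
        if tag == 'p' and dict(attrs).get('itemprop') == self.itemprop:
            self.inside = True
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == 'p' and self.inside:
            self.inside = False
            # Join the pieces and collapse runs of whitespace.
            self.results.append(' '.join(''.join(self.chunks).split()))

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


parser = ItempropText('xxx')
parser.feed('<p itemprop="xxx"> some text <br/> another text </p>')
parser.close()
print(parser.results)  # ['some text another text']
```

Inner tags such as <br/> are simply skipped here, so only the text content survives, which is what the question was after.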

【Comments】:

A small correction: .* is a greedy match, and .*? is a non-greedy match. You specified .*? correctly, but the description is wrong.
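The difference between the two quantifiers is easiest to see on input with more than one paragraph. A minimal Python 3 sketch (the sample string is made up for illustration):

```python
import re

# Hypothetical sample input with two paragraphs, for illustration only.
html = '<p>hello</p> and <p>goodbye</p>'

# Greedy: .* runs to the end of the string, then backtracks to the LAST </p>.
greedy = re.search(r'<p>(?P<text>.*)</p>', html).group('text')

# Non-greedy: .*? expands one character at a time and stops at the FIRST </p>.
lazy = re.search(r'<p>(?P<text>.*?)</p>', html).group('text')

print(repr(greedy))  # 'hello</p> and <p>goodbye'
print(repr(lazy))    # 'hello'
```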

【Solution 2】:

1) Never parse HTML with regular expressions.

2) The following regular expression will sometimes work on some HTML:

#!/usr/bin/python2.7

import re

pattern = ur'''
    (?imsx)             # ignore case, multiline, dot-matches-newline, verbose
    <p.*?>              # match first marker
    (?P<text>.*?)       # non-greedy match anything
    </p.*?>             # match second marker
'''

print re.findall(pattern, '<p>hello</p>')
print re.findall(pattern, '<p>hello</p> and <p>goodbye</p>')
print re.findall(pattern, 'before <p>hello</p> and <p><i>good</i>bye</p> after')
print re.findall(pattern, '<p itemprop="xxx"> some text <br/> another text </p>')

As the other answer points out, .*? is a non-greedy pattern that matches any character.
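The question also asked how to write "anything except </p>" literally. Neither answer does it this way, but one option is a negative lookahead placed in front of each consumed character. The Python 3 sketch below assumes well-formed, non-nested <p> elements and is just as fragile on real-world HTML as the patterns above; the sample string is made up.

```python
import re

# (?:(?!</p>).)* consumes one character at a time, but only while the text
# at that position does not start with "</p>".
pattern = re.compile(
    r'<p\b[^>]*>(?P<text>(?:(?!</p>).)*)</p>',
    re.IGNORECASE | re.DOTALL,  # DOTALL lets "." cross line breaks
)

html = '<p itemprop="xxx"> some text <br/>\n another text </p><p>next</p>'
texts = [m.group('text') for m in pattern.finditer(html)]
print(texts)  # [' some text <br/>\n another text ', 'next']
```

For a single closing marker this gives the same result as .*? followed by </p>; the lookahead form mainly makes the "except </p>" intent explicit.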

【Comments】: