Answers: python parsing url after string

【Question title】: python parsing url after string
【Posted】: 2010-03-01 10:41:10
【Question】:

I want to extract a string from the page behind a URL (link). The string sits inside an <h3></h3> tag.

 link = http://www.test.com/page.html

 Content of link: <h3>Text here</h3>

What would be an elegant way to first get the content/source code of page.html and then extract that <h3> text? Thanks!

【Comments】:

Tags: python regex parsing

【Answer 1】:

I recommend Beautiful Soup. It is a very good parser for HTML pages, and in most cases you do not have to worry about badly formed markup.
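
As a rough illustration of this suggestion, here is a minimal sketch using the `bs4` package (Beautiful Soup 4, installed separately with `pip install beautifulsoup4`). The inline `html` string is an assumption standing in for the downloaded source of page.html, so the example runs without network access.

```python
# Requires the third-party package: pip install beautifulsoup4
from bs4 import BeautifulSoup

# Assumed sample markup standing in for the downloaded source of page.html.
html = "<html><body><h3>Text here</h3><p>Other content</p></body></html>"

# "html.parser" is the parser bundled with Python, so no extra parser is needed.
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if the page has no <h3>.
h3 = soup.find("h3")
text = h3.get_text(strip=True) if h3 is not None else None
print(text)
```

If the page may contain several headings, `soup.find_all("h3")` returns all of them as a list.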

【Comments】:

【Answer 2】:

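# Python 2 code: urllib2 and the print statement do not exist in Python 3.
import urllib2

url = "http://www.test.com/page.html"
page = urllib2.urlopen(url)
data = page.read()
# Each chunk that ends just before a closing </h3> may contain a heading.
for item in data.split("</h3>"):
    if "<h3>" in item:
        # Keep only what follows the opening <h3>.
        print item.split("<h3>")[1]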
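
The snippet above targets Python 2. A possible Python 3 version of the same split-based idea is sketched below; the function names `fetch` and `h3_texts`, the UTF-8 decoding, and the sample string are assumptions made for illustration. The parsing step is kept separate from the download so it can be tried offline.

```python
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen


def fetch(url):
    """Download a page and decode it to text (UTF-8 is an assumption)."""
    with urlopen(url) as page:
        return page.read().decode("utf-8", errors="replace")


def h3_texts(data):
    """Return the text after each <h3>, using the same split-based approach."""
    results = []
    for item in data.split("</h3>"):
        if "<h3>" in item:
            results.append(item.split("<h3>")[1])
    return results


# Offline demonstration on a sample string; with network access one could
# call h3_texts(fetch("http://www.test.com/page.html")) instead.
sample = "<p>intro</p><h3>Text here</h3><p>more</p><h3>Second</h3>"
print(h3_texts(sample))
```

Because it looks for the literal string `<h3>`, this approach misses headings that carry attributes, such as `<h3 class="title">`.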

【Comments】:

【Answer 3】:

You can use urllib2 to retrieve the content of the URL:

http://docs.python.org/library/urllib2.html

Then you can use the HTML parser from the Python standard library to find the content you need:

http://docs.python.org/library/htmlparser.html
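
To make the second step more concrete, here is a minimal sketch of a standard-library parser subclass, written for Python 3, where the module is `html.parser`. The class name `H3Collector` and the sample markup are assumptions for illustration; in practice the string passed to `feed()` would be the downloaded page.

```python
from html.parser import HTMLParser  # HTMLParser.HTMLParser in Python 2


class H3Collector(HTMLParser):
    """Collect the text found inside every <h3> element."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are ignored here.
        if tag == "h3":
            self.in_h3 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current heading.
        if self.in_h3:
            self.headings[-1] += data


parser = H3Collector()
parser.feed('<h3 class="title">Text here</h3><p>skip</p><h3>Second</h3>')
parser.close()
print(parser.headings)
```

Unlike plain string matching, this handles headings with attributes, at the cost of a little more code.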

【Comments】:

【Answer 4】:

If the text you want is the only <h3>-wrapped text on the page, try:

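from urllib2 import urlopen  # Python 2 module
from re import search
link = "http://www.test.com/page.html"  # the page URL from the question
# The lookarounds keep the <h3> tags out of the match; .+? is non-greedy.
text = search(r'(?<=<h3>).+?(?=</h3>)', urlopen(link).read()).group(0)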

If there are several <h3>-wrapped strings, you can add more detail to the pattern or use re.finditer()/re.findall().
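
For the multiple-heading case, a small sketch with `re.findall()` might look like this. The sample `html` string is an assumption, and the pattern uses a capturing group rather than lookarounds, which is one option among several.

```python
import re

# Assumed sample markup with two headings.
html = "<h3>Title</h3><p>body</p><h3>Other title</h3>"

# Non-greedy .+? stops at the first </h3>, so each heading is matched separately.
headings = re.findall(r"<h3>(.+?)</h3>", html)
print(headings)

# A greedy .+ runs to the last </h3> and swallows everything in between.
greedy = re.findall(r"<h3>(.+)</h3>", html)
print(greedy)
```

Note that `.` does not match newlines by default, so headings that span several lines would need the `re.DOTALL` flag.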

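【Comments】:

You should use a non-greedy qualifier; otherwise it may match something like "<h3>title</h3>........<h3>other title</h3>".

The OP's task is just to get the <h3> tags; a regex is fine for that.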