Regex, finding all "href" in <a> tags [duplicate] (answers)

【Question title】: regex, finding all "href" in <a> tags [duplicate]
【Posted】: 2014-01-17 10:42:39
【Question】:

I have a regex that searches for the "href" attribute in <a> tags, but it is not doing what I want yet:

<a[^>]* href="([^"]*)"

Given this input:

<a href="http://something" title="Development of the Python language and website">Core Development</a>

it matches this:

<a href="http://something"

But I only need to find:

http://something

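【Question comments】:

Can you show us your Python code?
Required reading: stackoverflow.com/questions/1732348/…
You could try one of the DOM parsers suggested here: stackoverflow.com/questions/2782097/…
The user already asked the same question on December 27: stackoverflow.com/questions/20802691/…

Tags: python regex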

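【Solution 1】:

This seems to work for me. The pattern has a single capture group, so re.findall returns only the URL inside the quotes rather than the whole match. You can check the working demo yourself.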

matches = re.findall(r'<a[^>]* href="([^"]*)"', html)
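A minimal sketch of why this works, assuming the sample input from the question. The variable name `html` follows the line above, and the second link is added here only for illustration. The whole match includes the tag prefix, while the capture group holds just the URL:

```python
import re

# Sample input; the second link is added here only for illustration.
html = '''
<a href="http://something" title="Development of the Python language and website">Core Development</a>
<a class="nav" href="http://something.com">Home</a>
'''

pattern = r'<a[^>]* href="([^"]*)"'

# With exactly one capture group, findall() returns the group contents.
matches = re.findall(pattern, html)
print(matches)

# group(0) is the whole match; group(1) is only the URL.
first = re.search(pattern, html)
print(first.group(0))
print(first.group(1))
```

So the original pattern was probably fine; reading group 1 (or using findall) instead of the whole match gives the bare URL.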

That said, I would use Beautiful Soup for this instead...

from bs4 import BeautifulSoup

html = '''
<a href="http://something" title="Development of the Python language and website">Core Development</a>
<a href="http://something.com" title="Development of the Python language and website">Core Development</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# href=True keeps only the <a> tags that actually have an href attribute
for a in soup.find_all('a', href=True):
    print(a['href'])

Note: if you are using an older version of Beautiful Soup (version 3), use the following loop header instead:

for a in soup.findAll('a', href=True):
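If installing a third-party package is not an option, the same parser-based idea can be sketched with the standard library's `html.parser` module on Python 3. This is a swapped-in alternative, not part of the original answer, and the class name `HrefCollector` is hypothetical:

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and names are lowercased.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


html = '''
<a href="http://something" title="Development of the Python language and website">Core Development</a>
<a href="http://something.com" title="Development of the Python language and website">Core Development</a>
<a name="anchor-without-href">No link</a>
'''

collector = HrefCollector()
collector.feed(html)
print(collector.hrefs)
```

The third tag has no href, so it is skipped, much like `href=True` does in the Beautiful Soup version.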

【Comments】:

href=True is a neat shortcut =). By the way, use soup.findAll() in bs3 and soup.find_all() in bs4: stackoverflow.com/questions/20218788/…
href="([^"]*)"

【Solution 2】:

Try this:

re.findall(r'(?<=<a href=")[^"]*',yourStr)
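A short sketch of how the lookbehind behaves, assuming a hypothetical `yourStr` with two links. Python's `re` module requires lookbehinds to be fixed-width, so this pattern only matches when `href` is the first attribute, directly after `<a `:

```python
import re

# Hypothetical input: href is the first attribute in one tag but not the other.
yourStr = (
    '<a href="http://something" title="t">Core Development</a>\n'
    '<a class="nav" href="http://something.com">Home</a>'
)

# (?<=...) asserts that '<a href="' precedes the match without consuming it,
# so the match itself is just the URL.
urls = re.findall(r'(?<=<a href=")[^"]*', yourStr)
print(urls)
```

Only the first link is returned here; the second is missed because `class` sits between `<a` and `href`.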

【Comments】:

【Solution 3】:

There is no need to reinvent the wheel; you can use http://www.crummy.com/software/BeautifulSoup/

$ sudo pip install beautifulsoup4
$ python
>>> html_doc = """
... <html><head><title>The Dormouse's story</title></head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
... 
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
... <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p>
... 
... <p class="story">...</p>
... """
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, 'html.parser')
>>> href = [i.get('href') for i in soup.find_all('a')]
>>> href
['http://example.com/elsie', 'http://example.com/lacie', 'http://example.com/tillie']

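If you would rather not install the beautifulsoup4 package, you can download the older version (Beautiful Soup 3, which runs on Python 2 only) from http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.1.tar.gz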

$ wget http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.1.tar.gz
$ tar xvzf BeautifulSoup-3.2.1.tar.gz
$ cp BeautifulSoup-3.2.1/BeautifulSoup.py .
$ python
>>> import BeautifulSoup

【Comments】:

【Solution 4】:

You can also use (http[s]?:[^"\s]*)
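A sketch of this pattern with `re.findall`, assuming a hypothetical `text` sample. It matches anything that looks like an http or https URL up to the next double quote or whitespace, so it also picks up URLs outside `href` attributes and skips relative links:

```python
import re

# Hypothetical input containing a link, an image source, and a relative link.
text = (
    '<a href="http://something" title="t">Core Development</a>\n'
    '<img src="https://example.com/logo.png">\n'
    '<a href="/relative/path">Relative</a>'
)

urls = re.findall(r'(http[s]?:[^"\s]*)', text)
print(urls)
```

The image URL is included and the relative href is not, which may or may not be what you want.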

【Comments】:

【Solution 5】:

You can try the match method from the re module, then use group to select the captured part of the match:

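    import re

    str1 = '''<a href="http://something" title="Development of the Python language and website">Core Development</a>'''
    # [^>]* stays inside the tag and [^"]* stops at the closing quote of href
    pattern = re.compile(r'<a[^>]*\shref="([^"]*)"')
    m = pattern.match(str1)  # match() only matches at the start of the string
    if m:  # match() returns None when the pattern does not match
        print(m.group(1))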

【Comments】: