【Question Title】: regex, finding all "href" in <a> tags [duplicate]
【Posted】: 2014-01-17 10:42:39
【Question】:

I have a regular expression that searches for the "href" attribute in <a> tags, but at the moment it does not give me what I want:

<a[^>]* href="([^"]*)"

Given this input:

<a href="http://something" title="Development of the Python language and website">Core Development</a>

it matches this part:

<a href="http://something"

but I only need to find:

http://something

【Comments】:

Tags: python regex


【Solution 1】:

This seems to work for me? You can check the working demo yourself.

matches = re.findall(r'<a[^>]* href="([^"]*)"', html)
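For reference, a small sketch of why this call already returns only the URL: when a pattern contains exactly one capture group, `re.findall` returns the text of that group instead of the whole match. The `html` string below is assumed for illustration and is taken from the line in the question.

```python
import re

# Sample input, assumed for illustration (the line from the question).
html = ('<a href="http://something" '
        'title="Development of the Python language and website">'
        'Core Development</a>')

pattern = r'<a[^>]* href="([^"]*)"'

# With one capture group, findall returns only the captured text.
matches = re.findall(pattern, html)
print(matches)

# The whole match and the group are both available on a match object.
m = re.search(pattern, html)
print(m.group(0))  # full match, including '<a href="'
print(m.group(1))  # only the captured URL
```

So the pattern from the question is usable as it stands; the difference lies in reading the capture group rather than the full match.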

That said, I would use Beautiful Soup for this instead...

from bs4 import BeautifulSoup

html = '''
<a href="http://something" title="Development of the Python language and website">Core Development</a>
<a href="http://something.com" title="Development of the Python language and website">Core Development</a>
'''

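# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, 'html.parser')

# href=True keeps only the <a> tags that actually have an href attribute.
for a in soup.find_all('a', href=True):
    print(a['href'])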

Note: if you are using an older version of Beautiful Soup, you can use the following instead:

for a in soup.findAll('a', href=True):

【Comments】:

【Solution 2】:

Try this:

re.findall(r'(?<=<a href=")[^"]*',yourStr)
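One caveat worth checking: the lookbehind requires `href` to be the first attribute, directly after `<a `, so links that put another attribute first are skipped. A minimal sketch, with both sample strings assumed for illustration:

```python
import re

# Sample inputs, assumed for illustration.
href_first = '<a href="http://something" title="t">x</a>'
href_later = '<a class="c" href="http://other">y</a>'

lookbehind = r'(?<=<a href=")[^"]*'

found_first = re.findall(lookbehind, href_first)
found_later = re.findall(lookbehind, href_later)

print(found_first)  # the URL is found when href comes first
print(found_later)  # nothing is found when another attribute precedes href
```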

【Comments】:

    【Solution 3】:

    There is no need to reinvent the wheel; you can use http://www.crummy.com/software/BeautifulSoup/

    $ sudo pip install beautifulsoup4
    $ python
    >>> html_doc = """
    ... <html><head><title>The Dormouse's story</title></head>
    ... <body>
    ... <p class="title"><b>The Dormouse's story</b></p>
    ... 
    ... <p class="story">Once upon a time there were three little sisters; and their names were
    ... <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    ... <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    ... <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    ... and they lived at the bottom of a well.</p>
    ... 
    ... <p class="story">...</p>
    ... """
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(html_doc, 'html.parser')
    >>> href = [i.get('href') for i in soup.find_all('a')]
    >>> href
    ['http://example.com/elsie', 'http://example.com/lacie', 'http://example.com/tillie']
    

    If you would rather not install the beautifulsoup4 package, you can download the older version from http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.1.tar.gz

    $ wget http://www.crummy.com/software/BeautifulSoup/bs3/download/3.x/BeautifulSoup-3.2.1.tar.gz
    $ tar xvzf BeautifulSoup-3.2.1.tar.gz
    $ cp BeautifulSoup-3.2.1/BeautifulSoup.py .
    $ python
    >>> import BeautifulSoup
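If the goal is to avoid installing anything at all, the standard library's `html.parser` module can collect the links as well. This is a swapped-in technique, not what the answer above uses; the class name `HrefCollector` and the shortened `html_doc` are hypothetical and chosen only for this sketch, which assumes Python 3.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


# Shortened sample document, assumed for illustration.
html_doc = '''
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
'''

collector = HrefCollector()
collector.feed(html_doc)
print(collector.hrefs)
```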
    

    【Comments】:

      【Solution 4】:

      You can also use (http[s]?:[^"\s]*)
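Note that this pattern matches any http or https URL up to the next quote or whitespace, whichever tag or attribute it appears in. A short sketch, with a sample string assumed for illustration:

```python
import re

# Sample input, assumed for illustration: one link and one image.
html = ('<a href="http://something">x</a> '
        '<img src="https://example.com/logo.png">')

urls = re.findall(r'(http[s]?:[^"\s]*)', html)

# Both URLs are returned, including the one from the <img> tag.
print(urls)
```

This is convenient when every URL on the page is wanted, but it does not restrict the results to `href` attributes of `<a>` tags.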

      【Comments】:

        【Solution 5】:

        You can try the match method from the re module, then use a group to select the matched part

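            import re

            str1 = '''<a href="http://something" title="Development of the Python language and website">Core Development</a>'''
            # Non-greedy groups stop at the first closing quote instead of running on to a later one.
            pattern = re.compile(r'<a.*?href="(.*?)"')
            # match() only succeeds when the pattern matches at the start of the string.
            m = pattern.match(str1)
            match = m.group(1)
            print(match)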
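One thing to keep in mind with this approach: `match` only looks at the beginning of the string and returns `None` otherwise, in which case calling `.group(1)` raises `AttributeError`. A minimal sketch of the difference from `search`, with the `text` value and the slightly stricter pattern assumed for illustration:

```python
import re

# Pattern and input are assumptions for this sketch.
pattern = re.compile(r'<a[^>]*href="([^"]*)"')
text = '<p>See <a href="http://something" title="t">Core Development</a></p>'

at_start = pattern.match(text)   # text starts with <p>, so this is None
anywhere = pattern.search(text)  # search scans the whole string

print(at_start)
if anywhere is not None:
    print(anywhere.group(1))
```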
        

        【Comments】:
