【Question title】: How to extract all the hrefs and titles from several `<a href="" title="">` tags?
【Posted】: 2017-04-07 19:11:51
【Question】:

Given this file:

<a data-parent="#accordion1" data-toggle="collapse" href="# fruitName1" title="Click to expand drug name">
<span class="list-unstyled" style="text-decoration: none;"></span> GLIPIZIDE 
         </a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;ApplNo=114223" title="Click to view LEMONS (LEMONS) | POQ  #114223 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 1 ">
                              LEMONS (LEMONS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;ApplNo=114226" title="Click to view LEMONS (LEMONS) | POQ  #114226 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 2 ">
                              LEMONS (LEMONS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;ApplNo=114305" title="Click to view LEMONS (LEMONS) | POQ  #114305 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 3 ">
                              LEMONS (LEMONS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;ApplNo=114370" title="Click to view LEMONS (LEMONS) | POQ  #114370 | BOX;67 PZ | Discontinued | FRUIT COMPANY 1 ">
                              LEMONS (LEMONS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;ApplNo=114378" title="Click to view LEMONS (LEMONS) | POQ  #114378 | BOX;67 PZ | Discontinued | FRUIT COMPANY 4 ">
                              LEMONS (LEMONS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;ApplNo=114387" title="Click to view LEMONS (LEMONS) | POQ  #114387 | BOX;67 PZ | Discontinued | FRUIT COMPANY 5 ">
                              LEMONS (LEMONS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;ApplNo=114438" title="Click to view LEMONS (LEMONS) | POQ  #114438 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 2 ">
                              LEMONS (LEMONS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;ApplNo=114497" title="Click to view LEMONS (LEMONS) | POQ  #114497 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 5 ">
                              LEMONS (LEMONS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;ApplNo=114542" title="Click to view LEMONS (LEMONS) | POQ  #114542 | BOX;67 PZ | Discontinued | FRUIT COMPANY 3 ">
                              LEMONS (LEMONS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;ApplNo=114550" title="Click to view LEMONS (LEMONS) | POQ  #114550 | 
         </a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;ApplNo=117270" title="Click to view GRAPES (GREEN GRAPES ; AUS) | POQ  #117270 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 10  ">
                              GRAPES (GREEN GRAPES ; AUS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;ApplNo=117511" title="Click to view GRAPES (GREEN GRAPES ; AUS) | POQ  #117511 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 11 ">
                              GRAPES (GREEN GRAPES ; AUS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;ApplNo=117620" title="Click to view GRAPES (GREEN GRAPES ; AUS) | POQ  #117620 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 12 ">

Using a regular expression or Beautiful Soup, how can I extract every <a href="" title=""> tag and prepend www.example.com to each href, like this?

www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&amp;ApplNo=114223 |  title= | Click to view LEMONS (LEMONS) | POQ  #114223 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 1 | LEMONS (LEMONS)
www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&amp;ApplNo=114226 |  title= | Click to view LEMONS (LEMONS) | POQ  #114226 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 2 | LEMONS (LEMONS)
www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&amp;ApplNo=114305 |  title= | Click to view LEMONS (LEMONS) | POQ  #114305 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 3 | LEMONS (LEMONS)
www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&amp;ApplNo=114370 |  title= | Click to view LEMONS (LEMONS) | POQ  #114370 | BOX;67 PZ | Discontinued | FRUIT COMPANY 1 | LEMONS (LEMONS)

I tried:

for a in soup.tbody.findAll('a', href=True):
    r = re.compile('(?<=href=").*?(?=")')
    r.findall(str(a))

And also:

for a in soup.tbody.findAll('a', href=True):
    print (a.find('a')['href'])
    print (a.find('a')['title'])

However, I don't know how to rearrange the title and href.

Update:

Based on odradek's answer, I tried this:

soup = BeautifulSoup(open('file.htm'), 'lxml')
for a in soup.tbody.findAll('a', href=True):
    html = a
    PREFIX = 'www.example.com'
    template = '{prefix}{url} | {title}'.format
    links = [template(prefix=PREFIX, url=e['href'], title=e['title']) for e in html.find_all('a', href=True)]
    print(links)

But I got:

[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]

【Comments on the question】:

    Tags: python regex python-3.x beautifulsoup


    【Solution 1】:

    You can use BeautifulSoup's parsing methods instead of a complicated regular expression:

    # this is the url you want to add at the beginning
    PREFIX = 'www.example.com'
    
    # the template of your desired output
    template = '{prefix}{url} | {title}'.format
    
    # the resulting list; note that the "html" variable must be the
    # BeautifulSoup object parsed from the given source code.
    links = [template(prefix=PREFIX, url=e.get('href'), title=e.get('title'))
             for e in html.find_all('a', href=True)]
    

    When run against two of the a tags from your list:

    $ python get_all_a.py
    www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=117511 | Click to view GRAPES (GREEN GRAPES ; AUS) | POQ  #117511 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 11 
    www.example.com/loads/data/usersindex.cfm?event=overview.subprocess&ApplNo=117620 | Click to view GRAPES (GREEN GRAPES ; AUS) | POQ  #117620 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 12 
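
    The question's desired output also ends with the link text, which the template above does not include. A minimal, self-contained sketch of one way to add it follows. It assumes the markup is held in a string (the inline SAMPLE stands in for file.htm), and extract_links is a hypothetical helper name, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup

# Two <a> tags copied from the question; this string stands in for file.htm.
SAMPLE = '''
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;ApplNo=114223" title="Click to view LEMONS (LEMONS) | POQ  #114223 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 1 ">
                              LEMONS (LEMONS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;ApplNo=114226" title="Click to view LEMONS (LEMONS) | POQ  #114226 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 2 ">
                              LEMONS (LEMONS)</a>
'''

PREFIX = 'www.example.com'


def extract_links(markup, prefix=PREFIX):
    """Return one 'url | title | text' string per <a> tag that has an href."""
    soup = BeautifulSoup(markup, 'html.parser')
    rows = []
    for a in soup.find_all('a', href=True):
        title = (a.get('title') or '').strip()  # tolerate a missing title
        text = a.get_text(strip=True)           # visible link text
        rows.append('{}{} | {} | {}'.format(prefix, a['href'], title, text))
    return rows


links = extract_links(SAMPLE)
for line in links:
    print(line)
```

    Note that the parser decodes &amp; in attribute values, so the printed URLs contain a plain &.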
    

    Regarding your update: you should not put this code inside a for loop. Instead:

    html = BeautifulSoup(open('file.htm'), 'html.parser')
    
    PREFIX = 'www.example.com'
    
    template = '{prefix}{url} | {title}'.format
    
    # inside this list comprehension is your for loop implied
    links = [template(prefix=PREFIX, url=e.get('href'), title=e.get('title'))
             for e in html.find_all('a', href=True)]
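
    A likely reason the looped version printed [] once per tag: find_all() searches only the descendants of the tag it is called on, and the a tags in the question contain no nested a tags. The sketch below illustrates this with a small made-up markup string (not the question's file):

```python
from bs4 import BeautifulSoup

# Hypothetical two-link sample, used only to show the difference.
markup = '<div><a href="/x" title="first">one</a><a href="/y" title="second">two</a></div>'
soup = BeautifulSoup(markup, 'html.parser')

# Pattern from the update: call find_all() on each <a> tag. It looks only
# at that tag's descendants, so every result is empty.
inner = [a.find_all('a', href=True) for a in soup.find_all('a', href=True)]
print(inner)

# Searching the whole document once yields the tags themselves.
outer = [(a['href'], a['title']) for a in soup.find_all('a', href=True)]
print(outer)
```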
    

    【Comments】:

    • I tried soup = BeautifulSoup(open('/file.htm'), 'lxml') PREFIX = 'www.example.com' template = '{prefix}{url} | {title}'.format links = [template(prefix=PREFIX, url=e['href'], title=e['title']) for e in html.find_all('a', href=True)] print(links) and got an empty list: []
    • That is not exactly my code. You are using the lxml parser, whereas I am using html.parser. Also, you are loading the bs4.BeautifulSoup object into the soup variable while iterating over html in the list comprehension.
    • Sorry... I tried again with the code above and got: --------------------------------------------------------------------------- KeyError Traceback (most recent call last) <ipython-input-10-cf84682c681c> in <module>() 9 # inside this list comprehension is your for loop implied 10 links = [template(prefix=PREFIX, url=e['href'], title=e['title']) ---> 11 for e in html.find_all('a', href=True)]
    • OK, I updated the code so it no longer fails with a KeyError, but I think something is wrong with the source you are loading. Either you are loading a different file (note the leading / in open(/file.htm), which on Linux means the root directory), or the file you loaded does not contain that code sample. I just ran my code against all the a tags you posted (saved in a file named file.htm) and it ran without problems.
    • Thanks for your help and patience! I'll check it!
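
    The KeyError in the comments above is what subscript access (e['title']) raises when a tag lacks that attribute, which is why the answer switched to e.get('title'). A minimal sketch, assuming a made-up tag with no title attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: an <a> tag with an href but no title attribute.
tag = BeautifulSoup('<a href="/x">no title here</a>', 'html.parser').a

try:
    tag['title']  # subscript access raises KeyError when the attribute is absent
    raised = False
except KeyError:
    raised = True

missing = tag.get('title')                 # returns None instead of raising
fallback = tag.get('title', '(no title)')  # optional default, as with dict.get
print(raised, missing, fallback)
```

    With .get(), a missing title appears as None in the formatted output rather than stopping the script.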
    【Solution 2】:

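    This is not a job for regular expressions. You can use BeautifulSoup as shown in odradek's answer, or my favourite alternative, lxml, which in my opinion leads to more readable code: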

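    from lxml import etree
    
    # "html" is a string holding the markup; etree.fromstring uses the XML
    # parser, so the input must be well-formed with a single root element
    tree = etree.fromstring(html)
    # the XPath expression '//a' selects every <a> element in the tree
    for element in tree.xpath('//a'):
        print('www.example.com' + element.get('href'))
        print('title: ' + element.get('title'))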
    

    【Comments】:

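    • To parse from a file, use etree.parse('../file.htm') instead of etree.fromstring(html).
    • Thanks for the help! Instead of printing them, how can I append them to a single list?
    • You're welcome. Please mark one of the answers as the solution. If you have further questions, you can ask a new question. However, I strongly recommend reading some basic tutorials first. (Hint: tutorialspoint.com/python/python_lists.htm)
    • Thanks for the reference.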
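
    For the follow-up about collecting results in a single list, here is a rough sketch. It swaps in lxml.html (lxml's HTML parser) for etree.fromstring, on the assumption that the file is HTML rather than well-formed XML. collect_links is a hypothetical helper name, and the wrapping div is added only to give the sample a single root:

```python
from lxml import html as lxml_html

# Two <a> tags copied from the question, wrapped in a <div> for this sketch.
SAMPLE = '''<div>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;ApplNo=117270" title="Click to view GRAPES (GREEN GRAPES ; AUS) | POQ  #117270 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 10  ">
                              GRAPES (GREEN GRAPES ; AUS)</a>
<a href="/loads/data/usersindex.cfm?event=overview.subprocess&amp;ApplNo=117511" title="Click to view GRAPES (GREEN GRAPES ; AUS) | POQ  #117511 | BOX;67 PZ | PRESENTATION | FRUIT COMPANY 11 ">
                              GRAPES (GREEN GRAPES ; AUS)</a>
</div>'''

PREFIX = 'www.example.com'


def collect_links(markup, prefix=PREFIX):
    """Return a list of (url, title) tuples, one per <a> element with an href."""
    tree = lxml_html.fromstring(markup)
    results = []
    for element in tree.xpath('.//a[@href]'):
        href = element.get('href')
        title = (element.get('title') or '').strip()  # tolerate a missing title
        results.append((prefix + href, title))
    return results


results = collect_links(SAMPLE)
for url, title in results:
    print(url, '|', title)
```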