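【Question title】: Finding links with beautifulsoup in Python
【Posted】: 2019-01-20 21:54:20
【Question description】:

I'm having a hard time extracting hyperlinks from a page with BeautifulSoup. I have tried many different tags and classes, but I can't seem to get the links without a whole pile of other HTML that I don't want. Can anyone tell me where I'm going wrong? The code is below: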

from bs4 import BeautifulSoup
import requests

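# URL of the page to scrape (the one used in the answer below)
page_link = 'https://hopkinsonmossman.com/exhibitions/past/'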

page_response = requests.get(page_link, timeout=5)

soup = BeautifulSoup(page_response.content, "html.parser")

pagecode = soup.find(class_='infinite-scroll-container')

title = pagecode.findAll('i')
artist = pagecode.find_all('h1', "exhibition-title")
links = pagecode.find_all('article', "teaser infinite-scroll-item")


printcount=0
while printcount < len(title):  
    titlestring = title[printcount].text  
    artiststring = artist[printcount].text
    artiststring = artiststring.replace(titlestring, '')
    artiststring = artiststring.strip()
    titlestring = titlestring.strip()
    print(artiststring)
    print(titlestring)
    print("----------------------------------------")
    printcount = printcount+1

【Comments】:

    Tags: python python-3.x beautifulsoup web-crawler


    【Solution 1】:

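    You can target every link on the page directly and then filter for the ones that sit inside an article. Note that this page only loads fully as you scroll, so you may need Selenium to get all of the links. For now, I'll answer how to filter the links.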

    from bs4 import BeautifulSoup
    import requests
    import re

    page_link = 'https://hopkinsonmossman.com/exhibitions/past/'
    page_response = requests.get(page_link, timeout=5)
    soup = BeautifulSoup(page_response.content, "html.parser")
    links = soup.find_all('a')
    for link in links:
        if link.parent.name == 'article':  # keep only links directly inside an <article>
            # collapse runs of two or more whitespace characters into one space
            print(re.sub(r"\s\s+", " ", link.text).strip())
            print(link['href'])
            print()
    

    Output

    Nicola Farquhar A Holotype Heart 22 Nov – 21 Dec 2018 Wellington
    https://hopkinsonmossman.com/exhibitions/nicola-farquhar-5/
    
    Bill Culbert Desk Lamp, Crash 19 Oct – 17 Nov 2018 Wellington
    https://hopkinsonmossman.com/exhibitions/bill-culbert-2/
    
    Nick Austin, Ammon Ngakuru Many Happy Returns 18 Oct – 15 Nov 2018 Auckland
    https://hopkinsonmossman.com/exhibitions/nick-austin-ammon-ngakuru/
    
    Dane Mitchell Tuning 13 Sep – 13 Oct 2018 Wellington
    https://hopkinsonmossman.com/exhibitions/dane-mitchell-4/
    
    Shannon Te Ao my life as a tunnel 08 Sep – 13 Oct 2018 Auckland
    https://hopkinsonmossman.com/exhibitions/shannon-te-ao/
    
    Tilt Anoushka Akel, Ruth Buchanan, Meg Porteous 16 Aug – 08 Sep 2018 Wellington
    https://hopkinsonmossman.com/exhibitions/anoushka-akel-ruth-buchanan-meg-porteous/
    
    Shadow Work Fiona Connor, Oliver Perkins 02 Aug – 01 Sep 2018 Auckland
    https://hopkinsonmossman.com/exhibitions/group-show/
    
    Emma McIntyre Rose on red 13 Jul – 11 Aug 2018 Wellington
    https://hopkinsonmossman.com/exhibitions/emma-mcintyre-2/
    
    Tahi Moore Incomprehensible public fictions: Writer fights politician in car park 04 Jul – 28 Jul 2018 Auckland
    https://hopkinsonmossman.com/exhibitions/tahi-moore-2/
    
    Oliver Perkins Bleeding Edge 01 Jun – 07 Jul 2018 Wellington
    https://hopkinsonmossman.com/exhibitions/oliver-perkins-2/
    
    Spinning Phillip Lai, Peter Robinson 19 May – 23 Jun 2018 Auckland
    https://hopkinsonmossman.com/exhibitions/1437/
    
    Milli Jannides Cavewoman 19 Apr – 26 May 2018 Wellington
    https://hopkinsonmossman.com/exhibitions/milli-jannides/
    
    Oscar Enberg Taste & Power, a prologue 06 Apr – 12 May 2018 Auckland
    https://hopkinsonmossman.com/exhibitions/oscar-enberg/
    
    Fiona Connor Closed Down Clubs & Monochromes 09 Mar – 14 Apr 2018 Wellington
    https://hopkinsonmossman.com/exhibitions/closed-down-clubs-and-monochromes/
    
    Bill Culbert Colour Theory, Window Mobile 02 Mar – 29 Mar 2018 Auckland
    https://hopkinsonmossman.com/exhibitions/colour-theory-window-mobile/
    
    Role Models Curated by Rob McKenzie
    Robert Bittenbender, Ellen Cantor, Jennifer McCamley, Josef Strau 26 Jan – 24 Feb 2018 Auckland
    https://hopkinsonmossman.com/exhibitions/role-models/
    
    Emma McIntyre Pink Square Sways 24 Nov – 23 Dec 2017 Auckland
    https://hopkinsonmossman.com/exhibitions/emma-mcintyre/
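
The parent check in the loop above can also be pushed into the query itself. The following is a minimal sketch, not the answerer's code: `select("article > a")` is BeautifulSoup's CSS-selector call and matches only `<a>` tags that are direct children of an `<article>`. The inline HTML and the helper name `article_links` are made up for illustration and are only assumed to resemble the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page (an assumption).
SAMPLE_HTML = """
<header><a class="ajax-link" href="/">HOPKINSON MOSSMAN</a></header>
<div class="infinite-scroll-container">
  <article class="teaser infinite-scroll-item">
    <a class="ajax-link" href="/exhibitions/first/">
      <h1 class="exhibition-title">Artist One <i>First   Show</i></h1>
    </a>
  </article>
  <article class="teaser infinite-scroll-item">
    <a class="ajax-link" href="/exhibitions/second/">
      <h1 class="exhibition-title">Artist Two <i>Second Show</i></h1>
    </a>
  </article>
</div>
"""


def article_links(html):
    """Return (text, href) pairs for <a> tags directly inside an <article>."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for link in soup.select("article > a"):
        # split()/join collapses every run of whitespace, including single
        # newlines, which is slightly stricter than the regex used above
        text = " ".join(link.get_text().split())
        # .get() returns None instead of raising if href is missing
        pairs.append((text, link.get("href")))
    return pairs


for text, href in article_links(SAMPLE_HTML):
    print(text)
    print(href)
    print()
```

Because the header link is not inside an `<article>`, it is excluded without any extra filtering step.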
    

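    My first idea was to use the "ajax-link" class, but it turns out the "HOPKINSON MOSSMAN" link has that class too. You can still take that approach and skip the first link returned by find_all, which gives the same result.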

    from bs4 import BeautifulSoup
    import requests
    import re

    page_link = 'https://hopkinsonmossman.com/exhibitions/past/'
    page_response = requests.get(page_link, timeout=5)
    soup = BeautifulSoup(page_response.content, "html.parser")
    links = soup.find_all('a', class_='ajax-link')
    for link in links[1:]:  # skip the first match, the "HOPKINSON MOSSMAN" link
        # collapse runs of two or more whitespace characters into one space
        print(re.sub(r"\s\s+", " ", link.text).strip())
        print(link['href'])
        print()
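
The answer notes that the page only loads fully on scroll and that Selenium may be needed to get every link. A rough sketch of that idea follows, under stated assumptions: the helper names `scroll_to_bottom` and `fetch_full_page` are hypothetical, and stopping when `document.body.scrollHeight` no longer grows is a common heuristic that may or may not match how this particular site triggers its infinite scroll. `execute_script`, `page_source`, `get`, `quit`, and `webdriver.Chrome` are standard Selenium WebDriver calls.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll until the page height stops growing, then return the HTML.

    `driver` is expected to be a Selenium WebDriver; only execute_script()
    and page_source are used here.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to append more items
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, assume the list is complete
        last_height = new_height
    return driver.page_source


def fetch_full_page(url):
    """Open `url` in Chrome via Selenium and return the fully scrolled HTML."""
    # Imported lazily: needs the selenium package and a local Chrome install.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return scroll_to_bottom(driver)
    finally:
        driver.quit()
```

The string returned by `fetch_full_page(page_link)` could then be passed to `BeautifulSoup(html, "html.parser")` in place of `page_response.content`, with the same link filtering applied afterwards.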
    

    【Comments】:
