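【Question title】: Extracting image links using BeautifulSoup
【Posted】: 2019-12-30 05:59:54
【Question】:

I am trying to extract image links from the Game of Thrones wiki page. The first two links work, but the last two give me a 404 error code. I am trying to figure out what I am doing wrong.

I have tried different combinations to get the right links.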

import requests
from bs4 import BeautifulSoup
import urllib
import urllib.request as request
import re
url = 'https://en.wikipedia.org/w/index.php' + \
'?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
soup = BeautifulSoup(html_contents, 'html.parser')
# Find all a tags in the soup 
for a in soup.find_all('a'):
    # While looping through the text if you find img in 'a' tag
    # Then print the src attribute
    if a.img: 
        print('http:/'+a.img['src'])
# And here are the images on the page

http:///upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png

http:///upload.wikimedia.org/wikipedia/commons/thumb/2/2e/Game_of_Thrones_2011_logo.svg/300px-Game_of_Thrones_2011_logo.svg.png

http://static/images/wikimedia-button.png

http://static/images/poweredby_mediawiki_88x31.png

The first two links work,

but I would like the last two links to work as well.

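【Comments】:

  • These links give me a 404 in a web browser too. How did you get these links? Maybe they need some headers in the request, e.g. Referer or User-Agent.
  • The URLs are relative. You have to add https://en.wikipedia.org/ at the beginning to get the full URL, for example https://en.wikipedia.org/static/images/wikimedia-button.png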
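
One way to see what the second comment means is to parse the URLs printed in the question with the standard library. This is only a minimal sketch: the `printed` list is a hand-copied sample of that output, and the variable names are made up for illustration.

```python
from urllib.parse import urlsplit

# Two of the URLs produced by the question's code ('http:/' + src).
printed = [
    "http:///upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png",
    "http://static/images/wikimedia-button.png",
]

# netloc is the host part that a client would try to connect to.
hosts = [urlsplit(u).netloc for u in printed]

for u, host in zip(printed, hosts):
    print(repr(host), "<-", u)
```

The first URL parses with an empty host (the real host is pushed into the path, which some clients seem to tolerate), while the second parses with `static` as its host. Neither contains en.wikipedia.org, so the domain has to be supplied from the page URL.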

Tags: python python-3.x beautifulsoup jupyter-notebook


【Answer 1】:

Thanks for the help. I kept it simple. This worked for me:

# Find all a tags in the soup 
for a in soup.find_all('a'):
    # While looping through the text if you find img in 'a' tag
    # Then print the src attribute
    if a.img:
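        src = a.img['src']
        if src.startswith('//'):
            # Protocol-relative URL: only the scheme is missing
            print('https:' + src)
        else:
            # Root-relative URL: src already starts with '/', so the
            # domain is added without a trailing slash
            print('https://en.wikipedia.org' + src)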
# And here are the images on the page

【Comments】:

    【Answer 2】:

    These URLs start with /, so they have no domain. You have to add https://en.wikipedia.org to get the full URL, for example https://en.wikipedia.org/static/images/wikimedia-button.png

    More or less:

    import requests
    from bs4 import BeautifulSoup
    
    url = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
    
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    
    for a in soup.find_all('a'):
        if a.img:
            src = a.img['src']
            if src.startswith('http'):
                print(src)
            elif src.startswith('//'):
                print('https:' + src)
            elif src.startswith('/'):
                print('https://en.wikipedia.org' + src)
            else:
                print('https://en.wikipedia.org/w/' + src)
    

    EDIT: You can also use urllib.parse.urljoin()

    import requests
    from bs4 import BeautifulSoup
    import urllib.parse
    
    url = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
    
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    
    for a in soup.find_all('a'):
        if a.img:
            src = a.img['src']
            print(urllib.parse.urljoin('https://en.wikipedia.org', src))
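
To check what urljoin() does with each kind of src without touching the network, a small sketch like the following may help. The first two sample values come from the question; `thumb.png` and the already absolute URL are hypothetical examples added for illustration. Using the full page URL as the base is an assumption on my part, not what the code above does; it makes a path with no leading slash resolve under /w/, like the last branch of the if/elif version.

```python
from urllib.parse import urljoin

# Page URL from the question, used here as the base for resolution.
base = ("https://en.wikipedia.org/w/index.php"
        "?title=List_of_Game_of_Thrones_episodes&oldid=802553687")

samples = [
    # protocol-relative: only the scheme is taken from the base
    "//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png",
    # root-relative: scheme and host are taken from the base
    "/static/images/wikimedia-button.png",
    # hypothetical path without a leading slash: resolved next to index.php
    "thumb.png",
    # hypothetical absolute URL: returned unchanged
    "https://upload.wikimedia.org/x.png",
]

resolved = [urljoin(base, s) for s in samples]

for src, full in zip(samples, resolved):
    print(src, "->", full)
```

Because urljoin() handles all four cases, the chain of startswith() checks is not needed when the base URL is known.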
    

    【Comments】:
