Extracting image links using BeautifulSoup

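【Question title】: Extracting image links using BeautifulSoup
【Posted】: 2019-12-30 05:59:54
【Question】:

I'm trying to extract image links from the GoT wiki page. The first two links work, but the last two give me a 404 error. I'm trying to figure out what I'm doing wrong.

I've tried different combinations to build the correct link.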

import requests
from bs4 import BeautifulSoup
import urllib
import urllib.request as request
import re

url = 'https://en.wikipedia.org/w/index.php' + \
'?title=List_of_Game_of_Thrones_episodes&oldid=802553687'

r = requests.get(url)
html_contents = r.text
soup = BeautifulSoup(html_contents, 'html.parser')

# Find all a tags in the soup 
for a in soup.find_all('a'):
    # While looping through the text if you find img in 'a' tag
    # Then print the src attribute
    if a.img: 
        print('http:/'+a.img['src'])
# And here are the images on the page

http:///upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png

http:///upload.wikimedia.org/wikipedia/commons/thumb/2/2e/Game_of_Thrones_2011_logo.svg/300px-Game_of_Thrones_2011_logo.svg.png

http://static/images/wikimedia-button.png

http://static/images/poweredby_mediawiki_88x31.png

The first two links work,

but I'd like the last two links to work as well.

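【Comments】:

These links give me a 404 in a web browser too. How did you get them? Maybe they need some headers in the request, i.e. Referer or User-Agent.
The URLs are relative: you have to add https://en.wikipedia.org/ at the beginning to get the full URL, e.g. https://en.wikipedia.org/static/images/wikimedia-button.png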
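
To make the "relative URL" diagnosis concrete, here is a minimal sketch that inspects two of the `src` values from the question with the standard library's `urllib.parse.urlsplit`. The helper name `describe_src` and its category labels are hypothetical, chosen only for illustration. It also shows why prepending `'http:/'` fails for the last two links: the first path segment, `static`, ends up being parsed as the host name.

```python
from urllib.parse import urlsplit


def describe_src(src):
    """Classify an img src by which URL parts it already carries.

    Hypothetical helper for illustration; the labels are informal.
    """
    parts = urlsplit(src)
    if parts.scheme and parts.netloc:
        return 'absolute'            # e.g. https://host/path
    if parts.netloc:
        return 'protocol-relative'   # e.g. //host/path (scheme missing)
    if parts.path.startswith('/'):
        return 'root-relative'       # e.g. /path (scheme and host missing)
    return 'path-relative'           # e.g. path (relative to the page's directory)


# Two src values taken from the output in the question
samples = [
    '//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png',
    '/static/images/wikimedia-button.png',
]
for src in samples:
    print(describe_src(src), src)

# The question's 'http:/' + src applied to a root-relative src:
# the result is http://static/images/..., so 'static' becomes the host.
broken = urlsplit('http:/' + samples[1])
print(broken.netloc)
```

A root-relative `src` therefore needs both the scheme and the host of the page it came from, while a protocol-relative one only needs the scheme.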

Tags: python python-3.x beautifulsoup jupyter-notebook

【Solution 1】:

Thanks for the help. I kept it simple. This worked for me:

# Find all a tags in the soup 
for a in soup.find_all('a'):
    # While looping through the text if you find img in 'a' tag
    # Then print the src attribute
    if a.img:
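        src = a.img['src']
        if src.startswith('//'):
            # Protocol-relative URL: only the scheme is missing
            print('https:' + src)
        else:
            # Root-relative URL on this page: src already starts with '/',
            # so the host is added without a trailing slash
            print('https://en.wikipedia.org' + src)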
# And here are the images on the page

【Discussion】:

【Solution 2】:

These URLs start with /, so they have no domain. You have to add https://en.wikipedia.org to get the full URL, e.g. https://en.wikipedia.org/static/images/wikimedia-button.png

More or less:

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

for a in soup.find_all('a'):
    if a.img:
        src = a.img['src']
        if src.startswith('http'):
            print(src)
        elif src.startswith('//'):
            print('https:' + src)
        elif src.startswith('/'):
            print('https://en.wikipedia.org' + src)
        else:
            print('https://en.wikipedia.org/w/' + src)

Edit: you can also use urllib.parse.urljoin()

import requests
from bs4 import BeautifulSoup
import urllib.parse

url = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

for a in soup.find_all('a'):
    if a.img:
        src = a.img['src']
        print(urllib.parse.urljoin('https://en.wikipedia.org', src))
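
As an offline check of the `urljoin()` approach, the sketch below resolves sample `src` values without fetching the page. It is a variation rather than the code above: it assumes the page URL itself is used as the base, so that a path-relative `src` resolves against `/w/` the same way the `else` branch of the earlier version does. The helper name `absolutize`, the `https://example.org/a.png` value, and the `skins/logo.png` value are hypothetical; the other two `src` values come from the question's output.

```python
from urllib.parse import urljoin

# Page URL from the question, used here as the base for resolution
PAGE_URL = ('https://en.wikipedia.org/w/index.php'
            '?title=List_of_Game_of_Thrones_episodes&oldid=802553687')


def absolutize(src, base=PAGE_URL):
    """Resolve an img src against the URL of the page it was found on."""
    return urljoin(base, src)


srcs = [
    # protocol-relative: inherits the scheme from the base
    '//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png',
    # root-relative: inherits the scheme and host from the base
    '/static/images/wikimedia-button.png',
    # already absolute (hypothetical value): returned unchanged
    'https://example.org/a.png',
    # path-relative (hypothetical value): resolved against the base's directory, /w/
    'skins/logo.png',
]

resolved = [absolutize(src) for src in srcs]
for full_url in resolved:
    print(full_url)
```

Using the page URL as the base means the scheme, host, and directory all come from one place, so no per-prefix branching is needed.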

【Discussion】: