Get image data-src with Beautiful Soup when there is no image extension

【Posted】: 2022-01-05 02:49:30
【Question】:

I am trying to get the image URLs of all the books on this page, https://www.nb.co.za/en/books/0-6-years, with Beautiful Soup.

Here is my code:

from bs4 import BeautifulSoup
import requests

baseurl = "https://www.nb.co.za/"
productlinks = []

r = requests.get(f'https://www.nb.co.za/en/books/0-6-years')
soup = BeautifulSoup(r.content, 'lxml')
productlist = soup.find_all('div', class_="book-slider-frame")

def my_filter(tag):
    return (tag.name == 'a' and
        tag.parent.name == 'div' and
        'img-container' in tag.parent['class'])

for item in productlist:
    for link in item.find_all(my_filter, href=True):
        productlinks.append(baseurl + link['href'])

        cover = soup.find_all('div', class_="img-container")
        print(cover)

Here is my output:

<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>

What I hope to get:

https://www.nb.co.za/en/helper/ReadImage/25929.jpg

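My questions are:

How do I get only the data-src?
How do I get the image's extension?

【Comments】:

Tags: python web-scraping beautifulsoup

【Solution 1】:

1: How do I get only the data-src?

You can access data-src with element['data-src']: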

cover = baseurl+item.img['data-src'] if item.img['src'] != nocover else baseurl+nocover

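2: How do I get the image's extension?

You can determine the file's extension the way diggusbickus describes (a good approach), but it will not help if you then try to request the file as https://www.nb.co.za/en/helper/ReadImage/25929.jpg; that results in a 404 error.

The image is loaded/served dynamically. Additional information -> https://stackoverflow.com/a/5110673/14460824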
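Because the URL itself carries no extension, one other place to look is the Content-Type header of the HTTP response. The following is only a sketch: extension_from_content_type is a hypothetical helper, and it assumes the server sends an accurate Content-Type header, which nothing in this thread confirms for this site.

```python
import mimetypes
from typing import Optional


def extension_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Map a Content-Type header value to a file extension (hypothetical helper).

    Returns None when the header is missing or the MIME type is unknown.
    """
    if not content_type:
        return None
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    mime_type = content_type.split(";")[0].strip().lower()
    return mimetypes.guess_extension(mime_type)


print(extension_from_content_type("image/png"))
print(extension_from_content_type(None))
```

With requests, the header value would come from response.headers.get("Content-Type"). The extension is only useful for naming a saved file; appending it to the request URL still gives the 404 described above.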

Example

baseurl = "https://www.nb.co.za/"
nocover = '/Content/images/no-cover.jpg'
data = []

for item in soup.select('.book-slider-frame'):
    
    data.append({
        'link' : baseurl+item.a['href'],
        'cover' : baseurl+item.img['data-src'] if item.img['src'] != nocover else baseurl+nocover
    })
    
data

Output

[{'link': 'https://www.nb.co.za//en/view-book/?id=9780798182539',
  'cover': 'https://www.nb.co.za//en/helper/ReadImage/25929'},
 {'link': 'https://www.nb.co.za//en/view-book/?id=9780798182546',
  'cover': 'https://www.nb.co.za//en/helper/ReadImage/25931'},
 {'link': 'https://www.nb.co.za//en/view-book/?id=9780798182553',
  'cover': 'https://www.nb.co.za//en/helper/ReadImage/25925'},...]
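The double slash after the host in the output above comes from joining a base URL that ends in / with paths that start with /. One possible way to avoid it, sketched here with the standard library's urljoin (the absolute_url helper and the BASE_URL constant are names chosen for illustration, not part of the answer):

```python
from urllib.parse import urljoin

BASE_URL = "https://www.nb.co.za/"  # same value as baseurl in the answer


def absolute_url(path: str, base: str = BASE_URL) -> str:
    """Resolve a site-relative path against the base URL (illustrative helper)."""
    # urljoin treats a path starting with "/" as relative to the host root,
    # so no duplicate slash is produced.
    return urljoin(base, path)


print(absolute_url("/en/helper/ReadImage/25929"))
# https://www.nb.co.za/en/helper/ReadImage/25929
print(absolute_url("/en/view-book/?id=9780798182539"))
# https://www.nb.co.za/en/view-book/?id=9780798182539
```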

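【Comments】:

Thank you very much, this worked for me!

【Solution 2】:

I'll show you how to do it for this small example and leave the rest to you. Just use the imghdr module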

import imghdr

import requests
from bs4 import BeautifulSoup

data="""<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>"""
soup=BeautifulSoup(data, 'lxml')
base_url="https://www.nb.co.za"
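img_src=soup.select_one('img')['data-src']
img_name=img_src.split('/')[-1]
# download the image; the URL has no extension, so inspect the bytes instead
response=requests.get(base_url+img_src)
with open(img_name, 'wb') as f:
    f.write(response.content)

# imghdr.what() identifies the image type from the file's contents
print(imghdr.what(img_name))  # prints: jpeg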
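Note that imghdr was deprecated in Python 3.11 and removed in Python 3.13. On newer interpreters, the same idea (identifying the type from the first bytes of the file) can be sketched by hand. The sniff_image_type helper below is hypothetical and only recognises three common formats; it works on bytes in memory, so the download does not have to be written to disk first.

```python
from typing import Optional

# Leading "magic" bytes of a few common image formats.
_SIGNATURES = (
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
)


def sniff_image_type(content: bytes) -> Optional[str]:
    """Return 'jpeg', 'png' or 'gif' based on the leading bytes, else None."""
    for signature, name in _SIGNATURES:
        if content.startswith(signature):
            return name
    return None


# Hand-made byte strings standing in for downloaded content.
print(sniff_image_type(b"\xff\xd8\xff\xe0\x00\x10JFIF"))  # jpeg
print(sniff_image_type(b"not an image"))  # None
```

With requests, the bytes to pass in would be response.content.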

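【Comments】:

Thank you, this helped a lot. The answer above also works and suits my needs better. Thanks for your input.

【Solution 3】:

If the page responds slowly, you can pass the timeout argument to requests, or set timeout=None, which tells requests to wait indefinitely for a response.

The reason you get .gif at the end of the image result is that the real image has not been loaded yet and a placeholder GIF is displayed in its place.

You can access the data-src attribute the same way you would access a dictionary: tag[attribute]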
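As a small illustration of dictionary-style access, the sketch below reads data-src from the markup in the question. The second, image-less container is made up for the example, and cover_paths is a hypothetical helper name. It uses tag.get(), which returns None for a missing attribute instead of raising KeyError, and the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# First block is the markup from the question; the second container
# (without an <img>) is invented to show the missing-image case.
html = """
<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>
<div class="img-container"><a href="/en/view-book/?id=0"></a></div>
"""


def cover_paths(markup: str) -> list:
    """Collect the data-src of each container's image, or None if absent."""
    soup = BeautifulSoup(markup, "html.parser")
    paths = []
    for container in soup.select(".img-container"):
        img = container.select_one("a img")
        # tag["data-src"] would also work, but raises KeyError when missing.
        paths.append(img.get("data-src") if img is not None else None)
    return paths


print(cover_paths(html))
# ['/en/helper/ReadImage/25929', None]
```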

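If you want to save the image locally, you can use urllib.request.urlretrieve:

import urllib.request

# the second argument is the target file name and must be a string
urllib.request.urlretrieve("BOOK_COVER_URL", "file_name.jpg") # will save in the current directory

Code and example in the online IDE: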

from bs4 import BeautifulSoup
import requests, lxml

response = requests.get(f'https://www.nb.co.za/en/books/0-6-years', timeout=None)
soup = BeautifulSoup(response.text, 'lxml')

for result in soup.select(".img-container"):
    link = f'https://www.nb.co.za{result.select_one("a")["href"]}'

    # the last 3 results have no image on the website, so select_one() returns None
    # and subscripting it raises TypeError; a tag without data-src raises KeyError
    try:
        image = f'https://www.nb.co.za{result.select_one("a img")["data-src"]}'
    except (TypeError, KeyError):
        image = None

    print(link, image, sep="\n")


# part of the output:
'''
# first result (Step by Step: Counting to 50)
https://www.nb.co.za/en/view-book/?id=9780798182539
https://www.nb.co.za/en/helper/ReadImage/25929

# last result WITH image preview (Dinosourusse - Feite en geite: Daar’s ’n trikeratops op die trampoline)
https://www.nb.co.za/en/view-book/?id=9780624035480
https://www.nb.co.za/en/helper/ReadImage/10853

# last result (Uhambo lukamusa (isiZulu)) WITH NO image preview on the website as well so it returned None
https://www.nb.co.za/en/view-book/?id=9780624043003
None
'''
