【Question title】: Get image data-src with Beautiful Soup when there is no image extension
【Posted】: 2022-01-05 02:49:30
【Question】:

I am trying to get the image URLs of all the books on this page, https://www.nb.co.za/en/books/0-6-years, with Beautiful Soup.

This is my code:

from bs4 import BeautifulSoup
import requests

baseurl = "https://www.nb.co.za/"
productlinks = []

r = requests.get(f'https://www.nb.co.za/en/books/0-6-years')
soup = BeautifulSoup(r.content, 'lxml')
productlist = soup.find_all('div', class_="book-slider-frame")

def my_filter(tag):
    return (tag.name == 'a' and
        tag.parent.name == 'div' and
        'img-container' in tag.parent['class'])

for item in productlist:
    for link in item.find_all(my_filter, href=True):
        productlinks.append(baseurl + link['href'])

        cover = soup.find_all('div', class_="img-container")
        print(cover)

This is my output:

<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>

This is what I expect to get:

https://www.nb.co.za/en/helper/ReadImage/25929.jpg

My questions are:

  1. How do I get only the data-src?

  2. How do I get the image's extension?

【Comments】:

    Tags: python web-scraping beautifulsoup


    【Answer 1】:

    1: How do I get only the data-src?

    You can access data-src with element['data-src']:

    cover = baseurl+item.img['data-src'] if item.img['src'] != nocover else baseurl+nocover
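
As a small self-contained illustration of the attribute access above, the sketch below parses the snippet from the question and reads data-src from the img tag. It also shows Tag.get, which returns None instead of raising KeyError when an attribute is missing. The html.parser backend is used here only to avoid the lxml dependency, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# HTML fragment copied from the question's output
html = """<div class="img-container">
<a href="/en/view-book/?id=9780798182539">
<img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
</a>
</div>"""

soup = BeautifulSoup(html, "html.parser")
img = soup.select_one("div.img-container img")

# Dictionary-style access: raises KeyError if the attribute is absent
data_src = img["data-src"]

# .get() is the forgiving variant: returns None for a missing attribute
maybe_alt = img.get("alt")

print(data_src)
print(maybe_alt)
```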
    

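    2: How do I get the image's extension?

    You can determine the file's type the way diggusbickus describes (a good approach), but that will not help if you then try to request the file as https://www.nb.co.za/en/helper/ReadImage/25929.jpg, because that request results in a 404 error.

    The image is served dynamically. Additional information: https://stackoverflow.com/a/5110673/14460824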
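
Because the URL carries no extension, one option is to derive it from the Content-Type header of the image response rather than from the URL. The sketch below assumes the server sends an accurate Content-Type, which has not been verified for this site. The helper name and the mapping table are illustrative, not part of any library.

```python
from typing import Optional

# Small explicit mapping from media type to file extension; extend as needed
_EXTENSIONS = {
    "image/jpeg": ".jpg",
    "image/png": ".png",
    "image/gif": ".gif",
    "image/webp": ".webp",
}

def extension_from_content_type(content_type: Optional[str]) -> Optional[str]:
    """Return a file extension for a Content-Type header value, or None if unknown."""
    if not content_type:
        return None
    # Drop parameters such as "; charset=..." and normalise case
    media_type = content_type.split(";", 1)[0].strip().lower()
    return _EXTENSIONS.get(media_type)

print(extension_from_content_type("image/jpeg"))
```

With requests, the header value would typically be read with response.headers.get('Content-Type') after downloading the image.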

    Example

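    import requests
    from bs4 import BeautifulSoup

    # Fetch and parse the listing page (same setup as in the question)
    r = requests.get('https://www.nb.co.za/en/books/0-6-years')
    soup = BeautifulSoup(r.content, 'lxml')

    baseurl = "https://www.nb.co.za/"
    nocover = '/Content/images/no-cover.jpg'
    data = []

    for item in soup.select('.book-slider-frame'):
        # Use data-src unless src is the site's no-cover placeholder
        data.append({
            'link' : baseurl+item.a['href'],
            'cover' : baseurl+item.img['data-src'] if item.img['src'] != nocover else baseurl+nocover
        })

    data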
    

    Output

    [{'link': 'https://www.nb.co.za//en/view-book/?id=9780798182539',
      'cover': 'https://www.nb.co.za//en/helper/ReadImage/25929'},
     {'link': 'https://www.nb.co.za//en/view-book/?id=9780798182546',
      'cover': 'https://www.nb.co.za//en/helper/ReadImage/25931'},
     {'link': 'https://www.nb.co.za//en/view-book/?id=9780798182553',
      'cover': 'https://www.nb.co.za//en/helper/ReadImage/25925'},...]
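
The doubled slash in the URLs above comes from concatenating a base URL that ends in / with a path that starts with /. One possible alternative, shown here only as a sketch, is urllib.parse.urljoin from the standard library, which resolves the path against the base instead of gluing strings together.

```python
from urllib.parse import urljoin

baseurl = "https://www.nb.co.za/"

# urljoin resolves an absolute path against the base, so no "//" appears
link = urljoin(baseurl, "/en/view-book/?id=9780798182539")
cover = urljoin(baseurl, "/en/helper/ReadImage/25929")

print(link)
print(cover)
```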
    

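    【Comments】:

    • Thank you very much, this worked for me!
    【Answer 2】:

    I will show you how to do it for this small example and leave the rest to you. Just use the imghdr module: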

    import imghdr
    
    import requests
    from bs4 import BeautifulSoup
    
    data="""<div class="img-container">
    <a href="/en/view-book/?id=9780798182539">
    <img class="lazy" data-src="/en/helper/ReadImage/25929" src="/Content/images/loading5.gif"/>
    </a>
    </div>"""
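    soup=BeautifulSoup(data, 'lxml')
    base_url="https://www.nb.co.za"
    img_src=soup.select_one('img')['data-src']
    # Use the last path segment (here "25929") as the local file name
    img_name=img_src.split('/')[-1]
    # Download the image and save it to the current directory
    response=requests.get(base_url+img_src)
    with open(img_name, 'wb') as f:
        f.write(response.content)

    # imghdr inspects the file's leading bytes to identify the image type
    print(imghdr.what(img_name))  # prints: jpeg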
    
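
imghdr was deprecated in Python 3.11 and removed in Python 3.13, so on newer interpreters the same idea can be approximated by checking the leading bytes of the download directly. The sketch below covers only JPEG, PNG, and GIF. The function name is illustrative, and it is not a drop-in replacement for imghdr.what.

```python
from typing import Optional

def sniff_image_type(data: bytes) -> Optional[str]:
    """Guess an image type from its leading signature bytes, or return None."""
    if data.startswith(b"\xff\xd8\xff"):            # JPEG start-of-image marker
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):       # PNG signature
        return "png"
    if data.startswith((b"GIF87a", b"GIF89a")):     # GIF signatures
        return "gif"
    return None

# Example with a fabricated JPEG-like header (not a real image)
print(sniff_image_type(b"\xff\xd8\xff\xe0" + b"\x00" * 16))
```

It could be applied to the downloaded bytes (the .content of the requests response) without writing the file to disk first.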

    【Comments】:

    • Thank you, this helped a lot. The answer above also works and suits my needs better. Thanks for your input.
    【Answer 3】:

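    If the page is slow to respond, you can pass the timeout argument to requests, or set timeout=None, which tells requests to wait forever for a response. Note that the timeout only covers the HTTP response: requests does not run the page's JavaScript, so the lazy-loaded images are never swapped in.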

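    The reason you get .gif at the end of the image result is that src still points to the loading placeholder GIF; the real image has not been loaded yet.

    You can access the data-src attribute the same way you would access a dictionary: tag[attribute]


    If you want to save the image locally, you can use urllib.request.urlretrieve: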

    import urllib.request
    
    urllib.request.urlretrieve("BOOK_COVER_URL", "file_name.jpg") # saves to the current directory
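
Since the cover URL has no extension, the local file name has to be built separately before saving. A minimal sketch, assuming the extension is already known (for example from inspecting the downloaded bytes); the helper name cover_filename is hypothetical.

```python
import posixpath
from urllib.parse import urlsplit

def cover_filename(url: str, extension: str) -> str:
    """Build a local file name from the last URL path segment plus an extension."""
    # Ignore the query string and any trailing slash, keep only the final segment
    stem = posixpath.basename(urlsplit(url).path.rstrip("/"))
    return f"{stem}{extension}"

print(cover_filename("https://www.nb.co.za/en/helper/ReadImage/25929", ".jpg"))
```

The result can then be passed as the second argument to urllib.request.urlretrieve.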
    

    Code and example in the online IDE:

    from bs4 import BeautifulSoup
    import requests

    response = requests.get('https://www.nb.co.za/en/books/0-6-years', timeout=None)
    soup = BeautifulSoup(response.text, 'lxml')

    for result in soup.select(".img-container"):
        link = f'https://www.nb.co.za{result.select_one("a")["href"]}'

        # handle results that have no image on the website (last 3 results)
        try:
            image = f'https://www.nb.co.za{result.select_one("a img")["data-src"]}'
        except (TypeError, KeyError):
            # TypeError: no <img> found (select_one returned None); KeyError: no data-src attribute
            image = None

        print(link, image, sep="\n")
    
    
    # part of the output:
    '''
    # first result (Step by Step: Counting to 50)
    https://www.nb.co.za/en/view-book/?id=9780798182539
    https://www.nb.co.za/en/helper/ReadImage/25929
    
    # last result WITH image preview (Dinosourusse - Feite en geite: Daar’s ’n trikeratops op die trampoline)
    https://www.nb.co.za/en/view-book/?id=9780624035480
    https://www.nb.co.za/en/helper/ReadImage/10853
    
    # last result (Uhambo lukamusa (isiZulu)) WITH NO image preview on the website as well so it returned None
    https://www.nb.co.za/en/view-book/?id=9780624043003
    None
    '''
    

    【Comments】:
