How to download images from BeautifulSoup? Answers

【Question title】: How to download images from BeautifulSoup?
【Posted】: 2016-09-06 14:17:42
【Question description】:

Image: http://i.imgur.com/OigSBjF.png

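import requests
from bs4 import BeautifulSoup

r = requests.get("xxxxxxxxx")
soup = BeautifulSoup(r.content)
links = soup.find_all("img")  # assumed: the snippet does not show how links is built

for link in links:
    if "http" in link.get('src'):
        print(link.get('src'))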

I get the URLs printed out, but I don't know how to use them.

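【Question discussion】:

BeautifulSoup is for parsing HTML; requests is for making HTTP requests. Downloading falls into the latter category. Call requests.get on that URL, then check the documentation on how to save the response body.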
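The comment above boils down to two steps: fetch the image URL with requests.get, then write the response body to a file opened in binary mode. A minimal sketch in Python 3 might look like this. The helper names filename_from_url and download_image are made up for illustration, and the 10-second timeout is an arbitrary assumption.

```python
import os
from urllib.parse import urlsplit

import requests


def filename_from_url(url, default="image"):
    """Derive a local file name from the path part of a URL."""
    name = os.path.basename(urlsplit(url).path)
    return name or default


def download_image(url, folder="."):
    """Fetch url and write the raw response body to disk; return the file path."""
    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    path = os.path.join(folder, filename_from_url(url))
    with open(path, "wb") as f:  # binary mode: image bytes, not text
        f.write(response.content)
    return path
```

Calling download_image on each URL printed by the loop in the question would save each image into the current directory.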

Tags: python python-2.7 beautifulsoup scrape

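【Solution 1】:

The other answers are perfectly correct.

However, I found the downloads slow, and with really high-resolution images I had no way of knowing the progress.

So I wrote this.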

from bs4 import BeautifulSoup
import requests
import subprocess

url = "https://example.site/page/with/images"
html = requests.get(url).text  # get the html
soup = BeautifulSoup(html, "lxml")  # give the html to soup

# get all the anchor links with the custom class
# the element or the class name will change based on your case
imgs = soup.find_all("a", {"class": "envira-gallery-link"})
for img in imgs:
    imgUrl = img['href']  # get the href from the tag
    cmd = ['wget', imgUrl]  # just download it using wget
    # start the download without waiting, so the downloads run in parallel
    subprocess.Popen(cmd)
    # if you don't want to run them in parallel and would rather wait
    # for each image to finish, use this line instead of the one above:
    # subprocess.Popen(cmd).communicate()

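Warning: it won't work on Windows/macOS out of the box, because it relies on wget being installed.

Bonus: if you don't use communicate, you can see the progress of each image.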
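For readers without wget, progress can also be reported from Python itself by streaming the response in chunks. This is a different technique from the answer's wget approach, sketched here under some assumptions: Python 3, a hypothetical pair of helpers named write_chunks and download_with_progress, an arbitrary 8192-byte chunk size, and a server that may or may not send a Content-Length header.

```python
import requests


def write_chunks(chunks, fileobj, total=None, report=print):
    """Write an iterable of byte chunks to fileobj, reporting progress after each."""
    done = 0
    for chunk in chunks:
        fileobj.write(chunk)
        done += len(chunk)
        if total:
            report("%d/%d bytes (%.0f%%)" % (done, total, 100.0 * done / total))
        else:
            report("%d bytes" % done)
    return done


def download_with_progress(url, path, chunk_size=8192):
    """Stream url to path so a large image is never held fully in memory."""
    with requests.get(url, stream=True, timeout=10) as response:
        response.raise_for_status()
        # Content-Length can be missing; then only the byte count is reported.
        total = int(response.headers.get("Content-Length", 0)) or None
        with open(path, "wb") as f:
            return write_chunks(response.iter_content(chunk_size), f, total)
```

Keeping the chunk-writing loop separate from the HTTP call means the progress logic can be exercised without touching the network.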

【Discussion】:

Launching a subprocess for wget shouldn't be any faster than using a Python library for HTTP.
I can't run this code. Running either subprocess.Popen(cmd) or subprocess.Popen(cmd).communicate() gives me FileNotFoundError: [WinError 2] The system cannot find the file specified
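The FileNotFoundError reported above is what subprocess.Popen raises when the executable cannot be found, which is the usual outcome on a Windows machine without wget. One way to guard against it is to look the program up with shutil.which before launching it. The helper names build_wget_command and download_with_wget below are hypothetical, and what to do when wget is missing is left to the caller.

```python
import shutil
import subprocess


def build_wget_command(url, program="wget"):
    """Return the argv list for downloading url, or None if program is not on PATH."""
    executable = shutil.which(program)
    if executable is None:
        return None
    return [executable, url]


def download_with_wget(url):
    """Run wget for url and wait for it; return False if wget is unavailable or fails."""
    cmd = build_wget_command(url)
    if cmd is None:
        print("wget was not found on PATH")
        return False
    # subprocess.call waits for the process and returns its exit code.
    return subprocess.call(cmd) == 0
```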

【Solution 2】:

You need to download the image and write it to disk:

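import requests
from bs4 import BeautifulSoup
from os.path import basename

r = requests.get("xxx")
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("img")  # assumed: the question does not show how links is built

for link in links:
    lnk = link.get('src')
    # skip tags that have no src, as well as relative URLs
    if lnk and "http" in lnk:
        # save under the last path component of the URL
        with open(basename(lnk), "wb") as f:
            f.write(requests.get(lnk).content)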

You can also use select to filter the tags so that you only get those whose src starts with http:

for link in soup.select("img[src^=http]"):
    lnk = link["src"]
    with open(basename(lnk), "wb") as f:
        f.write(requests.get(lnk).content)
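Note that the img[src^=http] selector skips images whose src is relative (for example /static/a.png). If those are wanted too, one option is to select every img that has a src and resolve it against the page URL with urljoin. A sketch, assuming Python 3 and the built-in html.parser backend; image_urls is a hypothetical helper, not part of BeautifulSoup:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def image_urls(html, page_url):
    """Return absolute URLs for every <img> tag that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for img in soup.select("img[src]"):
        # urljoin leaves absolute URLs untouched and resolves relative ones.
        urls.append(urljoin(page_url, img["src"]))
    return urls
```

Each returned URL can then be fetched and written to disk in the same way as in the loops above.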

【Discussion】:

No worries. Using the select method is the best way to filter the tags.