How to extract and download all images from a website using BeautifulSoup? Answers

【Question title】: How to extract and download all images from a website using BeautifulSoup?
【Posted】: 2013-08-26 19:52:54
【Question description】:

I am trying to extract and download all the images from a URL. I wrote this script:

import urllib2
import re
from os.path import basename
from urlparse import urlsplit

url = "http://filmygyan.in/katrina-kaifs-top-10-cutest-pics-gallery/"
urlContent = urllib2.urlopen(url).read()
# HTML image tag: <img src="url" alt="some_text"/>
imgUrls = re.findall('img .*?src="(.*?)"', urlContent)

# download all images
for imgUrl in imgUrls:
    try:
        imgData = urllib2.urlopen(imgUrl).read()
        fileName = basename(urlsplit(imgUrl)[2])
        output = open(fileName,'wb')
        output.write(imgData)
        output.close()
    except:
        pass

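I don't want to extract only the image shown on this page (see this image: http://i.share.pho.to/1c9884b1_l.jpeg). I want to get all the images without clicking the "Next" button, and I can't work out how to get all the pictures inside the "Next" class. What should I change in findall?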

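【Question discussion】:

You want to use BeautifulSoup but aren't sure how to go about it?
Yes. I'm not sure how I should use findall or findnext. The script above grabs all the images at that URL, but I want (see the image link) to grab all the images of the slideshow that appear after clicking the Next button.
Tell me one thing: why do you want to download pictures from filmygyan? Then I can give you a solution to your problem..!
Use wget
@khan Nothing special. I'm just learning.

Tags: python beautifulsoup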

【Solution 1】:

The following should extract all the images from a given page and write them to the directory the script is run from.

import re
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

site = 'http://pixabay.com'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

# skip <img> tags that have no src attribute
urls = [img['src'] for img in img_tags if img.get('src')]


for url in urls:
    # the last path component becomes the local file name
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if not filename:
        print("Regex didn't match with the url: {}".format(url))
        continue
    # an image source can be relative or protocol-relative;
    # urljoin resolves it against the page URL, which is the
    # site variable here
    url = urljoin(site, url)
    response = requests.get(url)
    # only create the file once the download has succeeded
    if response.ok:
        with open(filename.group(1), 'wb') as f:
            f.write(response.content)

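【Discussion】:

Does it save the images to a folder?
'NoneType' object has no attribute 'group'
In reply to you, Mostafa: I added a try and except statement, which seems to have fixed the problem, at least for me. I still can't get the Windows media viewer to open the images....
Well, "'NoneType' object has no attribute 'group'" just means that the regex didn't match anything. I made a correction that prints out the URLs that don't match.
Hi Jonathan, thanks for updating the code to clear up that issue. Is there any reason why the images can't be opened after they are downloaded?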
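
On the regex mismatch discussed in the comments above: the pattern only matches URLs that end in a plain .jpg, .gif or .png, so an image URL with a query string or an upper-case extension is skipped. A minimal sketch of an alternative, which takes the file name from the URL path with the standard library instead of a regex, is shown below. The helper name filename_from_url and the extension set are assumptions made for illustration, not part of the answer.

```python
import posixpath
from urllib.parse import unquote, urlsplit

# Extensions treated as images in this sketch (an assumption, extend as needed).
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".gif", ".png"}


def filename_from_url(url):
    """Return a local file name for an image URL, or None if the URL
    path does not end in one of the known image extensions."""
    # urlsplit separates the path from any query string or fragment
    path = unquote(urlsplit(url).path)
    name = posixpath.basename(path)
    extension = posixpath.splitext(name)[1].lower()
    return name if extension in IMAGE_EXTENSIONS else None
```

In the loop from the answer, a None result would take the place of the "regex didn't match" branch.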

【Solution 2】:

A slight modification to Jonathan's answer (because I can't comment): adding "www" to the site will fix most "File Type Not Supported" errors.

import re
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

site = 'http://www.google.com'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

# skip <img> tags that have no src attribute
urls = [img['src'] for img in img_tags if img.get('src')]


for url in urls:
    # the last path component becomes the local file name
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    if not filename:
        print("Regex didn't match with the url: {}".format(url))
        continue
    # an image source can be relative or protocol-relative;
    # urljoin resolves it against the page URL, which is the
    # site variable here
    url = urljoin(site, url)
    response = requests.get(url)
    # only create the file once the download has succeeded
    if response.ok:
        with open(filename.group(1), 'wb') as f:
            f.write(response.content)
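
One possible cause of the "File Type Not Supported" error mentioned above (this is an assumption, the thread does not confirm it) is that the server returned something other than image data, for example an HTML redirect or error page, which was then saved under an image file name. A minimal sketch for checking the downloaded bytes before writing them is shown below. The helper name sniff_image_type is hypothetical, and only the well-known JPEG, PNG and GIF signatures are covered.

```python
# Leading bytes ("magic numbers") of the three formats the answer's regex accepts.
SIGNATURES = {
    b"\xff\xd8\xff": "jpg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(data):
    """Return 'jpg', 'png' or 'gif' when the bytes start with a known
    image signature, otherwise None."""
    for magic, kind in SIGNATURES.items():
        if data.startswith(magic):
            return kind
    return None
```

Calling sniff_image_type(response.content) before f.write and skipping the file when it returns None would keep HTML pages from being stored as images.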

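【Discussion】:

【Solution 3】:

If you only want the pictures, you can download them directly without even scraping the web page. They all share the same URL pattern: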

http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-cutest-pics-gallery/cute1.jpg
http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-cutest-pics-gallery/cute2.jpg
...
http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-cutest-pics-gallery/cute10.jpg

So simple code like this will give you all the images:

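import os
# Python 3 location of urlretrieve (in Python 2 it was urllib.urlretrieve)
from urllib.request import urlretrieve


# every gallery image shares this URL; only the number changes
baseUrl = ("http://filmygyan.in/wp-content/gallery/katrina-kaifs-top-10-"
           "cutest-pics-gallery/cute%s.jpg")

# cute1.jpg through cute10.jpg
for i in range(1, 11):
    url = baseUrl % i
    # save each image under its own name in the current directory
    urlretrieve(url, os.path.basename(url))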

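With BeautifulSoup you have to click through or go to the next page to scrape the images. If you want to scrape each page individually, try selecting the images by their class, shutterset_katrina-kaifs-top-10-cutest-pics-gallery.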
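
The class-based approach suggested above might be sketched as follows. This is a minimal sketch under stated assumptions, not the answerer's code: it assumes the class sits on the gallery's link or image tags and that the full-size image is in href (or src), and the sample HTML, the demo paths and the helper name gallery_image_urls are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Class name quoted in the answer.
GALLERY_CLASS = "shutterset_katrina-kaifs-top-10-cutest-pics-gallery"

# Hypothetical markup, only to make the sketch runnable without a network.
SAMPLE_HTML = """
<div>
  <a class="shutterset_katrina-kaifs-top-10-cutest-pics-gallery"
     href="/wp-content/gallery/demo/cute1.jpg"><img src="/thumbs/cute1.jpg"></a>
  <a class="shutterset_katrina-kaifs-top-10-cutest-pics-gallery"
     href="/wp-content/gallery/demo/cute2.jpg"><img src="/thumbs/cute2.jpg"></a>
  <a class="next" href="?page=2">Next</a>
</div>
"""


def gallery_image_urls(html, page_url, css_class=GALLERY_CLASS):
    """Collect absolute image URLs from every tag that carries css_class."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for tag in soup.find_all(class_=css_class):
        # assumed layout: links point at the full image, images use src
        target = tag.get("href") or tag.get("src")
        if target:
            urls.append(urljoin(page_url, target))
    return urls
```

For a live page, the html argument would be the text of a requests.get response for that page, and the function would be called once per gallery page.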

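【Discussion】:

But your script won't work in this case. Please look at the URL filmygyan.in/…, because here the file names vary randomly between sexy112.jpg, sexy117.jpg and sexy12.jpg. If I set the range to (1,117), it also downloads garbage values.
So you are using a different URL? That is a completely different question. If you need to get all the images from the new URL, open another question. If you want a script that works for all the pages on your site, then your NEW question has to provide all the required information (such as the classes, IDs, or tags used on each page).
Okay. I thought this script would work for all URLs, because I checked it on a few of them, but after 2 or 3 URLs I got stuck, because this time the URLs didn't follow a pattern like (1,12) or (1,20). It looks like I have to post another question to get all the images from any URL.
Yes, you do. But do you know how many URLs you will have that you want to download images from? I think there is a pattern that would let you write a script that works for all the pages at those URLs.
Yes, I am trying to figure out that pattern. Maybe I should look for the "div" that contains all the images.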
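
On the "garbage values" mentioned in the comments above: when the numbering has gaps, one option is to try every candidate number and keep only the URLs the server actually serves. The sketch below illustrates that idea with the standard library. The names fetch_or_none and existing_images are hypothetical, and whether the garbage files in the thread came from error responses is an assumption.

```python
import posixpath
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_or_none(url):
    """Return the response body, or None when the request fails
    (for example with a 404 for a number that does not exist)."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read()
    except (HTTPError, URLError):
        return None


def existing_images(template, numbers, fetch=fetch_or_none):
    """Yield (file name, data) for every numbered URL that can be fetched."""
    for number in numbers:
        url = template % number
        data = fetch(url)
        if data is None:
            continue  # gap in the numbering, nothing to save
        yield posixpath.basename(url), data
```

Each yielded pair can then be written with open(name, 'wb'). The fetch parameter is there so the loop can be exercised without touching the network.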