What's the fastest and easiest way to download all the images from a website? (Answers)

【Question title】: What's the fastest and easiest way to download all the images from a website
【Posted】: 2012-01-11 00:28:24
【Question description】:

What is the fastest and easiest way to download all the images from a website? More specifically, from http://www.cycustom.com/large/.

I'm thinking of something along the lines of wget or curl.

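To clarify: first (and foremost), I currently don't know how to accomplish this task. Second, I'd like to know whether wget or curl offers the more easily understood solution. Thanks.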

--- UPDATE @sarnold ---

Thank you for responding. I thought that would do the trick too, but it did not. Here is the output of the command:

wget --mirror --no-parent http://www.cycustom.com/large/
--2012-01-10 18:19:36--  http://www.cycustom.com/large/
Resolving www.cycustom.com... 64.244.61.237
Connecting to www.cycustom.com|64.244.61.237|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `www.cycustom.com/large/index.html'

    [  <=>                                                                                                                                                                                                                                  ] 188,795      504K/s   in 0.4s    

Last-modified header missing -- time-stamps turned off.
2012-01-10 18:19:37 (504 KB/s) - `www.cycustom.com/large/index.html' saved [188795]

Loading robots.txt; please ignore errors.
--2012-01-10 18:19:37--  http://www.cycustom.com/robots.txt
Connecting to www.cycustom.com|64.244.61.237|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 174 [text/plain]
Saving to: `www.cycustom.com/robots.txt'

100%[======================================================================================================================================================================================================================================>] 174         --.-K/s   in 0s      

2012-01-10 18:19:37 (36.6 MB/s) - `www.cycustom.com/robots.txt' saved [174/174]

FINISHED --2012-01-10 18:19:37--
Downloaded: 2 files, 185K in 0.4s (505 KB/s)

Here is a picture of the files that were created: https://img.skitch.com/20120111-nputrm7hy83r7bct33midhdp6d.jpg

My objective is to have a folder full of image files. The following command did not achieve that:

wget --mirror --no-parent http://www.cycustom.com/large/

【Comments on the question】:

@sarnold Here's a picture of the index.html file created w/ some notes

Tags: curl wget

【Solution 1】:

wget --mirror --no-parent http://www.example.com/large/

--no-parent keeps wget from climbing into the parent directory, so it does not slurp the entire website.

Ahh, I see they have put up a robots.txt asking robots not to download photos from that directory:

$ curl http://www.cycustom.com/robots.txt
User-agent: *
Disallow: /admin/
Disallow: /css/
Disallow: /flash/
Disallow: /large/
Disallow: /pdfs/
Disallow: /scripts/
Disallow: /small/
Disallow: /stats/
Disallow: /temp/
$
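To see why the mirror stopped after index.html, the rules above can be checked by hand. The following is a minimal sketch, not a full implementation of the robots exclusion rules: it assumes only the `User-agent: *` group matters and treats each `Disallow` value as a plain path prefix. The helper names `disallowed_prefixes` and `blocked?` are made up for this illustration.

```ruby
# robots.txt as returned by the curl command above.
ROBOTS_TXT = <<~TXT
  User-agent: *
  Disallow: /admin/
  Disallow: /css/
  Disallow: /flash/
  Disallow: /large/
  Disallow: /pdfs/
  Disallow: /scripts/
  Disallow: /small/
  Disallow: /stats/
  Disallow: /temp/
TXT

# Collect the Disallow prefixes listed under "User-agent: *".
# Simplification: groups naming several user agents are not handled.
def disallowed_prefixes(robots_txt)
  applies = false
  prefixes = []
  robots_txt.each_line do |line|
    # Drop comments, then split "Field: value" at the first colon.
    field, value = line.sub(/#.*/, "").strip.split(":", 2).map(&:strip)
    next if field.nil? || field.empty?
    case field.downcase
    when "user-agent"
      applies = (value == "*")
    when "disallow"
      prefixes << value if applies && !value.to_s.empty?
    end
  end
  prefixes
end

# True when the path starts with any disallowed prefix.
def blocked?(robots_txt, path)
  disallowed_prefixes(robots_txt).any? { |prefix| path.start_with?(prefix) }
end
```

With these rules, `blocked?(ROBOTS_TXT, "/large/photo.jpg")` is true for any file under /large/ (photo.jpg is a placeholder name), which matches wget fetching only index.html and robots.txt.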

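wget(1) does not document any way to ignore robots.txt, and I have never found an easy way to do the equivalent of --mirror with curl(1). If you want to keep using wget(1), you would need to insert an HTTP proxy in the middle that returns 404 for GET /robots.txt requests.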

I think it is easier to change approach. Since I wanted more experience with Nokogiri, here is what I came up with:

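#!/usr/bin/ruby
require 'open-uri'
require 'nokogiri'

# URI.open is used because a bare open() no longer fetches URLs on Ruby 3.0+.
doc = Nokogiri::HTML(URI.open("http://www.cycustom.com/large/"))

# The selector assumes each image is linked from a table cell in the listing.
doc.css('tr > td > a').each do |link|
  name = link['href']
  # Skip anchors without an href and anything that is not a JPEG link.
  next unless name && name.match(/jpg/)
  File.open(name, "wb") do |out|
    # .read returns the image bytes; writing the IO object itself would not.
    out.write(URI.open("http://www.cycustom.com/large/" + name).read)
  end
end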

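This is just a quick and dirty script; embedding the URL twice is a bit ugly. So if this is meant for long-term production use, clean it up first, or figure out how to use rsync(1) instead.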
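One possible cleanup, offered as a sketch rather than as the answer's own code: keep the URL in a single constant and move the filtering and saving into small helpers. The names `BASE_URL`, `image_urls`, and `download` are hypothetical, and the stricter `.jpg`/`.jpeg` suffix test is an assumption about how the files are named.

```ruby
require 'open-uri'
require 'uri'

# Listing page from the question; every relative href is resolved against it.
BASE_URL = "http://www.cycustom.com/large/"

# Keep only hrefs ending in .jpg/.jpeg and turn them into absolute URLs.
def image_urls(base_url, hrefs)
  hrefs.compact
       .select { |href| href.match?(/\.jpe?g\z/i) }
       .map { |href| URI.join(base_url, href).to_s }
end

# Fetch one image and save it under its own file name inside dir.
def download(url, dir = ".")
  path = File.join(dir, File.basename(URI(url).path))
  File.open(path, "wb") { |out| out.write(URI.open(url).read) }
  path
end
```

With the Nokogiri document from the script above, the hrefs could be collected with `doc.css('tr > td > a').map { |a| a['href'] }` and each result of `image_urls(BASE_URL, hrefs)` passed to `download`. Hrefs containing spaces or other characters that are not valid in a URI would need escaping first.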

【Comments】:

Edited the original question to include the results of your suggestion.

【Solution 2】:

You can make wget ignore the robots.txt file by adding the following option:

-e robots=off

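I also recommend adding an option that slows the download down, to limit the load on the server. For example, this option waits 30 seconds between one file and the next: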

--wait 30
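Combined with the command from the question, the two options might be used together as follows. This is a sketch in Ruby (to match the script in Solution 1) that only builds the argument list. The helper name `wget_mirror_args` is hypothetical, and it uses only flags that appear in this thread.

```ruby
# Build the wget command line: mirror one directory, optionally ignore
# robots.txt, and pause between files to go easy on the server.
def wget_mirror_args(url, wait_seconds: 30, ignore_robots: true)
  args = ["wget", "--mirror", "--no-parent"]
  args += ["-e", "robots=off"] if ignore_robots
  args += ["--wait", wait_seconds.to_s]
  args << url
end

# To actually run it (requires wget on the PATH):
#   system(*wget_mirror_args("http://www.cycustom.com/large/"))
```

Passing the arguments as a list to `system` avoids shell quoting issues. A shorter wait can be chosen through `wait_seconds` if 30 seconds per file is too slow for a large directory.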

【Comments】: