How to use urllib to download an image from the web: answers

【Question title】: How to use urllib to download image from web
【Posted】: 2011-12-05 17:12:48
【Question description】:

I am trying to download an image using this code:

from urllib import urlretrieve
urlretrieve('http://gdimitriou.eu/wp-content/uploads/2008/04/google-image-search.jpg', 
            'google-image-search.jpg')

It worked. The image was downloaded and can be opened with any image viewer.

However, the code below does not work. The downloaded file is only 2 KB, and no image viewer can open it.

from urllib import urlretrieve
urlretrieve('http://upload.wikimedia.org/wikipedia/en/4/44/Zindagi1976.jpg', 
            'Zindagi1976.jpg')

The downloaded file turns out to be HTML. This is what it contains:

    ERROR

The requested URL could not be retrieved

While trying to retrieve the URL: http://upload.wikimedia.org/wikipedia/en/4/44/Zindagi1976.jpg

The following error was encountered:

Access Denied.
Access control configuration prevents your request from being allowed at this time. Please contact your service provider if you feel this is incorrect.

Your cache administrator is nobody. 
Generated Mon, 05 Dec 2011 17:19:53 GMT by sq56.wikimedia.org (squid/2.7.STABLE9)

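【Comments】:

2 KB is usually plain text or HTML. Try changing 'Zindagi1976.jpg' to 'Zindagi1976.html' and opening it in a browser. That information may help with debugging. (I suspect a header issue.) Please post it here.
Wikimedia appears to be inspecting your request. When you navigate to the image in a browser, the browser sends Wikimedia.org information about your setup (for example, your user-agent). Based on whatever Python sends, the server denies access. I don't know how to get around this with urlretrieve. curl can probably do what you want, though it is not the best solution.
Your request appears to have been denied. I wouldn't be surprised if the server refuses access to unknown user agents.
There is no reason to use pastebin. Please post the relevant information directly in your question.
The problem with urlretrieve is that it accepts whatever the server returns and saves it as a JPEG file. That is a problem when the server returns "page not found" or some other error, because you then have to work out what was actually sent. Remove the extension and open the file in Notepad to see what the server sent you.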
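
The last comment suggests inspecting the saved file by hand. As a rough illustration, the same check can be automated by looking at the first bytes of the payload. This is only a sketch written for Python 3 (the thread itself uses Python 2); the helper names `looks_like_jpeg`, `looks_like_html` and `file_looks_like_jpeg` are made up for this example, and the HTML test is a simple heuristic rather than a real parser.

```python
# JPEG files begin with the start-of-image marker FF D8 followed by FF.
JPEG_MAGIC = b"\xff\xd8\xff"


def looks_like_jpeg(data):
    """Return True if the byte string starts with the JPEG signature."""
    return data[:3] == JPEG_MAGIC


def looks_like_html(data):
    """Heuristic: return True if the payload begins like an HTML document."""
    head = data[:512].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


def file_looks_like_jpeg(path):
    """Read only the first three bytes of a file and check the signature."""
    with open(path, "rb") as f:
        return looks_like_jpeg(f.read(3))
```

Calling `file_looks_like_jpeg('Zindagi1976.jpg')` on the 2 KB file from the question would presumably return False, since that file holds the squid error page rather than image data.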

Tags: python urllib

【Solution 1】:

If you use the following, you can download the image:

wget http://upload.wikimedia.org/wikipedia/en/4/44/Zindagi1976.jpg

But if you do the following:

from urllib import urlretrieve
urlretrieve('http://upload.wikimedia.org/wikipedia/en/4/44/Zindagi1976.jpg', 
            'Zindagi1976.jpg')

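you may not be able to download the image. The likely reason is that Wikipedia has rules (robots.txt) to deny robots and other unknown clients. Try emulating a browser.

To do that, you have to send the following as one of the request headers: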

('User-agent',
 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) '
 'Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')

You can do it like this:

>>> from urllib import FancyURLopener
>>> class MyOpener(FancyURLopener):
...     version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'
... 
>>> myopener = MyOpener()
>>> myopener.retrieve('http://upload.wikimedia.org/wikipedia/en/4/44/Zindagi1976.jpg', 'Zindagi1976.jpg')
('Zindagi1976.jpg', <httplib.HTTPMessage instance at 0x1007bfe18>)

This retrieves the file.
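
The session above is Python 2. For readers on Python 3, where `FancyURLopener` is deprecated, a roughly equivalent approach is to attach the User-Agent header to a `urllib.request.Request`. This is a minimal sketch, not the answerer's code: the function names `build_request` and `download` are invented for illustration, and the User-Agent string is simply the one quoted in the answer (any browser-like string is assumed to work the same way).

```python
import shutil
import urllib.request

# Assumption: the browser-like string from the answer; any similar value may do.
USER_AGENT = ("Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) "
              "Gecko/20071127 Firefox/2.0.0.11")


def build_request(url, user_agent=USER_AGENT):
    """Create a request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, filename, user_agent=USER_AGENT):
    """Stream the response body to a local file and return the file name.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so an
    error page is not silently saved under an image file name.
    """
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request) as response, open(filename, "wb") as out:
        shutil.copyfileobj(response, out)
    return filename
```

A call would look like `download('http://upload.wikimedia.org/wikipedia/en/4/44/Zindagi1976.jpg', 'Zindagi1976.jpg')`. Whether the server accepts the request still depends on its access rules at the time.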

【Comments】:

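I tried it. NameError: name 'FancyURLopener' is not defined
@no_access: Thanks! I just changed the question to make it easier to search for.
I am looking for a quick way to get the HTTP response code from a URL. If the code is 200, then download the images. Can I get the response code with `MyOpener`? Thanks
@Organic: Use a "HEAD" request. This has already been answered in another SO question: stackoverflow.com/questions/107405/…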
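
As a rough illustration of the HEAD-request suggestion, the sketch below asks only for the status code before any download is attempted. It is written for Python 3 and is not taken from the linked question; `build_head_request` and `get_status` are hypothetical helper names, and the default User-Agent value is a placeholder assumption.

```python
import urllib.error
import urllib.request


def build_head_request(url, user_agent="Mozilla/5.0"):
    """Create a HEAD request, so the server sends headers but no body."""
    return urllib.request.Request(
        url, headers={"User-Agent": user_agent}, method="HEAD"
    )


def get_status(url):
    """Return the HTTP status code for url.

    urlopen raises HTTPError for 4xx/5xx responses; the code is read
    from the exception in that case instead of propagating the error.
    """
    try:
        with urllib.request.urlopen(build_head_request(url)) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
```

A caller could then download only when `get_status(url) == 200`. Some servers handle HEAD differently from GET, so the result is a hint rather than a guarantee.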