【Question Title】: Error while downloading images from Wikipedia via python script
【Posted】: 2025-12-21 01:40:12
【Question】:

I am trying to download all the images from a particular Wikipedia page. Here is the code snippet:

from bs4 import BeautifulSoup as bs
import urllib2
import urlparse
from urllib import urlretrieve

site="http://en.wikipedia.org/wiki/Pune"
hdr= {'User-Agent': 'Mozilla/5.0'}
outpath=""
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
soup =bs(page)
tag_image=soup.findAll("img")
for image in tag_image:
        print "Image: %(src)s" % image
        urlretrieve(image["src"], "/home/mayank/Desktop/test") 

After running the program, I get the following error and stack trace:

Image: //upload.wikimedia.org/wikipedia/commons/thumb/0/04/Pune_Montage.JPG/250px-Pune_Montage.JPG
Traceback (most recent call last):
  File "download_images.py", line 15, in <module>
    urlretrieve(image["src"], "/home/mayank/Desktop/test")
  File "/usr/lib/python2.7/urllib.py", line 93, in urlretrieve
    return _urlopener.retrieve(url, filename, reporthook, data)
  File "/usr/lib/python2.7/urllib.py", line 239, in retrieve
    fp = self.open(url, data)
  File "/usr/lib/python2.7/urllib.py", line 207, in open
    return getattr(self, name)(url)
  File "/usr/lib/python2.7/urllib.py", line 460, in open_file
    return self.open_ftp(url)
  File "/usr/lib/python2.7/urllib.py", line 543, in open_ftp
    ftpwrapper(user, passwd, host, port, dirs)
  File "/usr/lib/python2.7/urllib.py", line 864, in __init__
    self.init()
  File "/usr/lib/python2.7/urllib.py", line 870, in init
    self.ftp.connect(self.host, self.port, self.timeout)
  File "/usr/lib/python2.7/ftplib.py", line 132, in connect
    self.sock = socket.create_connection((self.host, self.port), self.timeout)
  File "/usr/lib/python2.7/socket.py", line 571, in create_connection
    raise err
IOError: [Errno ftp error] [Errno 111] Connection refused

Could someone please help me understand what is causing this error?

【Comments】:

    Tags: python web-crawler beautifulsoup


    【Solution 1】:

    // is shorthand for the current protocol. It seems Wikipedia is using this protocol-relative shorthand, so you have to specify HTTP explicitly instead of the FTP that Python assumes for some reason:

    for image in tag_image:
        src = 'http:' + image
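
    For reference, the reason Python ends up trying FTP is that a protocol-relative URL has no scheme at all, so urllib's opener falls through open_file into open_ftp (as the traceback shows). A quick check, using the same Python 2 urlparse module as the question, with the example URL taken from the output above:

    import urlparse

    src = "//upload.wikimedia.org/wikipedia/commons/thumb/0/04/Pune_Montage.JPG/250px-Pune_Montage.JPG"
    print urlparse.urlparse(src).scheme   # prints an empty string: no scheme given
    print 'http:' + src                   # prepending a scheme gives a fully qualified URL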
    

    【Discussion】:

    • Thanks @Blender: this solved my problem, but I would like to add one thing so that anyone who later refers to this question is not misled. Prepending http to image as written in the answer does not work; instead I did: urlretrieve('http:' + image["src"], outpath) — see the full sketch below.