Download image from webpage using Python: answers

【Question title】: Download image from webpage using python
【Posted】: 2013-03-11 23:17:59
【Question】:

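I am trying to write a Python script that downloads an image from a webpage. On the page (I am using NASA's Astronomy Picture of the Day page), a new picture is posted every day, each with a different file name.

So my solution was to parse the HTML with HTMLParser, look for "jpg", and write the image's path and file name to an attribute of the HTML parser object (named "output", see the code below).

I am new to Python and OOP (this is my first real Python script), so I am not sure whether this is how it is usually done. Any advice and pointers are welcome.

Here is my code: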

# Python 2 (urllib2 and the HTMLParser module)
import urllib2
from HTMLParser import HTMLParser

# Grab image url
response = urllib2.urlopen('http://apod.nasa.gov/apod/astropix.html')
html = response.read()

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Only parse the 'anchor' tag.
        if tag == "a":
            # Check the list of defined attributes.
            for name, value in attrs:
                # If href is defined and ends in "jpg", store it.
                if name == "href":
                    if value[len(value)-3:len(value)]=="jpg":
                        #print value
                        self.output=value #return the path+file name of the image

parser = MyHTMLParser()
parser.feed(html)
imgurl='http://apod.nasa.gov/apod/'+parser.output

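【Comments】:

If your code is working and you just want comments on possible improvements, you may want to ask the good folks at Code Review: codereview.stackexchange.com
...I didn't know codereview existed... thanks
Also, once you have the URL, if you want to download the image you can use os.system('wget ' + imgurl). You need to import os, and this probably only works on Linux systems.
@xbonez: you should use subprocess instead of os.system(), e.g. subprocess.check_call(['curl', '-O', imgurl])
@xbonez: or without an external process: urllib.urlretrieve(imgurl, 'output.jpg')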
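
The last suggestion can be sketched in Python 3, where the Python 2 function urllib.urlretrieve is available as urllib.request.urlretrieve. This is only a rough sketch, not the commenter's code: the helper names image_url and download_image are hypothetical, and the example path is made up to stand in for parser.output.

```python
import os
from urllib.parse import urljoin, urlsplit
from urllib.request import urlretrieve

PAGE_URL = 'http://apod.nasa.gov/apod/astropix.html'


def image_url(page_url, href):
    """Resolve a relative href found on the page against the page URL."""
    return urljoin(page_url, href)


def download_image(url, directory='.'):
    """Save url into directory under its own file name and return the path.

    download_image is a hypothetical helper, not part of any library.
    """
    # Fall back to 'output.jpg' if the URL path has no file name
    filename = os.path.basename(urlsplit(url).path) or 'output.jpg'
    target = os.path.join(directory, filename)
    urlretrieve(url, target)  # fetches the URL and writes it to target
    return target


if __name__ == '__main__':
    # 'image/1303/example.jpg' is a made-up path standing in for parser.output
    print(image_url(PAGE_URL, 'image/1303/example.jpg'))
```

Using urljoin instead of string concatenation means the result stays correct whether the href on the page is relative or already absolute.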

Tags: python html-parsing web-crawler

【Solution 1】:

To check whether a string ends with "jpg", you can use .endswith() instead of len() and slicing:

if name == "href" and value.endswith("jpg"):
   self.output = value
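
As a small extension of the same idea, str.endswith also accepts a tuple of suffixes, which helps if the page ever links a ".jpeg" or an upper-case ".JPG". A minimal sketch; is_jpeg_link is a hypothetical helper name, and matching on the dot as well as the extension is slightly stricter than the check in the question:

```python
def is_jpeg_link(href):
    """Return True if href ends in a JPEG extension, ignoring case."""
    # str.endswith accepts a tuple and matches if any suffix matches
    return href.lower().endswith(('.jpg', '.jpeg'))


print(is_jpeg_link('image/1303/example.jpg'))   # True
print(is_jpeg_link('image/1303/EXAMPLE.JPG'))   # True
print(is_jpeg_link('astropix.html'))            # False
```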

If the search within the page gets more complex, you can use lxml.html or BeautifulSoup instead of HTMLParser, for example:

from lxml import html

# download & parse web page
doc = html.parse('http://apod.nasa.gov/apod/astropix.html').getroot()

# find <a href that ends with ".jpg" and 
# that has <img child that has src attribute that also ends with ".jpg"
for elem, attribute, link, _ in doc.iterlinks():
    if (attribute == 'href' and elem.tag == 'a' and link.endswith('.jpg') and
        len(elem) > 0 and elem[0].tag == 'img' and
        elem[0].get('src', '').endswith('.jpg')):
        print(link)
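
If installing a third-party package is not an option, a similar nested check can be approximated with the standard library's html.parser in Python 3. This is only a sketch under assumptions: JpgLinkParser and the SAMPLE markup are made up for illustration, and it accepts an <img> anywhere inside the <a>, not only as its first child as the lxml code above requires.

```python
from html.parser import HTMLParser


class JpgLinkParser(HTMLParser):
    """Collect hrefs of <a> tags that link to a .jpg and wrap a .jpg <img>."""

    def __init__(self):
        super().__init__()
        self.links = []          # matching hrefs, in document order
        self._open_href = None   # .jpg href of the <a> we are inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a':
            href = attrs.get('href') or ''
            self._open_href = href if href.endswith('.jpg') else None
        elif tag == 'img' and self._open_href is not None:
            if (attrs.get('src') or '').endswith('.jpg'):
                self.links.append(self._open_href)
                self._open_href = None  # record each link only once

    def handle_endtag(self, tag):
        if tag == 'a':
            self._open_href = None


# SAMPLE is made-up markup for illustration, not the real APOD page
SAMPLE = '''
<a href="image/big.jpg"><img src="image/small.jpg" alt="today"></a>
<a href="archive.html"><img src="icon.jpg"></a>
<a href="image/other.jpg">text only</a>
'''

parser = JpgLinkParser()
parser.feed(SAMPLE)
print(parser.links)  # ['image/big.jpg']
```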

【Discussion】: