【Question Title】: Scraping a different image every day from a URL
【Posted】: 2018-12-31 07:34:21
【Question】:

I am trying to write a Python script that downloads the image that is updated daily on this site:

https://apod.nasa.gov/apod/astropix.html

I tried to follow the top answer to this post: How to extract and download all images from a website using beautifulSoup?

So this is what my code currently looks like:

import re
import requests
from bs4 import BeautifulSoup

site = 'https://apod.nasa.gov/apod/astropix.html'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]


for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative 
            # if it is provide the base url which also happens 
            # to be the site variable atm. 
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)

However, when I run my program I get this error:

Traceback on line 17
with open(filename.group(1), 'wb') as f:
AttributeError: 'NoneType' object has no attribute 'group'

So it seems there may be a problem with my regex?
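For context, `re.search` returns `None` when the pattern finds no match, and calling `.group()` on `None` raises exactly this `AttributeError`. A minimal sketch (with hypothetical sample `src` values) showing how the pattern from the question can fail on a `src` that has no `/` before the filename:

```python
import re

# the exact pattern from the question
pattern = r'/([\w_-]+[.](jpg|gif|png))$'

# hypothetical src values an <img> tag might carry
urls = ['image/1807/FermiFinals1024.jpg', 'banner.gif']

for url in urls:
    m = re.search(pattern, url)
    if m is None:
        # 'banner.gif' has no '/' before the name, so the pattern fails
        print('no match for', url)
    else:
        print('matched', m.group(1))
```

Guarding for `None` before calling `.group()` would at least surface *which* URL is failing instead of crashing.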

【Comments】:

Tags: python


【Solution 1】:

The regex group() you are after is 0, not 1; it holds the image filename. Also, the URL is assembled incorrectly when the image source path is relative. I used the built-in urllib module to parse the site URL:

import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

site = 'https://apod.nasa.gov/apod/astropix.html'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]

for url in urls:
    filename = re.search(r'([\w_-]+[.](jpg|gif|png))$', url)
    filename = re.sub(r'\d{4,}\.', '.', filename.group(0))

    with open(filename, 'wb') as f:
        if 'http' not in url:
            # sometimes an image source is relative; rebuild an
            # absolute URL from the page's scheme and hostname
            hostname = urlparse(site).hostname
            scheme = urlparse(site).scheme
            url = '{}://{}/{}'.format(scheme, hostname, url)

        # for full resolution image the last four digits needs to be striped
        url = re.sub(r'\d{4,}\.', '.', url)

        print('Fetching image from {} to {}'.format(url, filename))
        response = requests.get(url)
        f.write(response.content)

Output:

Fetching image from https://apod.nasa.gov/image/1807/FermiFinals.jpg to FermiFinals.jpg

The image is saved as FermiFinals.jpg.
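As an aside (not part of the original answer), `urllib.parse.urljoin` can resolve a relative `src` against the page URL in one call, which avoids assembling the scheme and hostname by hand:

```python
from urllib.parse import urljoin

site = 'https://apod.nasa.gov/apod/astropix.html'

# urljoin resolves a relative path against the page the tag came from...
print(urljoin(site, 'image/1807/FermiFinals.jpg'))
# https://apod.nasa.gov/apod/image/1807/FermiFinals.jpg

# ...and leaves absolute URLs untouched
print(urljoin(site, 'https://example.com/pic.png'))
# https://example.com/pic.png
```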

【Comments】:

  • This works for me. One more question: is there any way to get a higher image resolution? For example, running this downloads a 181 KB picture, but if I download the picture manually from the site it gives me a 1.48 MB one. I think the difference comes from clicking the picture, opening it in a new tab, and downloading it from there.
  • @K.Hall Yes, you need to strip the last four digits from the URL. I have updated my answer.
【Solution 2】:

I think the problem is the site variable. When all is said and done, it is appending the image path to site, which is https://apod.nasa.gov/apod/astropix.html. If you just strip off astropix.html it works fine. What I have below is just a small modification of what you had; copy/paste and send it!

import re
import requests
from bs4 import BeautifulSoup

site = "https://apod.nasa.gov/apod/astropix.html"
site_path_only = site.replace("astropix.html","")

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]

for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative
            # if it is provide the base url which also happens
            # to be the site variable atm.
            url = '{}{}'.format(site_path_only, url)
        response = requests.get(url)
        f.write(response.content)

Note that if it downloads the image but the file is corrupted and only about 1 KB in size, you are probably getting a 404 for some reason. Just open the "image" in Notepad and read the HTML it returned.
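That manual Notepad check can be automated: `requests` exposes the status code, and `raise_for_status()` turns HTTP errors into exceptions, so a failed download blows up loudly instead of silently writing an error page to disk. A sketch of a download helper built on that idea (the function name and the content-type check are my additions, not from either answer):

```python
import requests

def fetch_image(url):
    """Download an image, failing loudly instead of saving an error page."""
    response = requests.get(url)
    response.raise_for_status()  # raises requests.HTTPError on 404 and friends

    # a 200 response can still be an HTML page rather than an image
    content_type = response.headers.get('Content-Type', '')
    if not content_type.startswith('image/'):
        raise ValueError('expected an image, got {!r}'.format(content_type))

    return response.content
```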

【Comments】:
