【Question Title】: Scraping a different image every day from a URL
【Posted】: 2018-12-31 07:34:21
【Question】:

I am trying to write a Python script that downloads the image that is updated daily on this site:

https://apod.nasa.gov/apod/astropix.html

I tried to follow the top answer to this post: How to extract and download all images from a website using beautifulSoup?

So this is what my code currently looks like:

import re
import requests
from bs4 import BeautifulSoup

site = 'https://apod.nasa.gov/apod/astropix.html'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]


for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative 
            # if it is provide the base url which also happens 
            # to be the site variable atm. 
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)

However, when I run my program I get this error:

Traceback on line 17
with open(filename.group(1), 'wb') as f:
AttributeError: 'NoneType' object has no attribute 'group'

So it seems there may be a problem with my regex?
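For context, `re.search` returns `None` when the pattern finds no match, and calling `.group()` on `None` raises exactly this `AttributeError`. A minimal sketch (with hypothetical sample `src` values) showing how the pattern from the question can fail on a `src` that has no `/` before the filename:

```python
import re

# the exact pattern from the question
pattern = r'/([\w_-]+[.](jpg|gif|png))$'

# hypothetical src values an <img> tag might carry
urls = ['image/1807/FermiFinals1024.jpg', 'banner.gif']

for url in urls:
    m = re.search(pattern, url)
    if m is None:
        # 'banner.gif' has no '/' before the name, so the pattern fails
        print('no match for', url)
    else:
        print('matched', m.group(1))
```

Guarding for `None` before calling `.group()` would at least surface *which* URL is failing instead of crashing.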

【Comments】:

Tags: python


【Solution 1】:

The regex group() you are after is 0, not 1; it holds the image filename. Also, the URL is assembled incorrectly when the image source path is relative. I used the built-in urllib module to parse the site URL:

import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

site = 'https://apod.nasa.gov/apod/astropix.html'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]

for url in urls:
    filename = re.search(r'([\w_-]+[.](jpg|gif|png))$', url)
    filename = re.sub(r'\d{4,}\.', '.', filename.group(0))

    with open(filename, 'wb') as f:
        if 'http' not in url:
            # sometimes an image source is relative; rebuild an
            # absolute URL from the page's scheme and hostname
            hostname = urlparse(site).hostname
            scheme = urlparse(site).scheme
            url = '{}://{}/{}'.format(scheme, hostname, url)

        # for full resolution image the last four digits needs to be striped
        url = re.sub(r'\d{4,}\.', '.', url)

        print('Fetching image from {} to {}'.format(url, filename))
        response = requests.get(url)
        f.write(response.content)

Output:

Fetching image from https://apod.nasa.gov/image/1807/FermiFinals.jpg to FermiFinals.jpg

The image is saved as FermiFinals.jpg.
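As an aside (not part of the original answer), `urllib.parse.urljoin` can resolve a relative `src` against the page URL in one call, which avoids assembling the scheme and hostname by hand:

```python
from urllib.parse import urljoin

site = 'https://apod.nasa.gov/apod/astropix.html'

# urljoin resolves a relative path against the page the tag came from...
print(urljoin(site, 'image/1807/FermiFinals.jpg'))
# https://apod.nasa.gov/apod/image/1807/FermiFinals.jpg

# ...and leaves absolute URLs untouched
print(urljoin(site, 'https://example.com/pic.png'))
# https://example.com/pic.png
```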

【Comments】:

  • This works for me. One more question: is there any way to get a higher image resolution? For example, running this downloads a 181 KB picture, but if I download the picture manually from the site it gives me a 1.48 MB one. I think the difference comes from clicking the picture, opening it in a new tab, and downloading it from there.
  • @K.Hall Yes, you need to strip the last four digits from the URL. I have updated my answer.
【Solution 2】:

I think the problem is the site variable. When all is said and done, it is appending the image path to site, which is https://apod.nasa.gov/apod/astropix.html. If you just strip off astropix.html it works fine. What I have below is just a small modification of what you had; copy/paste and send it!

import re
import requests
from bs4 import BeautifulSoup

site = "https://apod.nasa.gov/apod/astropix.html"
site_path_only = site.replace("astropix.html","")

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]

for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative
            # if it is provide the base url which also happens
            # to be the site variable atm.
            url = '{}{}'.format(site_path_only, url)
        response = requests.get(url)
        f.write(response.content)

Note that if it downloads the image but the file is corrupted and only about 1 KB in size, you are probably getting a 404 for some reason. Just open the "image" in Notepad and read the HTML it returned.
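That manual Notepad check can be automated: `requests` exposes the status code, and `raise_for_status()` turns HTTP errors into exceptions, so a failed download blows up loudly instead of silently writing an error page to disk. A sketch of a download helper built on that idea (the function name and the content-type check are my additions, not from either answer):

```python
import requests

def fetch_image(url):
    """Download an image, failing loudly instead of saving an error page."""
    response = requests.get(url)
    response.raise_for_status()  # raises requests.HTTPError on 404 and friends

    # a 200 response can still be an HTML page rather than an image
    content_type = response.headers.get('Content-Type', '')
    if not content_type.startswith('image/'):
        raise ValueError('expected an image, got {!r}'.format(content_type))

    return response.content
```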

【Comments】:
