Download images from Google image search (Python): answers

【Question title】: Download images from google image search (python)
【Posted】: 2014-09-27 19:45:19
【Question description】:

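I am a web scraping beginner. I first followed https://www.youtube.com/watch?v=ZAUNEEtzsrg to download images with a specific tag (e.g. cat), and it works! But I ran into a new problem: I can only download about 100 images. It looks like an "ajax" issue, where only the first page of HTML is loaded rather than all of it. So it seems we have to simulate scrolling down to download the next 100 images or more.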

My code: https://drive.google.com/file/d/0Bwjk-LKe_AohNk9CNXVQbGRxMHc/edit?usp=sharing

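To sum up, my questions are:

How can I download all the images of a Google image search from source code in Python? (Please give me some examples :))
Are there any web scraping techniques I need to know?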

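【Question discussion】:

Did you find a solution? I need to download at least 500 images and I have the same problem. It seems that no scraping method or Google API can download more than 100 images.
@Ozg, not yet... it is too hard. If you have a solution, please share it with me, thanks.
Hey @RyanLiu, did you find any solution?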

Tags: python ajax web-scraping web-crawler google-image-search

【Solution 1】:

Use the Google API to get the results, so replace your URL with something like this:

https://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=cat&rsz=8&start=0

You will get 8 results. Then call the service again with start=8 to get the next 8, and so on until you receive an error.

The returned data is in JSON format.

Here is a Python example I found on the web:

import urllib2
import simplejson

url = ('https://ajax.googleapis.com/ajax/services/search/images?' +
       'v=1.0&q=barack%20obama&userip=INSERT-USER-IP')

# 'http://example.com' is a placeholder: enter the URL of your site here.
request = urllib2.Request(url, None, {'Referer': 'http://example.com'})
response = urllib2.urlopen(request)

# Process the JSON string.
results = simplejson.load(response)
# now have some fun with the results...
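
The paging described above (8 results per call, advancing `start` each time) could be sketched roughly as follows. This is only an illustration written for Python 3: the helper name `page_urls` is made up here, it assumes `start` is a zero-based offset so that the second page begins at 8, and the default cap of 64 is taken from the comment in the discussion below rather than from any documented limit. It only builds the request URLs and does not contact the service.

```python
from urllib.parse import urlencode

BASE_URL = "https://ajax.googleapis.com/ajax/services/search/images"


def page_urls(query, page_size=8, max_results=64):
    """Return one request URL per page of results (hypothetical helper).

    page_size mirrors rsz=8 from the answer; max_results=64 is an assumption
    based on the cap reported in the discussion, not a documented limit.
    """
    urls = []
    for start in range(0, max_results, page_size):
        # start is the offset of the first result on this page.
        params = {"v": "1.0", "q": query, "rsz": page_size, "start": start}
        urls.append(BASE_URL + "?" + urlencode(params))
    return urls


for url in page_urls("cat"):
    print(url)
```

Each URL would then be requested in turn and its JSON parsed as in the example above, stopping at the first error response.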

As for web scraping techniques, there is this page: http://jakeaustwick.me/python-web-scraping-resource

Hope it helps.

【Discussion】:

Thanks, but I tried that before and it seems to provide only 64 results [1]:(stackoverflow.com/questions/3521121/…)

【Solution 2】:

To get 100 results, try this:

# Python 2 code: FancyURLopener and urlparse come from the Python 2 standard library.
from urllib import FancyURLopener
import re
import posixpath
import urlparse

# Send a mobile browser User-Agent with every request.
class MyOpener(FancyURLopener, object):
    version = "Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30"

myopener = MyOpener()

# Fetch the first page of image search results.
page = myopener.open('https://www.google.pt/search?q=love&biw=1600&bih=727&source=lnms&tbm=isch&sa=X&tbs=isz:l&tbm=isch')
html = page.read()

# Each result link carries the original image URL in its imgurl parameter.
for match in re.finditer(r'<a href="http://www\.google\.pt/imgres\?imgurl=(.*?)&amp;imgrefurl', html, re.IGNORECASE | re.DOTALL | re.MULTILINE):
    # Use the last path component of the image URL as the local file name.
    path = urlparse.urlsplit(match.group(1)).path
    filename = posixpath.basename(path)
    myopener.retrieve(match.group(1), filename)

Adjust biw=1600&bih=727 to get bigger or smaller images.
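
The extraction step in the code above can be tried offline. The following is a small sketch written for Python 3, not the answerer's code: `extract_image_urls`, `filename_for` and `sample_html` are names invented here, and the sample markup is a hypothetical stand-in for a downloaded result page. The pattern is the one from the answer and depends on the markup Google served at the time, so it may well match nothing on a current page.

```python
import posixpath
import re
from urllib.parse import urlsplit

# Same pattern as in the answer; it assumes the result page markup of that era.
IMGRES_PATTERN = re.compile(
    r'<a href="http://www\.google\.pt/imgres\?imgurl=(.*?)&amp;imgrefurl',
    re.IGNORECASE | re.DOTALL,
)


def extract_image_urls(html):
    """Return every imgurl value found in the result page HTML."""
    return [match.group(1) for match in IMGRES_PATTERN.finditer(html)]


def filename_for(image_url):
    """Derive a local file name from the path part of an image URL."""
    return posixpath.basename(urlsplit(image_url).path)


# Hypothetical snippet standing in for a downloaded result page.
sample_html = (
    '<a href="http://www.google.pt/imgres?imgurl=http://example.com/pics/cat.jpg'
    '&amp;imgrefurl=http://example.com/page">'
    '<a href="http://www.google.pt/imgres?imgurl=http://example.com/a/dog.png?size=large'
    '&amp;imgrefurl=http://example.com/other">'
)

for image_url in extract_image_urls(sample_html):
    print(image_url, "->", filename_for(image_url))
```

Splitting the URL before taking the base name keeps query strings such as `?size=large` out of the saved file name.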

【Discussion】:

【Solution 3】:

My final solution is to use icrawler.

from icrawler.examples import GoogleImageCrawler

google_crawler = GoogleImageCrawler('your_image_dir')
google_crawler.crawl(keyword='sunny', offset=0, max_num=1000,
                     date_min=None, date_max=None, feeder_thr_num=1,
                     parser_thr_num=1, downloader_thr_num=4,
                     min_size=(200,200), max_size=None)

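The advantage of the framework is that it has 5 built-in crawlers (google, bing, baidu, flickr and general crawl), but it still only provides 100 images when crawling from google.

【Discussion】:

Although I have installed icrawler, I get "No module named icrawler" on both Windows and Ubuntu. How can we solve this?
The downloader threads exit without downloading anything. Any ideas?
@JudyTRAj Google recently changed the API and no longer provides JSON. Here is some information for reference.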

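【Solution 4】:

If you have any questions about icrawler, you can open an issue on Github, which may get a faster response.

The limit on the number of Google search results seems to be 1000. One workaround is to define a date range, as follows.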

from datetime import date
from icrawler.builtin import GoogleImageCrawler

google_crawler = GoogleImageCrawler(
    parser_threads=2, 
    downloader_threads=4,
    storage={'root_dir': 'your_image_dir'})
google_crawler.crawl(
    keyword='sunny',
    max_num=1000,
    date_min=date(2014, 1, 1),
    date_max=date(2015, 1, 1))
google_crawler.crawl(
    keyword='sunny',
    max_num=1000,
    date_min=date(2015, 1, 1),
    date_max=date(2016, 1, 1))
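
If more than two date ranges are needed, the windows could be generated rather than written out by hand. Below is a minimal sketch; `yearly_windows` is a hypothetical helper, not part of icrawler, and the commented-out call simply reuses the `date_min`/`date_max` arguments shown in the answer above.

```python
from datetime import date


def yearly_windows(first_year, last_year):
    """Return (date_min, date_max) pairs, one per year, last_year inclusive."""
    return [
        (date(year, 1, 1), date(year + 1, 1, 1))
        for year in range(first_year, last_year + 1)
    ]


for date_min, date_max in yearly_windows(2014, 2015):
    print(date_min, date_max)
    # With the crawler from the answer above, each window would be passed as:
    # google_crawler.crawl(keyword='sunny', max_num=1000,
    #                      date_min=date_min, date_max=date_max)
```

Calling `yearly_windows(2014, 2015)` yields the same two ranges as the two `crawl` calls in the answer.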

【Discussion】: