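【Question title】: Parsing Google Images search results
【Posted】: 2013-10-24 11:33:54
【Question description】:

I'm having trouble parsing Google Images search results. I tried doing it with selenium webdriver. It returned 100 results, but it is slow. I then requested the page with the requests module instead, and that returned only 20 results. How can I get the same 100 results? Is there any way to paginate?
Here is the selenium code: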

import re
from selenium.webdriver.common.by import By

# `driver`, `lines` and `normalize_search_url` are defined elsewhere in the asker's script.
_url = r'imgurl=([^&]+)&'

for search_url in lines:
    driver.get(normalize_search_url(search_url))

    images = driver.find_elements(By.XPATH, "//div[@class='rg_di']")
    print("{0} results for {1}".format(len(images), ' '.join(driver.title.split(' ')[:-3])))
    # text mode, because the extracted URLs are str, not bytes
    with open('urls/{0}.txt'.format(search_url.strip().replace('\t', '_')), 'a') as f:
        for image in images:
            url = image.find_element(By.TAG_NAME, "a")
            u = re.findall(_url, url.get_attribute("href"))
            for item in u:
                f.write(item)
                f.write('\n')

Here is the requests code:

import re
import requests
from bs4 import BeautifulSoup

# `lines` and `normalize_search_url` are defined elsewhere in the asker's script.
_url = r'imgurl=([^&]+)&'

for search_url in lines[:10]:
    print(normalize_search_url(search_url))
    links = 0
    request = requests.get(normalize_search_url(search_url))
    soup = BeautifulSoup(request.text, 'html.parser')
    file = 'cars2/{0}.txt'.format(search_url.strip().replace(' ', '_'))
    with open(file, 'a') as f:
        for image in soup.find_all('a'):
            href = image.get('href') or ''  # some <a> tags have no href
            if 'imgurl' in href:
                links += 1
            u = re.findall(_url, href)
            for item in u:
                f.write(item)
                f.write('\n')
                print(item)
        # soup.title.string is the page title text (soup.title.name is just 'title')
        print("{0} links extracted for {1}".format(links, ' '.join(soup.title.string.split(' ')[:-3])))

【Question discussion】:

    Tags: python selenium python-requests


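    【Solution 1】:

    I have never tried this with selenium, but have you tried Google's search engine API? It might work for you: https://developers.google.com/products/#google-search

    Also, their limit on the API is 100 requests per day, so I don't think you will get more than 100.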
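As a rough sketch of what such an API call could look like, the snippet below builds request URLs for Google's Custom Search JSON API with `searchType=image`. The answer only links to the product listing, so the choice of this particular API is an assumption. `API_KEY` and `ENGINE_ID` are placeholders, and `build_image_search_url` is a hypothetical helper. The API returns at most 10 results per request, so paging is done with `start`.

```python
from urllib.parse import urlencode

# Assumption: the Custom Search JSON API endpoint; the answer does not name a specific API.
ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_image_search_url(query, api_key, engine_id, start=1, num=10):
    """Build a request URL for one page of image results (hypothetical helper)."""
    params = {
        "key": api_key,         # placeholder API key
        "cx": engine_id,        # placeholder search engine ID
        "q": query,
        "searchType": "image",  # restrict results to images
        "start": start,         # 1-based index of the first result
        "num": num,             # results per request (the API allows at most 10)
    }
    return ENDPOINT + "?" + urlencode(params)


# Print the URLs for the first three pages of 10 results each.
for start in (1, 11, 21):
    print(build_image_search_url("cars", "API_KEY", "ENGINE_ID", start=start))
```

Each URL can then be fetched with `requests.get(url).json()`; no request is made in the sketch itself.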

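    【Discussion】:

      【Solution 2】:

      You can scrape Google Images with the beautifulsoup and requests libraries; selenium is not required.

      To get images in batches of 100, use the ijn query parameter: "ijn=0" -> the first 100 images, "ijn=1" -> the next 100 (200 in total), and so on.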
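A minimal sketch of paging with `ijn`, using the same query parameters as the full example in this answer. `build_page_params` and `build_page_urls` are hypothetical helpers, and whether Google still honors `ijn` for a given request is not guaranteed.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.google.com/search"


def build_page_params(query, page):
    """Return the query parameters for one batch of image results.

    `page` is sent as `ijn`; per the answer, each value yields a batch of 100 images.
    """
    return {
        "q": query,
        "tbm": "isch",     # image search
        "hl": "en",
        "ijn": str(page),  # batch number, starting at 0
    }


def build_page_urls(query, pages):
    """Build the search URLs for the first `pages` batches."""
    return [SEARCH_URL + "?" + urlencode(build_page_params(query, p)) for p in range(pages)]


for url in build_page_urls("pexels cat", 3):
    print(url)
```

With requests, the same dictionaries can be passed directly: `requests.get(SEARCH_URL, params=build_page_params("pexels cat", 1), headers=headers)`.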

      To scrape full-resolution image URLs with requests and beautifulsoup, you need to extract the data from the page source with regex.

      Find all <script> tags:

      all_script_tags = soup.select('script')
      

      Match the image data inside the <script> tags with regex:

      matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
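To see what this pattern captures, here is a small sketch that runs it on a made-up string standing in for `str(all_script_tags)`. The sample content is an assumption for illustration only; real pages are far larger and their payloads differ.

```python
import re

# Synthetic stand-in for str(all_script_tags).
sample_scripts = (
    "[<script>var a = 1;</script>, "
    "<script>AF_initDataCallback({key: 'ds:1', data: [1, 2, 3]});</script>, "
    "<script>AF_initDataCallback({key: 'ds:2', data: ['x']});</script>]"
)

# Same pattern as in the answer: capture everything between "AF_initDataCallback("
# and ");". [^<]+ cannot cross a "<", so a match never runs past its own <script> tag.
pattern = r"AF_initDataCallback\(([^<]+)\);"

payloads = re.findall(pattern, sample_scripts)
matched_images_data = ''.join(payloads)

print(payloads)
```

The first `<script>` has no callback and is skipped; the other two each contribute one captured payload, which `''.join` then concatenates into a single string.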
      

      Match the desired (full-resolution) images with regex:

      # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
      # json.loads() on the raw data throws an error:
      # "Expecting property name enclosed in double quotes"
      # Note: json.dumps() followed by json.loads() round-trips the text, so the
      # result is still a plain str, which is then searched with regex.
      matched_images_data_fix = json.dumps(matched_images_data)
      matched_images_data_json = json.loads(matched_images_data_fix)
      
      matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                          matched_images_data_json)
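The pattern expects `,,[` (or `',[`) right before the URL. In the full example further down, thumbnail entries are removed first, which presumably is what leaves that empty slot in front of each full-resolution entry. The sketch below reproduces this on made-up data; the sample layout is an assumption modeled on the regexes, not a real Google response.

```python
import re

# Made-up data: each record holds a thumbnail entry followed by a full-resolution entry.
sample = (
    r'[1,[0,"id1",["https://encrypted-tbn0.gstatic.com/images?q=tbn:abc",183,275],'
    r'["https://example.com/full1.jpg",800,1200]],'
    r'[1,[0,"id2",["https://encrypted-tbn0.gstatic.com/images?q=tbn:def",100,100],'
    r'["http://example.org/full2.png",640,480]]]'
)

thumbnail_pattern = r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]'
full_res_pattern = r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]"

# Removing each thumbnail entry leaves ',,' directly before the full-resolution entry.
without_thumbnails = re.sub(thumbnail_pattern, '', sample)

full_res_urls = re.findall(full_res_pattern, without_thumbnails)
print(full_res_urls)
```

Running the full-resolution pattern on `sample` directly finds nothing, because without the removal step no entry is preceded by `,,[`.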
      

      Extract and decode them with bytes() and decode():

      for fixed_full_res_image in matched_google_full_resolution_images:
          original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
          original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
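A short sketch of why two decoding passes are used, assuming the URL reaches this step with doubled backslashes (for example after passing through `str()` of a list). The sample URL is made up and `decode_twice` is a hypothetical helper.

```python
def decode_twice(escaped_url):
    """Apply unicode-escape decoding twice, as in the loop above."""
    once = bytes(escaped_url, 'ascii').decode('unicode-escape')
    return bytes(once, 'ascii').decode('unicode-escape')


# Raw string: each escape is written with two backslashes, e.g. \\u003d.
sample = r"https://example.com/cat.jpg?auto\\u003dcompress\\u0026w\\u003d500"

# First pass: \\u003d becomes \u003d (the doubled backslash collapses to one).
once = bytes(sample, 'ascii').decode('unicode-escape')
print(once)

# Second pass: \u003d becomes "=" and \u0026 becomes "&".
print(decode_twice(sample))
```

Note that `bytes(..., 'ascii')` raises `UnicodeEncodeError` if the string contains non-ASCII characters, so this approach only fits URLs that are still fully escaped.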
      

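      The full code is below (the original answer also linked a full example in an online IDE). Note that this listing prints the full-resolution URLs rather than saving the files: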

      import requests, lxml, re, json
      from bs4 import BeautifulSoup
      
      headers = {
          "User-Agent":
          "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
      }
      
      params = {
          "q": "pexels cat",
          "tbm": "isch", 
          "hl": "en",
          "ijn": "0",
      }
      
      html = requests.get("https://www.google.com/search", params=params, headers=headers)
      soup = BeautifulSoup(html.text, 'lxml')
      
      
      def get_images_data():
      
          print('\nGoogle Images Metadata:')
          for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
              title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
              source = google_image.select_one('.fxgdke').text
              link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
              print(f'{title}\n{source}\n{link}\n')
      
          # these steps could be refactored into something more compact
          all_script_tags = soup.select('script')
      
          # https://regex101.com/r/48UZhY/4
          matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
          
          # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
          # json.loads() on the raw data throws an error:
          # "Expecting property name enclosed in double quotes"
          # Note: json.dumps() followed by json.loads() round-trips the text, so the
          # result is still a plain str, which is then searched with regex.
          matched_images_data_fix = json.dumps(matched_images_data)
          matched_images_data_json = json.loads(matched_images_data_fix)
      
          # https://regex101.com/r/pdZOnW/3
          matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)
      
          # https://regex101.com/r/NnRg27/1
          matched_google_images_thumbnails = ', '.join(
              re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                         str(matched_google_image_data))).split(', ')
      
          print('Google Image Thumbnails:')  # in order
          for fixed_google_image_thumbnail in matched_google_images_thumbnails:
              # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
              google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')
      
              # after first decoding, Unicode characters are still present. After the second iteration, they were decoded.
              google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
              print(google_image_thumbnail)
      
          # removing previously matched thumbnails for easier full resolution image matches.
          removed_matched_google_images_thumbnails = re.sub(
              r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))
      
          # https://regex101.com/r/fXjfb1/4
          # https://stackoverflow.com/a/19821774/15164646
          matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                             removed_matched_google_images_thumbnails)
      
      
          print('\nFull Resolution Images:')  # in order
          for index, fixed_full_res_image in enumerate(matched_google_full_resolution_images):
              # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
              original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
              original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
              print(original_size_img)
      
      
      
      get_images_data()
      
      
      -------------
      '''
      Google Images Metadata:
      9,000+ Best Cat Photos · 100% Free Download · Pexels Stock Photos
      pexels.com
      https://www.pexels.com/search/cat/
      ...
      
      Google Image Thumbnails:
      https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR2cZsuRkkLWXOIsl9BZzbeaCcI0qav7nenDvvqi-YSm4nVJZYyljRsJZv6N5vS8hMNU_w&usqp=CAU
      ...
      
      Full Resolution Images:
      https://images.pexels.com/photos/1170986/pexels-photo-1170986.jpeg?cs=srgb&dl=pexels-evg-culture-1170986.jpg&fm=jpg
      https://images.pexels.com/photos/3777622/pexels-photo-3777622.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500
      ...
      '''
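
The listing above prints the full-resolution URLs but does not write any files. A minimal sketch of a download step follows, using only the standard library. `fetch_bytes` and `save_images` are hypothetical helpers, the User-Agent value is a placeholder, and naming every file `.jpg` is a simplification.

```python
import os
import urllib.request


def fetch_bytes(url):
    """Download a URL with a browser-like User-Agent (placeholder value)."""
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def save_images(urls, out_dir, fetch=fetch_bytes):
    """Save each URL as image_<index>.jpg in out_dir and return the written paths.

    The .jpg extension is a simplification; real URLs may point to other formats.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for index, url in enumerate(urls):
        path = os.path.join(out_dir, f"image_{index}.jpg")
        try:
            data = fetch(url)
        except OSError as error:  # urllib.error.URLError is a subclass of OSError
            print(f"skipping {url}: {error}")
            continue
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)
    return paths
```

It could be called with the decoded URLs collected in the loop, for example `save_images(decoded_urls, "images")`, where `decoded_urls` is a list you build from `original_size_img`. Passing `fetch` as a parameter keeps the file handling testable without network access.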
      

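      Alternatively, you can use the Google Images API from SerpApi to achieve the same thing. It is a paid API with a free plan.

      The difference is that you don't have to deal with regex, bypass blocks from Google, or maintain the parser over time when something breaks. Instead, you only need to iterate over structured JSON and pick out the data you want.

      Code to integrate: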

      import os, json # json for pretty output
      from serpapi import GoogleSearch
      
      def get_google_images():
          params = {
            "api_key": os.getenv("API_KEY"),
            "engine": "google",
            "q": "pexels cat",
            "tbm": "isch"
          }
      
          search = GoogleSearch(params)
          results = search.get_dict()
      
          print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))
      
      
      get_google_images()
      
      ---------------
      '''
      [
      ... # other images 
        {
          "position": 100, # img number
          "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRR1FCGhFsr_qZoxPvQBDjVn17e_8bA5PB8mg&usqp=CAU",
          "source": "pexels.com",
          "title": "Close-up of Cat · Free Stock Photo",
          "link": "https://www.pexels.com/photo/close-up-of-cat-320014/",
          "original": "https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500",
          "is_product": false
        }
      ]
      '''
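
Given a response shaped like the output above, collecting the full-resolution URLs is a plain loop over `images_results`. The sketch below runs on a hand-written sample dictionary rather than a live API call; `extract_original_urls` is a hypothetical helper and the second sample entry is made up.

```python
def extract_original_urls(results):
    """Return the full-resolution URLs from a response dict shaped like the output above."""
    return [
        image["original"]
        for image in results.get("images_results", [])
        if "original" in image  # skip entries that lack a full-resolution URL
    ]


sample_results = {
    "images_results": [
        {
            "position": 100,
            "source": "pexels.com",
            "title": "Close-up of Cat · Free Stock Photo",
            "link": "https://www.pexels.com/photo/close-up-of-cat-320014/",
            "original": "https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500",
        },
        {"position": 101, "title": "entry without an original URL"},  # made-up entry
    ]
}

print(extract_original_urls(sample_results))
```

In the answer's code, the same function could be applied to the dictionary returned by `search.get_dict()`.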
      

      P.S. I wrote more in-depth blog posts on how to scrape Google Images and on how to reduce the chance of being blocked while web scraping search engines.

      Disclaimer: I work for SerpApi.

      【Discussion】:
