【Question Title】: Beautiful Soup image scraper problems
【Posted】: 2016-01-09 02:17:43
【Question Description】:

I'm getting the following traceback:

Traceback (most recent call last):
  File "/home/ro/image_scrape_test.py", line 20, in <module>
    soup = BeautifulSoup(searched, "lxml")
  File "/usr/local/lib/python3.4/dist-packages/bs4/__init__.py", line 176, in __init__
    elif len(markup) <= 256:
TypeError: object of type 'NoneType' has no len()

Here's my current code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import urllib

#searches google images
driver = webdriver.Firefox()
google_images = ("https://www.google.com/search?site=imghp&tbm=isch source=hp&biw=1366&bih=648&q=")
search_term = input("what is your search term")
searched = driver.get("{0}{1}".format(google_images, search_term))

def savepic(url):
    uri = ("/home/ro/image scrape/images/download.jpg")
    if url != "":
        urllib.urlretrieve(url, uri)

soup = BeautifulSoup(searched, "lxml")
soup1 = soup.content
images = soup1.find_all("a")

for image in images:
    savepic(image)

I'm just starting out, so I'd appreciate any tips on how to improve my code. Thank you.

【Question Comments】:

    Tags: python selenium selenium-webdriver web-scraping beautifulsoup


    【Solution 1】:

    driver.get() loads the page in the browser and returns None, which leaves the searched variable holding a None value.

    You probably meant to get .page_source instead:

    soup = BeautifulSoup(driver.page_source, "lxml")
    

    Two more points here:

    • You don't actually need BeautifulSoup here - you can locate the desired images with selenium itself, e.g. via driver.find_elements_by_tag_name()
    • I haven't tested your code, but I think you'll need to add additional Explicit Waits to make selenium wait for the page to load

    【Comments】:

      【Solution 2】:

      searched is None. Apparently, the URL you are using is not valid.

      【Comments】:

        【Solution 3】:

        You can scrape Google Images using just the beautifulsoup and requests libraries; selenium is not required.

        For example, if you only want to extract thumbnails (small resolution size), you can pass a "content-type": "image/png" query param (solution from MendelG), which will return thumbnail links.

        import requests
        from bs4 import BeautifulSoup
        
        params = {
            "q": "batman wallpaper",
            "tbm": "isch", 
            "content-type": "image/png",
        }
        
        html = requests.get("https://www.google.com/search", params=params)
        soup = BeautifulSoup(html.text, 'html.parser')
        
        for img in soup.select("img"):
            print(img["src"])
        
        # https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQAxU74QyJ8jn8Qq0ZK3ur_GkxjICcvmiC30DWnk03DEsi7YUgS8XXksdyybXY&s
        # https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRh5Fhah5gT9msG7vhXeQzAziS17Jp1HE_wE5O00113DtE2rJztgvxwRSonAno&s
        # ...
        

        To scrape the full-resolution image URLs with requests and beautifulsoup, you need to extract the data from the page's source code via regex.

        Find all <script> tags:

        soup.select('script')
        

        Match the image data via regex:

        matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
        

        Match the desired (full-resolution) images via regex:

        # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
        # if you try to json.loads() without json.dumps() it will throw an error:
        # "Expecting property name enclosed in double quotes"
        matched_images_data_fix = json.dumps(matched_images_data)
        matched_images_data_json = json.loads(matched_images_data_fix)
        
        matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                            matched_images_data_json)
        

        Extract and decode them using bytes() and decode():

        for fixed_full_res_image in matched_google_full_resolution_images:
            original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
            original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
        
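        The decode runs twice because the URLs in the page source are escaped twice. A quick illustration with a made-up URL (not taken from an actual Google response):

```python
# in the raw page source, '=' shows up doubly escaped as \\u003d
double_escaped = "https://example.com/img?id\\\\u003d42"

# first pass: \\ becomes \ , leaving a single \u003d escape behind
first_pass = bytes(double_escaped, 'ascii').decode('unicode-escape')
print(first_pass)   # https://example.com/img?id\u003d42

# second pass: \u003d is resolved into an actual '=' character
second_pass = bytes(first_pass, 'ascii').decode('unicode-escape')
print(second_pass)  # https://example.com/img?id=42
```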

        Full code (the full example in the online IDE also downloads the images to a folder):

        import requests, lxml, re, json
        from bs4 import BeautifulSoup
        
        
        headers = {
            "User-Agent":
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
        }
        
        params = {
            "q": "pexels cat",
            "tbm": "isch", 
            "hl": "en",
            "ijn": "0",
        }
        
        html = requests.get("https://www.google.com/search", params=params, headers=headers)
        soup = BeautifulSoup(html.text, 'lxml')
        
        
        def get_images_data():
        
            print('\nGoogle Images Metadata:')
            for google_image in soup.select('.isv-r.PNCib.MSM1fd.BUooTd'):
                title = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['title']
                source = google_image.select_one('.fxgdke').text
                link = google_image.select_one('.VFACy.kGQAp.sMi44c.lNHeqe.WGvvNb')['href']
                print(f'{title}\n{source}\n{link}\n')
        
            # these steps could be refactored into a more compact function
            all_script_tags = soup.select('script')
        
            # https://regex101.com/r/48UZhY/4
            matched_images_data = ''.join(re.findall(r"AF_initDataCallback\(([^<]+)\);", str(all_script_tags)))
            
            # https://kodlogs.com/34776/json-decoder-jsondecodeerror-expecting-property-name-enclosed-in-double-quotes
            # if you try to json.loads() without json.dumps it will throw an error:
            # "Expecting property name enclosed in double quotes"
            matched_images_data_fix = json.dumps(matched_images_data)
            matched_images_data_json = json.loads(matched_images_data_fix)
        
            # https://regex101.com/r/pdZOnW/3
            matched_google_image_data = re.findall(r'\[\"GRID_STATE0\",null,\[\[1,\[0,\".*?\",(.*),\"All\",', matched_images_data_json)
        
            # https://regex101.com/r/NnRg27/1
            matched_google_images_thumbnails = ', '.join(
                re.findall(r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]',
                           str(matched_google_image_data))).split(', ')
        
            print('Google Image Thumbnails:')  # in order
            for fixed_google_image_thumbnail in matched_google_images_thumbnails:
                # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
                google_image_thumbnail_not_fixed = bytes(fixed_google_image_thumbnail, 'ascii').decode('unicode-escape')
        
                # after first decoding, Unicode characters are still present. After the second iteration, they were decoded.
                google_image_thumbnail = bytes(google_image_thumbnail_not_fixed, 'ascii').decode('unicode-escape')
                print(google_image_thumbnail)
        
            # removing previously matched thumbnails for easier full resolution image matches.
            removed_matched_google_images_thumbnails = re.sub(
                r'\[\"(https\:\/\/encrypted-tbn0\.gstatic\.com\/images\?.*?)\",\d+,\d+\]', '', str(matched_google_image_data))
        
            # https://regex101.com/r/fXjfb1/4
            # https://stackoverflow.com/a/19821774/15164646
            matched_google_full_resolution_images = re.findall(r"(?:'|,),\[\"(https:|http.*?)\",\d+,\d+\]",
                                                               removed_matched_google_images_thumbnails)
        
        
            print('\nDownloading Google Full Resolution Images:')  # in order
            for index, fixed_full_res_image in enumerate(matched_google_full_resolution_images):
                # https://stackoverflow.com/a/4004439/15164646 comment by Frédéric Hamidi
                original_size_img_not_fixed = bytes(fixed_full_res_image, 'ascii').decode('unicode-escape')
                original_size_img = bytes(original_size_img_not_fixed, 'ascii').decode('unicode-escape')
                print(original_size_img)
        
        
        
        get_images_data()
        
        
        -------------
        '''
        Google Images Metadata:
        9,000+ Best Cat Photos · 100% Free Download · Pexels Stock Photos
        pexels.com
        https://www.pexels.com/search/cat/
        ...
        
        Google Image Thumbnails:
        https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcR2cZsuRkkLWXOIsl9BZzbeaCcI0qav7nenDvvqi-YSm4nVJZYyljRsJZv6N5vS8hMNU_w&usqp=CAU
        ...
        
        Full Resolution Images:
        https://images.pexels.com/photos/1170986/pexels-photo-1170986.jpeg?cs=srgb&dl=pexels-evg-culture-1170986.jpg&fm=jpg
        https://images.pexels.com/photos/3777622/pexels-photo-3777622.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500
        ...
        '''
        

        Alternatively, you can use the Google Images API from SerpApi to achieve the same thing. It's a paid API with a free plan.

        The difference in your case is that you don't have to deal with regex to match and extract the data you need from the page's source code; instead, you just iterate over structured JSON and get the data you want faster.

        Code to integrate:

        import os, json # json for pretty output
        from serpapi import GoogleSearch
        
        def get_google_images():
            params = {
              "api_key": os.getenv("API_KEY"),
              "engine": "google",
              "q": "pexels cat",
              "tbm": "isch"
            }
        
            search = GoogleSearch(params)
            results = search.get_dict()
        
            print(json.dumps(results['images_results'], indent=2, ensure_ascii=False))
        
        
        get_google_images()
        
        ---------------
        '''
        [
        ...
          {
            "position": 100, # img number
            "thumbnail": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRR1FCGhFsr_qZoxPvQBDjVn17e_8bA5PB8mg&usqp=CAU",
            "source": "pexels.com",
            "title": "Close-up of Cat · Free Stock Photo",
            "link": "https://www.pexels.com/photo/close-up-of-cat-320014/",
            "original": "https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg?auto=compress&cs=tinysrgb&dpr=1&w=500",
            "is_product": false
          }
        ]
        '''
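
        Rather than dumping the whole list, you would typically iterate it and pick out the fields you need. A sketch using a hand-built dict in the shape shown above (so it runs without an API key):

```python
# stand-in for search.get_dict(); same shape as the sample output above
results = {
    "images_results": [
        {
            "title": "Close-up of Cat · Free Stock Photo",
            "source": "pexels.com",
            "original": "https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg",
        }
    ]
}

# pull out just the full-resolution links -- no regex involved
originals = [image["original"] for image in results["images_results"]]
print(originals[0])  # https://images.pexels.com/photos/2612982/pexels-photo-2612982.jpeg
```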
        

        P.S. - I wrote a more in-depth blog post about how to scrape Google Images, and about how to reduce the chance of being blocked while web scraping search engines.

        Disclaimer: I work for SerpApi.

          【Comments】:
