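【Question title】: Scraping images injected by JavaScript in Python with Selenium
【Posted】: 2016-07-17 07:19:56
【Question】:

I'm trying to build a web scraper in Python on Mac OS X. One example I'm testing with is loading the tags and images from a MyFonts page (e.g. here). I started with BeautifulSoup, but noticed that the site first loads a "blank.png" in place of the font images I'm trying to scrape, and then swaps in the "real" images with JS. I'm now trying Selenium. Can I use WebDriverWait to listen for a change in the img src, similar to the example below, but without going by ID or class?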

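from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

ff = webdriver.Firefox()
ff.get("http://www.myfonts.com/fonts/fort-foundry/gin/")
try:
    element = WebDriverWait(ff, 10).until(EC.presence_of_element_located((By.ID, "myDynamicElement")))
finally:
    ff.quit()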

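Ideally this would wait for an img whose src is not "*/blank.png", since the elements neither change class nor get a consistent name. Or should I just wait until the page has fully loaded? The scraper will have to go through a lot of these pages, so I'm trying to keep it reasonably fast.

I'm quite new to Python, so any help is much appreciated.

【Comments】:

    Tags: python selenium selenium-webdriver web-scraping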


    【Solution 1】:

    First, make sure that what you are doing is legal: Legal page

    Wait for at least one font sample to load, then go on to extract the URLs:

    # wait for at least one font sample to be loaded
    wait = WebDriverWait(ff, 10)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#overview_samples .search-result-item")))
    
    # get font sample urls
    for sample in ff.find_elements(By.CSS_SELECTOR, "#overview_samples .search-result-item .sample .fontsample[title]"):
        print(sample.get_attribute("src"))
    

    Prints:

    http://samples.myfonts.net/e_91/u/e7/19061adcc0c9ac025d0414e5ff11a1.gif
    http://samples.myfonts.net/a_91/u/e5/4d795cdae0cb99d1424b13020d0f6e.gif
    ...
    http://samples.myfonts.net/b_92/u/2c/4c21ddeb53f19f109306746dac6b24.gif
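
If you'd rather wait on the condition described in the question (the src no longer being blank.png), WebDriverWait.until accepts any callable that takes the driver, so a custom condition is one option. This is a rough sketch, not tested against the live site: real_sample_urls and wait_for_samples are made-up names, the selector is borrowed from the snippet above, and it assumes every sample gets swapped in without scrolling (otherwise the wait would simply time out).

```python
# Selector borrowed from the answer above; adjust it if the page markup differs.
SAMPLE_SELECTOR = "#overview_samples .search-result-item .sample .fontsample"


def real_sample_urls(driver, selector=SAMPLE_SELECTOR):
    """Return every sample src, or False while any is missing or still blank.png.

    WebDriverWait.until keeps polling as long as the return value is falsy,
    so this function can be passed to it directly as the wait condition.
    """
    # "css selector" is the string value behind selenium's By.CSS_SELECTOR.
    images = driver.find_elements("css selector", selector)
    urls = [img.get_attribute("src") or "" for img in images]
    if not urls:
        return False  # nothing rendered yet
    if any(url == "" or url.endswith("/blank.png") for url in urls):
        return False  # at least one placeholder has not been replaced
    return urls


def wait_for_samples(driver, timeout=10):
    """Block until all samples are real images; raises TimeoutException otherwise."""
    # Imported lazily so the predicate above can be exercised without Selenium.
    from selenium.webdriver.support.ui import WebDriverWait

    return WebDriverWait(driver, timeout).until(real_sample_urls)
```

With the ff driver from the question, urls = wait_for_samples(ff) would then replace both the wait and the extraction loop. If the site only swaps images that are scrolled into view, waiting for at least one real image (as above) is probably the safer choice.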
    

    【Comments】:

      【Solution 2】:

      I agree with Alex about the legality, but you can also get the fonts by mimicking the Ajax request with requests and bs4:

      In [16]: import requests
      
      In [17]: from bs4 import BeautifulSoup
      
      In [18]: data = {
         ....:     'seed': '24',
         ....:     "text": "Pangrams",
         ....:     "src": "pangram.auto",
         ....:     "size": "72",
         ....:     "fg": "000000",
         ....:     "bg": "ffffff",
         ....:     "goodies": "_2x:0",
         ....:     "w": "720",
         ....:     "i[]": ["fort-foundry/gin/regular,,720", "fort-foundry/gin/oblique,,720", "fort-foundry/gin/rough,,720",
         ....:             "fort-foundry/gin/rough-oblique,,720", "fort-foundry/gin/round,,720","fort-foundry/gin/round-oblique,,720",
         ....:             "fort-foundry/gin/lines,,720", "fort-foundry/gin/lines-oblique,,720"],
         ....:     "showimgs": "true"}
      
      In [19]: js = requests.post("http://www.myfonts.com/ajax-server/testdrive_new-ajax.php", data=data).json()
      
      In [20]: urls = [img["src"] for img in BeautifulSoup("".join(js.values()),"lxml").find_all("img")]
      
      In [21]: pp(urls)
      ['//samples.myfonts.net/a_91/u/af/5e840d069d35f2c8e5f7077bae7b1e.gif',
       '//samples.myfonts.net/e_91/u/d6/1d63ad993299d182ae19eddb2c41e1.gif',
       '//samples.myfonts.net/e_92/u/7c/15b8e24e4b077ae3b1c7a614afa8b5.gif',
       '//samples.myfonts.net/b_92/u/ce/63dffdda8581fc83f6fe20874714e7.gif',
       '//samples.myfonts.net/e_91/u/51/e8b7a0b5cccb2abf530b05e1d3fb04.gif',
       '//samples.myfonts.net/b_91/u/6f/a5f870c719dcf9961e753b9f4afd7e.gif',
       '//samples.myfonts.net/b_92/u/7c/94d652e4f146801e3c81f694898e07.gif',
       '//samples.myfonts.net/b_91/u/47/39fa3ab779cabd1068abbca7ce98c5.gif']
      

      The only thing you need to pass is the i[] values; the rest can be used to change the size, background color, etc.
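
In case it's unclear how a list ends up in the form body: as far as I know, requests form-encodes a list value by repeating the key once per item. The sketch below reproduces that shape with the standard library's urlencode(doseq=True) instead of calling requests itself, so treat it as an illustration of the wire format rather than of requests internals. FONT_IDS is a shortened copy of the list above and encode_form is a made-up helper name.

```python
from urllib.parse import parse_qs, urlencode

# Shortened copy of the "i[]" list from the session above.
FONT_IDS = [
    "fort-foundry/gin/regular,,720",
    "fort-foundry/gin/oblique,,720",
]


def encode_form(font_ids):
    """Form-encode the ids, repeating the "i[]" key once per list item."""
    return urlencode({"i[]": font_ids}, doseq=True)


body = encode_form(FONT_IDS)
print(body)            # percent-encoded body with one i%5B%5D=... pair per font
print(parse_qs(body))  # decoding it gives back a single "i[]" key holding the list
```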

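      So if you don't care about changing bg, fg, size, etc. and want to get all the names using only bs4 and requests, you can take the font names from the search-result-item class and use them to build the Ajax request: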

      In [1]: import requests
      
      In [2]: from bs4 import BeautifulSoup
      
      In [3]: r = requests.get("http://www.myfonts.com/fonts/fort-foundry/gin/")
      
      In [4]: soup = BeautifulSoup(r.content, "lxml")
      
      # creates "fort-foundry/gin/regular,,720" etc.
      In [5]: fonts = ["{},,720".format(a["href"].strip("/").split("/", 1)[1]) 
                         for a in soup.select("div .search-result-item h4 a[href]")]
      
      In [6]: data = {
         ...:     "i[]": fonts
         ...:      }
      
      In [7]: js = requests.post("http://www.myfonts.com/ajax-server/testdrive_new-ajax.php", data=data).json()
      
      In [8]: urls = [img["src"] for img in BeautifulSoup("".join(js.values()),"lxml").select("img[src]")]
      
      In [9]: from pprint import pprint as pp
      
      In [10]: pp(urls)
      ['//samples.myfonts.net/b_91/u/06/64bdafe9368dd401df4193a7608028.gif',
       '//samples.myfonts.net/b_92/u/06/b8ad49c563d310a97147d8220f55ab.gif',
       '//samples.myfonts.net/a_91/u/e7/8f84ce98f19e3f91ddc15304d636e7.gif',
       '//samples.myfonts.net/e_91/u/71/9769a1ab626429d63d3c779fcaa3d7.gif',
       '//samples.myfonts.net/b_92/u/65/fe416f15ea94b1f8603ddc675fd638.gif',
       '//samples.myfonts.net/b_91/u/5d/3ced9e71910bc411a0d76316d18df1.gif',
       '//samples.myfonts.net/e_92/u/cd/0df987a72bb0a43cba29b38c16b7a5.gif',
       '//samples.myfonts.net/e_91/u/88/3f80a1108fd0a075c69b09e9c21a8d.gif']
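
Note that the src values come back protocol-relative (they start with //), so they need a scheme before you can download them. One way is urljoin against the page URL, which borrows the scheme from the page. PAGE_URL and absolutize are names made up for this sketch.

```python
from urllib.parse import urljoin

# The page the samples were scraped from; its scheme is reused for the images.
PAGE_URL = "http://www.myfonts.com/fonts/fort-foundry/gin/"


def absolutize(urls, base=PAGE_URL):
    """Resolve protocol-relative or relative URLs against the page URL."""
    return [urljoin(base, url) for url in urls]


print(absolutize(["//samples.myfonts.net/b_91/u/06/64bdafe9368dd401df4193a7608028.gif"]))
# ['http://samples.myfonts.net/b_91/u/06/64bdafe9368dd401df4193a7608028.gif']
```

URLs that are already absolute pass through unchanged, so it is safe to run on a mixed list.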
      

      【Comments】:
