【Question Title】: Extract image links from a webpage using Python
【Posted】: 2012-07-06 05:16:14
【Question】:

So I want to grab all of the images (the NBA team logos) on this page: http://www.cbssports.com/nba/draft/mock-draft

However, my code gives me far more than that. It gives me,

<a href="/nba/teams/page/ORL"><img src="http://sports.cbsimg.net/images/nba/logos/30x30/ORL.png" alt="Orlando Magic" width="30" height="30" border="0" /></a>

How can I narrow this down so that it gives me only http://sports.cbsimg.net/images/nba/logos/30x30/ORL.png?

My code:

import urllib2
from BeautifulSoup import BeautifulSoup
# or if you're using BeautifulSoup4:
# from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.cbssports.com/nba/draft/mock-draft').read())

rows = soup.findAll("table", attrs = {'class': 'data borderTop'})[0].tbody.findAll("tr")[2:]

for row in rows:
  fields = row.findAll("td")
  if len(fields) >= 3:
    anchor = row.findAll("td")[1].find("a")
    if anchor:
      print anchor

【Question Discussion】:

Tags: python image python-2.7 web-scraping


【Solution 1】:

You can use these two functions to get a list of all image URLs found in a page.

import re

import requests


#
# get_url_images_in_text()
#
# @param html - the HTML to extract image URLs from.
# @param protocol - the scheme (e.g. "http:") to prepend to
#                   protocol-relative URLs such as "//host/img.png".
#
# @return list of image URLs.
#
def get_url_images_in_text(html, protocol):
    urls = []
    all_urls = re.findall(r'((http\:|https\:)?\/\/[^"\' ]*?\.(png|jpg))', html, flags=re.IGNORECASE | re.MULTILINE | re.UNICODE)
    for url in all_urls:
        if not url[0].startswith("http"):
            urls.append(protocol + url[0])
        else:
            urls.append(url[0])

    return urls

#
# get_images_from_url()
#
# @param url - the URL of the page to extract image URLs from.
#
# @return list of image URLs.
#
def get_images_from_url(url):
    protocol = url.split('/')[0]  # e.g. "http:" from "http://example.com/page"
    resp = requests.get(url)
    return get_url_images_in_text(resp.text, protocol)
    
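As a quick sanity check, the same regex can be exercised on an inline snippet, with no network access needed (a sketch; the HTML string below is made up to mimic the markup from the question):

```python
import re

# made-up snippet mimicking one anchor/img pair from the question's page
html = '<a href="/nba/teams/page/ORL"><img src="http://sports.cbsimg.net/images/nba/logos/30x30/ORL.png" /></a>'

# re.findall returns one tuple per match because the pattern has groups;
# the full URL is the first group of each tuple
matches = re.findall(r'((http\:|https\:)?\/\/[^"\' ]*?\.(png|jpg))', html, flags=re.IGNORECASE)
urls = [m[0] for m in matches]
print(urls)
```

This is exactly the transformation the question asks for: the full tag in, the bare image URL out.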

【Discussion】:

【Solution 2】:

To save all of the images on http://www.cbssports.com/nba/draft/mock-draft:

import urllib2
import os
from BeautifulSoup import BeautifulSoup

URL = "http://www.cbssports.com/nba/draft/mock-draft"
# save everything into the user's ~/Pictures directory
default_dir = os.path.join(os.path.expanduser("~"), "Pictures")
opener = urllib2.build_opener()
urllib2.install_opener(opener)
soup = BeautifulSoup(urllib2.urlopen(URL).read())
# only <img> tags that carry both alt and src attributes
imgs = soup.findAll("img", {"alt": True, "src": True})
for img in imgs:
    img_url = img["src"]
    # name the local file after the last path component of the URL
    filename = os.path.join(default_dir, img_url.split("/")[-1])
    img_data = opener.open(img_url)
    f = open(filename, "wb")
    f.write(img_data.read())
    f.close()
      

To save any one specific image on http://www.cbssports.com/nba/draft/mock-draft, use

soup.find("img", {"src": "image_name_from_source"})
      
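With the src values collected, the narrower goal from the question (only the 30x30 team logos) comes down to filtering on the path (a sketch; the sample list below is made up, standing in for the src values the loop above would see):

```python
# made-up sample of src attributes collected from <img> tags
srcs = [
    "http://sports.cbsimg.net/images/nba/logos/30x30/ORL.png",
    "http://sports.cbsimg.net/images/shd/photos/97x97/banner.jpg",
    "http://sports.cbsimg.net/images/nba/logos/30x30/CHA.png",
]

# keep only the team-logo images by their distinctive path segment
logos = [s for s in srcs if "/nba/logos/30x30/" in s]
print(logos)
```

The same `in`-test could be dropped straight into the download loop to skip non-logo images.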

【Discussion】:

【Solution 3】:

I know this may sound "traumatic", but for auto-generated pages like this one, where you just want to grab those damn images and never come back, a quick regular expression for the pattern you need tends to be my choice (not depending on Beautiful Soup is a big plus):

import urllib, re

source = urllib.urlopen('http://www.cbssports.com/nba/draft/mock-draft').read()

## every image name is an abbreviation composed of capital letters, so...
for link in re.findall(r'http://sports\.cbsimg\.net/images/nba/logos/30x30/[A-Z]*\.png', source):
    print link

    ## the code above just prints the link;
    ## if you want to actually download, set the flag below to True
    actually_download = False
    if actually_download:
        filename = link.split('/')[-1]
        urllib.urlretrieve(link, filename)
        

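One caveat with the regex approach: a mock-draft page may list the same team more than once, so `re.findall` can return duplicate links. Deduplicating while preserving first-seen order avoids downloading the same file twice (a sketch; the sample list below is made up):

```python
# made-up sample of links, as re.findall might return them
links = [
    "http://sports.cbsimg.net/images/nba/logos/30x30/ORL.png",
    "http://sports.cbsimg.net/images/nba/logos/30x30/CHA.png",
    "http://sports.cbsimg.net/images/nba/logos/30x30/ORL.png",
]

# keep the first occurrence of each link, in order
seen = set()
unique = []
for link in links:
    if link not in seen:
        seen.add(link)
        unique.append(link)
print(unique)
```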
Hope this helps!

【Discussion】:
