【Title】: How to handle it when a webpage does not load in BeautifulSoup
【Posted】: 2011-10-27 21:02:02
【Question】:

Currently, if an error occurs while retrieving a webpage, soup is not populated with the page; it just gets BeautifulSoup's default return value.

I'm looking for a way to check for this, so that if an error occurs while fetching the page I can skip a large chunk of code, e.g.

if soup:
  do stuff

but without terminating the program altogether. Apologies for the newbie question.

import urllib2, BaseHTTPServer

def getwebpage(address):
  try:
      user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
      headers = { 'User-Agent' : user_agent }
      req = urllib2.Request(address, None, headers)
      web_handle = urllib2.urlopen(req)
  except urllib2.HTTPError, e:
      error_desc = BaseHTTPServer.BaseHTTPRequestHandler.responses[e.code][0]
      appendlog('HTTP Error: ' + str(e.code) + ': ' + address)
      return
  except urllib2.URLError, e:
      appendlog('URL Error: ' + e.reason[1] + ': ' + address)
      return
  except:
      appendlog('Unknown Error: ' + address)
      return
  return web_handle


def test():
  soup = BeautifulSoup(getwebpage('http://doesnotexistblah.com/'))
  print soup

  if soup:
    do stuff

test()
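
The pattern being asked for — return `None` on failure and test before parsing — looks like this in modern Python 3, where `urllib.request`/`urllib.error` replace `urllib2` (a hedged sketch; the function name, timeout, and messages are illustrative, not part of the original post):

```python
import urllib.request
import urllib.error

def get_page(address, timeout=10):
    """Return the page body as bytes, or None if the fetch fails."""
    req = urllib.request.Request(
        address,
        headers={'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read()
    except urllib.error.HTTPError as e:   # server answered with an error status
        print('HTTP Error: %d: %s' % (e.code, address))
    except urllib.error.URLError as e:    # DNS failure, refused connection, ...
        print('URL Error: %s: %s' % (e.reason, address))
    return None

html = get_page('http://doesnotexistblah.invalid/')
if html is not None:
    pass  # only here would BeautifulSoup(html) and the "do stuff" block run
else:
    print('fetch failed, skipping')
```

Note that `HTTPError` must be caught before `URLError`, since it is a subclass of it.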

【Comments】:

    Tags: python beautifulsoup


    【Solution 1】:

    Structure the code so that one function encapsulates the whole process of retrieving the data from a URL, and another encapsulates the processing of that data:

    import urllib2, httplib
    from BeautifulSoup import BeautifulSoup
    
    def append_log(message):
        print message
    
    def get_web_page(address):
        try:
            user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
            headers = { 'User-Agent' : user_agent }
            request = urllib2.Request(address, None, headers)
            response = urllib2.urlopen(request, timeout=20)
            try:
                return response.read()
            finally:
                response.close()
        except urllib2.HTTPError as e:
            error_desc = httplib.responses.get(e.code, '')
            append_log('HTTP Error: ' + str(e.code) + ': ' +
                      error_desc + ': ' + address)
        except urllib2.URLError as e:
            append_log('URL Error: ' + str(e.reason) + ': ' + address)
        except Exception as e:
            append_log('Unknown Error: ' + str(e) + ': ' + address)
    
    def process_web_page(data):
        if data is not None:
            print BeautifulSoup(data)
        else:
            pass # do something else
    
    data = get_web_page('http://doesnotexistblah.com/')
    process_web_page(data)
    
    data = get_web_page('http://docs.python.org/copyright.html')
    process_web_page(data)
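
    A side note on the error-description lookup above: `httplib.responses` is a plain dict mapping status codes to reason phrases, which is why `.get(e.code, '')` is safer than the question's `BaseHTTPServer.BaseHTTPRequestHandler.responses[e.code][0]` indexing. In Python 3 the same table lives in `http.client.responses`:

```python
from http.client import responses  # Python 3 home of Python 2's httplib.responses

print(responses[404])                  # 'Not Found'
print(responses.get(999, 'Unknown'))   # .get() avoids a KeyError for odd codes
```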
    

    【Discussion】:

      【Solution 2】:
      soup = getwebpage('http://doesnotexistblah.com/')
      if soup is not None:
          soup = BeautifulSoup(soup)
      

      Is this what you want?

      【Discussion】:

      • Yes and no. It is what I want, but soup is never None, even when fed a bad address...
      • When you just `return` from getwebpage, the result is None
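
      The comment above is the crux: as the questioner observed, constructing a soup always yields an object, even from `None` or an empty string, so `if soup:` after parsing never detects the failed fetch; the `is None` test has to run on `getwebpage`'s raw return value before parsing. A stand-in class (hypothetical, used here only to avoid the BeautifulSoup dependency) shows the difference:

```python
class FakeSoup:
    """Hypothetical stand-in for BeautifulSoup: constructing one
    always yields an object, whatever markup it was given."""
    def __init__(self, markup):
        self.markup = markup or ''

page = None                 # what getwebpage() returns on a failed fetch
soup = FakeSoup(page)       # an object exists even though the fetch failed
print(soup is None)         # False -> `if soup:` cannot see the failure

# correct order: test the raw return value first, parse second
if page is not None:
    soup = FakeSoup(page)
else:
    print('fetch failed, skipping')
```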