[Question Title]: What is the best way to handle a bad link given to BeautifulSoup?
[Posted]: 2010-10-01 23:01:56
[Question]:

I'm doing something that pulls URLs from delicious and then uses those URLs to discover related feeds.

However, some of the bookmarks in delicious are not HTML links, which causes BeautifulSoup to choke. Basically, if BeautifulSoup fetches a link and it doesn't look like HTML, I'd like to discard it.

Right now, this is what I'm getting:

trillian:Documents jauderho$ ./d2o.py "green data center" 
processing http://www.greenm3.com/
processing http://www.eweek.com/c/a/Green-IT/How-to-Create-an-EnergyEfficient-Green-Data-Center/?kc=rss
Traceback (most recent call last):
  File "./d2o.py", line 53, in <module>
    get_feed_links(d_links)
  File "./d2o.py", line 43, in get_feed_links
    soup = BeautifulSoup(html)
  File "/Library/Python/2.5/site-packages/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/Library/Python/2.5/site-packages/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/Library/Python/2.5/site-packages/BeautifulSoup.py", line 1263, in _feed
    self.builder.feed(markup)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 314, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: u'</b  />', at line 739, column 1

Update:

Yehia's answer did the trick. For reference, here's some code for getting the content type:

import urllib

def check_for_html(link):
    # Fetch the URL and return the Content-Type header (e.g. 'text/html')
    out = urllib.urlopen(link)
    return out.info().getheader('Content-Type')
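
The header value usually carries a charset suffix (e.g. `text/html; charset=utf-8`), so it's safer to strip that before comparing. A minimal sketch of such a check (the helper name is my own, and I'm assuming the two standard HTML MIME types):

```python
def is_html_content_type(content_type):
    """True if a Content-Type header value looks like HTML."""
    if not content_type:
        return False
    # Drop any '; charset=...' parameter, then compare the bare MIME type
    mime = content_type.split(';')[0].strip().lower()
    return mime in ('text/html', 'application/xhtml+xml')
```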

[Comments]:

    Tags: python parsing beautifulsoup


    [Solution 1]:

    I simply wrap my BeautifulSoup processing and look for the HTMLParser.HTMLParseError exception:

    import HTMLParser, BeautifulSoup
    try:
        soup = BeautifulSoup.BeautifulSoup(raw_html)
        for a in soup.findAll('a'):
            href = a['href']
            ....
    except HTMLParser.HTMLParseError:
        print "failed to parse", url
    

    But beyond that, you can check the Content-Type of the response when you crawl a page, and make sure it's something like text/html, application/xml+xhtml, or similar before you even try to parse it. That should head off most of the errors.
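
Putting the two guards together, the crawl loop could be sketched like this (`fetch` and `parse` are hypothetical stand-ins for the real urllib fetch and BeautifulSoup call, injected so the skipping logic is the only thing shown):

```python
HTML_TYPES = ('text/html', 'application/xhtml+xml')

def gather_links(urls, fetch, parse):
    """Skip non-HTML responses and swallow parse failures.

    fetch(url) -> (content_type, body); parse(body) -> iterable of links.
    Both callables are assumptions standing in for the HTTP and parsing layers.
    """
    links = []
    for url in urls:
        content_type, body = fetch(url)
        if content_type.split(';')[0].strip().lower() not in HTML_TYPES:
            continue  # bookmark points at a PDF, image, etc. -- discard it
        try:
            links.extend(parse(body))
        except Exception:  # e.g. HTMLParser.HTMLParseError in old BeautifulSoup
            continue
    return links
```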

    [Comments]:
