【Posted】: 2017-09-03 10:01:00
【Problem description】:
I am trying to crawl a web page and extract URLs down to level 3 at most. My code is as follows:
import lxml.html
import urllib.request
from bs4 import BeautifulSoup

stopLevel = 3
rootUrls = ['http://ps.ucdavis.edu/']

foundUrls = {}
for rootUrl in rootUrls:
    foundUrls.update({rootUrl: {'Level': 0, 'Parent': 'N/A'}})

def getProtocolAndDomainName(url):
    # split the url at '://' and get back a list
    protocolAndOther = url.split('://')
    protocol = protocolAndOther[0]
    domainName = protocolAndOther[1].split('/')[0]
    # this only returns 'https://xxxxx.com'
    return protocol + '://' + domainName
def crawl(urls, stopLevel=5, level=1):
    nextUrls = []
    if level <= stopLevel:
        for url in urls:
            # need to handle urls (e.g., https) that cannot be read
            try:
                openedUrl = urllib.request.urlopen(url).read()
                soup = BeautifulSoup(openedUrl, 'html.parser')
            except:
                print('cannot read for: ' + url)
            for a in soup.find_all('a', href=True):
                href = a['href']
                if href is not None:
                    # for the case where the link is a relative path
                    if '://' not in href:
                        href = getProtocolAndDomainName(url) + href
                    # check whether the url has already been visited
                    if href not in foundUrls:
                        foundUrls.update({href: {'Level': level,
                                                 'Parent': url}})
                        nextUrls.append(href)
        # recursive call
        crawl(nextUrls, stopLevel, level + 1)

crawl(rootUrls, stopLevel)
print(foundUrls)
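As an aside on the relative-path handling above: prepending `getProtocolAndDomainName(url)` only works for hrefs that start with `/`; a directory-relative href such as `other.html` would be glued directly onto the domain. The standard-library `urllib.parse.urljoin` resolves both cases correctly (a minimal sketch; the path segments in the base URL are made up for illustration):

```python
from urllib.parse import urljoin

# illustrative base URL (the /dir/page.html path is hypothetical)
base = 'http://ps.ucdavis.edu/dir/page.html'

print(urljoin(base, '/about'))       # → http://ps.ucdavis.edu/about
print(urljoin(base, 'other.html'))   # → http://ps.ucdavis.edu/dir/other.html
# absolute links pass through unchanged
print(urljoin(base, 'https://example.com/x'))  # → https://example.com/x
```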
After running the code, it shows the error UnboundLocalError: local variable 'soup' referenced before assignment. I understand this happens because BeautifulSoup fails to parse openedUrl, so the local variable soup is never defined, which in turn makes the loop fail. My first idea was to declare soup as global by putting global soup right under def crawl(urls, stopLevel=5, level=1):, but I was told this does not solve the problem at all. My second idea was to use if...continue to keep the loop running when BeautifulSoup fails to parse, but the problem I now face is that whether I test if soup == '' or if soup == None, it still does not work. I would like to know what value BeautifulSoup returns when it fails. Can anyone help? Or does anyone have another solution? Many thanks.
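For reference, BeautifulSoup does not return a sentinel value on failure here: the exception is raised by urllib.request.urlopen, and execution then falls past the except block with soup never assigned, hence the UnboundLocalError. One fix is to continue inside the except clause so the rest of the loop body only runs when the fetch succeeded. A minimal sketch of that pattern (the crawl_level helper is hypothetical, and the BeautifulSoup parsing step is left as a comment so the sketch stays self-contained):

```python
import urllib.request

def crawl_level(urls):
    """Fetch each URL, skipping any that cannot be read."""
    pages = []
    for url in urls:
        try:
            html = urllib.request.urlopen(url).read()
        except Exception:
            # skip this URL instead of falling through to code that
            # would reference a variable that was never assigned
            print('cannot read for: ' + url)
            continue
        pages.append((url, html))  # parse html with BeautifulSoup here
    return pages
```

A malformed URL such as 'not-a-url' makes urlopen raise, the loop simply moves on to the next URL, and no unassigned variable is ever referenced.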
【Discussion】:
Tags: python-3.x beautifulsoup web-crawler