Python - Passing URLs vs. an HTTPResponse object: answers

【Question title】: Python - Passing urls vs. an HTTPResponse object
【Posted】: 2012-03-27 22:01:14
【Question description】:

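I have a list of URLs from which I want to extract attributes. I'm new to Python, so please bear with me. Windows 7, 64-bit. Python 3.2.

The following code works. pblist is a list of dicts, each containing the key 'short_url'.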

for j in pblist[0:10]:
    base_url = j['short_url']
    if hasattr(BeautifulSoup(urllib.request.urlopen(base_url)), 'head') and \
        hasattr(BeautifulSoup(urllib.request.urlopen(base_url)).head, 'title'):
            print("Has head, title attributes.")
            try:
                j['title'] = BeautifulSoup(urllib.request.urlopen(base_url)).head.title.string.encode('utf-8')
            except AttributeError:
                print("Encountered attribute error on page, ", base_url)
                j['title'] = "Attribute error."
                pass

The following code does not work: it reports, for example, that the BeautifulSoup object has no head and title attributes.

for j in pblist[0:10]:
        base_url = j['short_url']
        page = urllib.request.urlopen(base_url)
        if hasattr(BeautifulSoup(page), 'head') and \
            hasattr(BeautifulSoup(page).head, 'title'):
                print("Has head, title attributes.")
                try:
                    j['title'] = BeautifulSoup(urllib.request.urlopen(base_url)).head.title.string.encode('utf-8')
                except AttributeError:
                    print("Encountered attribute error on page, ", base_url)
                    j['title'] = "Attribute error."
                    pass

Why? What is the difference between calling urllib.request.urlopen on the URL inside each BeautifulSoup call and passing BeautifulSoup the HTTPResponse object that urllib.request.urlopen returned earlier?

【Question comments】:

Tags: python url beautifulsoup

【Solution 1】:

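The response returned by urlopen() is a file-like object, which means it behaves like a one-shot stream: once you have read it to the end, further reads return no data. Some file-like objects can be rewound with seek(), but an HTTP response cannot, so the only way to get the data again is to open the URL again.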

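So in the second version, the first call to BeautifulSoup(page) reads all the data from page, and the later calls have nothing left to read. The first version works only because it opens the URL again for every BeautifulSoup call.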
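
The effect can be reproduced without a network connection. The sketch below is only an illustration: it assumes a hypothetical page embedded in a data: URL (which urllib.request can open on Python 3.4 and later), and the names HTML, DATA_URL and read_twice are made up for this example.

```python
import urllib.parse
import urllib.request

# Hypothetical page content, embedded in a data: URL so no network is needed.
HTML = "<html><head><title>Example</title></head><body></body></html>"
DATA_URL = "data:text/html," + urllib.parse.quote(HTML)


def read_twice(url):
    """Open url once and call read() on the same response object twice."""
    with urllib.request.urlopen(url) as page:
        first = page.read()
        second = page.read()  # the stream is already at its end
    return first, second


first, second = read_twice(DATA_URL)
print(len(first) == len(HTML))  # True
print(second)                   # b''
```

The second read() returns an empty bytes object, which is what the second and third BeautifulSoup(page) calls in the question receive.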

Instead, you could read the content once and reuse it:

page = urllib.request.urlopen(base_url)
page_content = page.read()
# ...
BeautifulSoup(page_content)
# ...
BeautifulSoup(page_content)
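
As a rough, runnable version of the same idea, the sketch below reads the body once into bytes and then uses it twice. It assumes the same kind of hypothetical data: URL as a stand-in for a real page, and it uses plain substring checks in place of BeautifulSoup so that it has no third-party dependencies; fetch_bytes is an illustrative name, not part of any library.

```python
import urllib.parse
import urllib.request

# Hypothetical page served from a data: URL so the sketch runs offline.
HTML = "<html><head><title>Example</title></head></html>"
DATA_URL = "data:text/html," + urllib.parse.quote(HTML)


def fetch_bytes(url):
    """Read the whole response body once and return it as bytes."""
    with urllib.request.urlopen(url) as page:
        return page.read()


page_content = fetch_bytes(DATA_URL)

# A bytes object is not consumed by use, so both checks see the full document.
has_head = b"<head>" in page_content
has_title = b"<title>" in page_content
print(has_head, has_title)  # True True
```

Unlike the response object, page_content can be handed to as many consumers as needed.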

But even that is somewhat inefficient, because the same HTML is parsed more than once. Why not create a single BeautifulSoup object and reuse it?

page = urllib.request.urlopen(base_url)
soup = BeautifulSoup(page)
# ...
# do something with soup
# ...
# do something with soup

Your code, modified to use a single soup object:

for j in pblist[0:10]:
    base_url = j['short_url']
    page = urllib.request.urlopen(base_url)
    soup = BeautifulSoup(page)  # parse the response exactly once
    if hasattr(soup, 'head') and hasattr(soup.head, 'title'):
        print("Has head, title attributes.")
        try:
            j['title'] = soup.head.title.string.encode('utf-8')
        except AttributeError:
            print("Encountered attribute error on page, ", base_url)
            j['title'] = "Attribute error."
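
If you want to go a step further, the hasattr checks and the try/except could be folded into one small helper. This is only a sketch under assumptions: safe_title is a hypothetical name, and it relies on BeautifulSoup returning None for a tag it cannot find. The SimpleNamespace objects below are stand-ins so the example runs without BeautifulSoup installed; in real use you would pass the soup object.

```python
from types import SimpleNamespace


def safe_title(soup, default="Attribute error."):
    """Return soup.head.title.string, or default if any step is missing or None."""
    node = soup
    for name in ("head", "title", "string"):
        node = getattr(node, name, None)
        if node is None:
            return default
    return node


# Stand-ins for parsed pages (hypothetical, not real BeautifulSoup objects).
with_title = SimpleNamespace(
    head=SimpleNamespace(title=SimpleNamespace(string="Example"))
)
no_head = SimpleNamespace(head=None)

print(safe_title(with_title))  # Example
print(safe_title(no_head))     # Attribute error.
```

Inside the loop this would become j['title'] = safe_title(soup), with .encode('utf-8') applied afterwards if you still need bytes.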

【Comments】: