What does read() in urlopen('http.....').read() do? [urllib]

【Question title】: What does read() in urlopen('http.....').read() do? [urllib]
【Posted】: 2016-06-22 04:12:22
【Question】:

Hi, I am reading "Web Scraping with Python (2015)". I have seen the following two ways of opening a URL, one with .read() and one without. See bs1 and bs2:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html')
bs1 = BeautifulSoup(html.read(), 'html.parser')

html = urlopen('http://web.stanford.edu/~zlotnick/TextAsData/Web_Scraping_with_Beautiful_Soup.html')
bs2 = BeautifulSoup(html, 'html.parser')

bs1 == bs2  # True


print(bs1.prettify()[0:100])
print(bs2.prettify()[0:100]) # prints same thing

So is .read() redundant? Thanks.

Code from Web Scraping with Python, p. 7 (with .read()):

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
bsObj = BeautifulSoup(html.read())

Code from p. 15 (without .read()):

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html)

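【Comments】:

In addition to the answers above, I would suggest trying the requests library for HTTP requests: docs.python-requests.org/en/latest. It gives you more control over the HTTP response.
Thanks @A.Romeu. Could you recommend some posts for more information? In my next step I need to tweak a form and fetch the response page, and I plan to use mechanize for that.
The link I sent you has plenty of information on how to use it, in the "User Guide" section. You can start directly with docs.python-requests.org/en/latest/user/quickstart/…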

Tags: python beautifulsoup urllib

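【Solution 1】:

Quoting the BS docs:

To parse a document, pass it into the BeautifulSoup constructor. You can pass in a string or an open filehandle:

When you call .read(), you are using the "string" interface. When you don't, you are using the "filehandle" interface.

In practice both work the same way (although BS4 might read a file-like object lazily). In your case, the whole content is first read into a string object (a bytes object in Python 3), which may consume more memory than necessary.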
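To make the two interfaces concrete, here is a minimal sketch of how a constructor can accept either form. The helper `load_markup` is hypothetical and only illustrates the duck-typing idea; it is not presented as BeautifulSoup's actual code, and `io.BytesIO` stands in for the object returned by urlopen.

```python
import io


def load_markup(markup):
    """Return the raw document from either a string/bytes value or a file-like object."""
    # Hypothetical helper: anything exposing .read() is treated as an open file handle.
    if hasattr(markup, "read"):
        return markup.read()
    # Otherwise assume the caller already passed the document itself.
    return markup


document = b"<html><body><p>Hello</p></body></html>"

from_string = load_markup(document)              # "string" interface
from_handle = load_markup(io.BytesIO(document))  # "filehandle" interface

# Both paths end up with the same bytes.
print(from_string == from_handle)
```

Under this assumption, calling .read() yourself only moves the read from inside the constructor to the call site.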

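【Comments】:

【Solution 2】:

urllib.request.urlopen returns a file-like object; its read method returns the response body for that URL.

The BeautifulSoup constructor accepts either a string or an open filehandle, so yes, read() is redundant here.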
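A small sketch of that file-like behaviour, using only the standard library. As an assumption for the sake of running offline, a `data:` URL stands in for a real http:// page; the markup in it is made up for illustration.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: a data: URL replaces a real web page so no network access is needed.
url = "data:text/html," + quote("<p>Hello</p>")

with urlopen(url) as response:
    body = response.read()      # the entire response body, as bytes
    leftover = response.read()  # the stream has already been consumed

print(body)
print(leftover)
```

Because a response can only be read once, the second read() returns nothing. That is presumably why the question calls urlopen twice, once for bs1 and once for bs2.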

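【Comments】:

【Solution 3】:

Without the BeautifulSoup module

.read() is useful when you are not using the BeautifulSoup module, so in that case it is not redundant. Only by calling .read() do you get the HTML content; without it you just have the object returned by urlopen().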
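For the case without BeautifulSoup, a minimal sketch of turning the response into text might look like this. The `data:` URL and its content are assumptions so the example runs offline, and falling back to UTF-8 when no charset is declared is a choice made for this sketch, not a rule.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: a data: URL stands in for a real page such as the ones in the question.
url = "data:text/html;charset=utf-8," + quote("<h1>War and Peace</h1>")

with urlopen(url) as response:
    raw = response.read()  # bytes, not str
    # Use the charset declared in the Content-Type header, else fall back to UTF-8.
    charset = response.headers.get_content_charset() or "utf-8"

html = raw.decode(charset)  # decode the bytes into text you can search or print
print(html)
```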

With the BeautifulSoup module

The BeautifulSoup constructor handles both cases: it accepts a string, and it also accepts the object returned by urlopen(some-site).

【Comments】: