How to pass in the source HTML code into BeautifulSoup? Answers

【Question title】: How to pass in the source HTML code into BeautifulSoup?
【Posted】: 2020-07-20 08:25:52
【Question】:

I want to scrape the results of a search on a website. The search term appears in the URL, so I just import urllib.request and run

source = urllib.request.urlopen('https://....').read()

Then I pass it to the BeautifulSoup constructor

soup = BeautifulSoup(source)

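I want to call find_all('div') to get the div tags. However, it seems you can only pass HTML code into the BeautifulSoup constructor. urllib.request.urlopen('https://...').read() seems to return the page source, not the HTML shown by Inspect Element. How can I pass the Inspect Element HTML into the BeautifulSoup constructor?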

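【Comments】:

"However, it seems you can only pass HTML code into the BeautifulSoup constructor." Yes, because BeautifulSoup is an HTML parser, that is expected. What do you mean by "inspect element"? Please clarify your question; see How to Ask and the help center.
Why didn't you read the documentation or a decent tutorial before starting?
When you right-click a page, you can choose either View Page Source or Inspect. The HTML you see when inspecting is what I am interested in.
@Pedro Lobito Because the documentation does not really specify the difference between the kinds of HTML code passed to BeautifulSoup. The tutorial link is good, but passing requests.get(url).content to BeautifulSoup must be causing me problems, because my URL has a large number of div tags, yet find_all('div') gives me nothing. When I run requests.get(url).content, it shows a bunch of JavaScript code. I just want the plain HTML, like what you see when inspecting the page.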

Tags: python web-scraping beautifulsoup urllib

【Solution 1】:

The BeautifulSoup constructor takes two string arguments:

1. The HTML string to parse.
2. (Optional) the name of the parser.

Source: http://www.compjour.org/warmups/govt-text-releases/intro-to-bs4-lxml-parsing-wh-press-briefings/
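
The two arguments described above can be illustrated with a small, self-contained sketch. The HTML string here is a made-up stand-in for a downloaded page, and 'html.parser' is Python's built-in parser; other parsers such as lxml would need to be installed separately.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
html = "<html><body><div>one</div><div>two</div><p>other</p></body></html>"

# First argument: the markup to parse. Second argument: the parser name.
soup = BeautifulSoup(html, "html.parser")

# Searching happens on the parsed object, not in the constructor.
div_texts = [div.get_text() for div in soup.find_all("div")]
print(div_texts)  # ['one', 'two']
```

The constructor only builds the parse tree; any filtering by tag name is done afterwards with find_all.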

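You cannot pass a search value to the constructor; just use findAll (spelled find_all in current bs4), as mentioned in my previous answer.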

Find a specific tag with BeautifulSoup

Edit: after reading your comments, this is what I think you are looking for:

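import urllib.request

from bs4 import BeautifulSoup

# Download the page source (raw bytes) and parse it with the built-in parser.
html_doc = urllib.request.urlopen('https://....').read()
soup = BeautifulSoup(html_doc, 'html.parser')

# Print the parsed document with one tag per line.
print(soup.prettify())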

See: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
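
A likely explanation for the empty find_all('div') result described in the comments is that the result markup is created by JavaScript in the browser, so it never appears in the downloaded page source. The following is a minimal sketch under that assumption; the page_source string, the "results" id and the "hit" class are hypothetical and do not come from the asker's actual page.

```python
from bs4 import BeautifulSoup

# Hypothetical page source: the static markup holds one empty container <div>.
# The result <div>s would only exist after a browser runs the script.
page_source = """
<html>
  <body>
    <div id="results"></div>
    <script>
      document.getElementById("results").innerHTML =
        "<div class='hit'>first</div><div class='hit'>second</div>";
    </script>
  </body>
</html>
"""

soup = BeautifulSoup(page_source, "html.parser")

# BeautifulSoup parses the markup but never executes the script.
static_divs = soup.find_all("div")          # only the empty container
hits = soup.find_all("div", class_="hit")   # nothing: these live in JS text
script_text = soup.script.string            # the script body, kept as plain text

print(len(static_divs), len(hits))
print("hit" in script_text)
```

The result divs exist only as text inside the script, which is why the parsed tree does not contain them. Getting the HTML shown by the Inspect panel generally means either running the page's JavaScript in a real or headless browser, or requesting the data from whatever endpoint the script itself calls; BeautifulSoup does neither.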

【Discussion】: