Answers: Scrape using Beautiful Soup preserving &nbsp; entities

【Question title】: Scrape using Beautiful Soup preserving &nbsp; entities
【Posted】: 2023-03-29 13:19:01
【Question】:

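I would like to scrape a table from the web and keep the &nbsp; entities intact so that I can republish it as HTML later. BeautifulSoup seems to be converting them to spaces. Example: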

from bs4 import BeautifulSoup

html = "<html><body><table><tr>"
html += "<td>&nbsp;hello&nbsp;</td>"
html += "</tr></table></body></html>"

soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
table = soup.find_all('table')[0]
row = table.find_all('tr')[0]
cell = row.find_all('td')[0]

print(cell)

Observed result:

<td> hello </td>

Desired result:

<td>&nbsp;hello&nbsp;</td>


Tags: python web-scraping beautifulsoup html-parsing html-entities

【Solution 1】:

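In bs4, the convertEntities parameter of the BeautifulSoup constructor is no longer supported. HTML entities are always converted into the corresponding Unicode characters when the document is parsed (see the docs).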
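
To see what that conversion means in practice, here is a minimal sketch. It assumes bs4 is installed and uses the built-in "html.parser"; the variable names are only for illustration. After parsing, the cell text no longer contains the entity but a single U+00A0 character, which most terminals show as an ordinary space.

```python
import unicodedata

from bs4 import BeautifulSoup

# Parse a single cell; "html.parser" is the parser bundled with Python.
soup = BeautifulSoup("<td>&nbsp;hello&nbsp;</td>", "html.parser")
text = soup.td.string

# The entity has been replaced by one Unicode code point during parsing.
first_char = text[0]
print(repr(text))
print(unicodedata.name(first_char))
```

So the "spaces" in the observed output are really non-breaking spaces; the information is still there, it is just no longer spelled as an entity.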

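According to the docs, you need to use an output formatter to turn those characters back into entities on output, like this: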

print(soup.find_all('td')[0].prettify(formatter="html"))
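
If the extra line breaks and indentation that prettify() adds are unwanted, the same formatter can probably be passed to decode() instead. The following is a sketch under that assumption, written for Python 3 with bs4 installed; default_markup and entity_markup are names made up for this example.

```python
from bs4 import BeautifulSoup

html = "<html><body><table><tr>"
html += "<td>&nbsp;hello&nbsp;</td>"
html += "</tr></table></body></html>"

soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td")

# Default serialization keeps the U+00A0 characters as they are.
default_markup = str(cell)

# formatter="html" asks Beautiful Soup to write named HTML entities
# wherever one exists for a character.
entity_markup = cell.decode(formatter="html")

print(repr(default_markup))
print(entity_markup)
```

Note that the formatter works from the parsed characters, not from the original source text, so a literal non-breaking space in the input would also come out as &nbsp;.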
