Answers: 'charmap' codec can't encode character error in Python while parsing HTML

【Question title】: 'charmap' codec can't encode character error in Python while parsing HTML
【Posted】: 2016-11-14 06:49:24
【Question】:

This is my code:

dataFile = open('dataFile.html', 'w')
res = requests.get('site/pm=' + str(i))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
linkElems = soup.select('#content')
dataFile.write(str(linkElems[0]))

I have some other code as well, but this is the part I believe is causing the problem. I have also tried:

dataFile.write(str(linkElems[0].decode('utf-8')))

But that doesn't work and raises an error.

Using dataFile = open('dataFile.html', 'wb') gives the error:

a bytes-like object is required, not 'str'

【Comments】:

Tags: python beautifulsoup python-3.5

【Solution 1】:

You opened the text file without specifying an encoding:

dataFile = open('dataFile.html', 'w')

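This tells Python to use the system's default codec. Every Unicode string you try to write is encoded with that codec, and your Windows system is not set up with UTF-8 as the default.

Specify the encoding explicitly: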

dataFile = open('dataFile.html', 'w', encoding='utf8')
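As a minimal sketch of the difference, the snippet below first encodes a string containing U+2212 with cp1252, then writes the same string to a file opened with an explicit encoding. cp1252 is only assumed here as a typical Windows default (your system's codec may differ), and the temporary path and variable names are illustrative.

```python
import os
import tempfile

text = "5 \u2212 3"  # contains U+2212 MINUS SIGN, which cp1252 cannot represent

# Assumption: cp1252 stands in for the Windows default codec.
try:
    text.encode("cp1252")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc  # the same kind of error the question reports

# With an explicit UTF-8 encoding the write succeeds.
path = os.path.join(tempfile.mkdtemp(), "dataFile.html")
with open(path, "w", encoding="utf8") as data_file:
    data_file.write(text)

# Read the file back with the same encoding to confirm nothing was lost.
with open(path, encoding="utf8") as data_file:
    round_trip = data_file.read()
```

Using a `with` block also closes the file, so the data is flushed to disk before anything reads it back.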

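Next, you are trusting the HTTP server to know what encoding the HTML data uses. This is often not set at all, so don't use response.text! BeautifulSoup is not at fault here; you are re-encoding Mojibake. When the server doesn't explicitly specify an encoding, the requests library defaults to Latin-1 for text/* content types, because the HTTP standard states that this is the default.

See the Encoding section of the Advanced documentation:

The only time Requests will not do this is if no explicit charset is present in the HTTP headers and the Content-Type header contains text. In this situation, RFC 2616 specifies that the default charset must be ISO-8859-1. Requests follows the specification in this case. If you require a different encoding, you can manually set the Response.encoding property, or use the raw Response.content.

Bold emphasis mine.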
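To illustrate why the Latin-1 fallback matters, here is a sketch that uses plain bytes in place of a real HTTP response (the `body` value is made up for the example). Decoding UTF-8 bytes as Latin-1 never fails; it just silently produces the wrong characters.

```python
# Hypothetical response body: UTF-8 bytes, as a server might send them
# without declaring a charset in the Content-Type header.
body = "caf\u00e9 \u2212 1".encode("utf-8")

# Roughly what response.text yields when the ISO-8859-1 fallback applies.
as_latin1 = body.decode("iso-8859-1")

# What the page actually says.
as_utf8 = body.decode("utf-8")

# Latin-1 maps one byte to one character, so every multi-byte UTF-8
# sequence turns into several unrelated characters.
lengths = (len(as_latin1), len(as_utf8))
```

Because every byte value is valid Latin-1, no exception is raised at this point; the damage only becomes visible when the text is displayed or written out.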

Pass in the raw response.content data instead:

soup = bs4.BeautifulSoup(res.content, 'html.parser')

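BeautifulSoup 4 usually does a good job of working out the correct encoding to use when parsing, either from an HTML <meta> tag or from statistical analysis of the bytes provided. If the server does provide a character set, you can still pass it from the response into BeautifulSoup, but first test whether requests fell back to its default: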

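# Only trust res.encoding when the server actually declared a charset
encoding = res.encoding if 'charset' in res.headers.get('content-type', '').lower() else None
# BeautifulSoup takes the encoding hint as from_encoding
soup = bs4.BeautifulSoup(res.content, 'html.parser', from_encoding=encoding)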
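The header check can be wrapped in a small helper. This is only a sketch: `declared_encoding` is a hypothetical name, and it takes a plain dict with a lowercase `content-type` key rather than a real requests response, so it can be tried without any network access.

```python
def declared_encoding(headers, reported_encoding):
    """Return reported_encoding only if the server declared a charset.

    headers: mapping with a lowercase 'content-type' key (an assumption;
        a requests response exposes a case-insensitive mapping instead).
    reported_encoding: the value res.encoding would hold.
    """
    content_type = headers.get("content-type", "").lower()
    return reported_encoding if "charset" in content_type else None


# Charset declared: trust what the server said.
with_charset = declared_encoding(
    {"content-type": "text/html; charset=UTF-8"}, "UTF-8"
)

# No charset: return None so the parser detects the encoding itself.
without_charset = declared_encoding({"content-type": "text/html"}, "ISO-8859-1")
```

The result would then be passed to `bs4.BeautifulSoup` as `from_encoding`, together with `res.content`; a value of `None` leaves the detection to BeautifulSoup.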

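【Comments】:

One more problem. I am now getting strange characters like âˆ’ instead of the regular -.
Is there a way to store this data in binary format?
@SanJeetSingh: You are using response.text. Don't; for HTTP text responses with no charset set in Content-Type it falls back to Latin-1, which is invariably wrong. Use response.content and leave it to BeautifulSoup to work out which encoding to use.
@SanJeetSingh: You have text, so treat it as text; just fix the Mojibake you introduced by using response.text.
@SanJeetSingh: A binary format doesn't make much sense for text data that contains non-ASCII characters. And the minus sign (Unicode character U+2212) will be hard to find in a non-Unicode character set - âˆ’ is '\xe2\x88\x92', the UTF-8 representation of u'\u2212'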
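The byte values in the last comment can be checked directly, and the same round trip repairs text that was already mis-decoded. The sketch assumes the stray characters came from a cp1252 decode, which is what the âˆ’ sequence suggests; if the text was decoded as strict Latin-1, substitute 'latin-1' for 'cp1252'.

```python
minus = "\u2212"                    # U+2212 MINUS SIGN
utf8_bytes = minus.encode("utf-8")  # three bytes: e2 88 92

# Decoding those bytes with a single-byte codec gives three wrong characters.
mojibake = utf8_bytes.decode("cp1252")

# Reverse the wrong decode to get the bytes back, then decode them correctly.
repaired = mojibake.encode("cp1252").decode("utf-8")
```

This repair only works while the wrong decode is still reversible, so it is safer to avoid the problem by parsing response.content in the first place.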