UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026'

[Question title]: UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026'
[Posted]: 2013-04-21 12:40:47
[Question]:

I am learning urllib2 and Beautiful Soup, and on my first tests I am getting errors like:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 10: ordinal not in range(128)

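There seem to be a lot of posts about this type of error, and I have tried the solutions I can understand, but each of them seems to come with a catch-22, e.g.:

I want to print post.text (where text is a Beautiful Soup attribute that returns only the text). Both str(post.text) and post.text produce unicode errors (on things like a right apostrophe ' and ...).

So I add post = unicode(post) above str(post.text), and then I get: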

AttributeError: 'unicode' object has no attribute 'text'

I also tried (post.text).encode() and (post.text).renderContents(). The latter produces the error:

AttributeError: 'unicode' object has no attribute 'renderContents'

Then I tried str(post.text).renderContents() and got the error:

AttributeError: 'str' object has no attribute 'renderContents'

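It would be great if I could declare something like 'make this content interpretable' once at the top of the document and still have access to the text attribute I need.

Update: after suggestions:

If I add post = post.decode("utf-8") above str(post.text), I get: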

TypeError: unsupported operand type(s) for -: 'str' and 'int'

If I add post = post.decode() above str(post.text), I get:

AttributeError: 'unicode' object has no attribute 'text'

If I add post = post.encode("utf-8") above (post.text), I get:

AttributeError: 'str' object has no attribute 'text'

I tried print post.text.encode('utf-8') and got:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 39: ordinal not in range(128)

For the sake of trying anything that might work, I installed lxml for Windows from here and used it via:

parsed_content = BeautifulSoup(original_content, "lxml")

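as per http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters.

These steps did not seem to make a difference.

I am using Python 2.7.4 and Beautiful Soup 4.

Solution:

After getting a deeper understanding of unicode, utf-8 and the Beautiful Soup types, it turned out the problem was in how I was printing. I removed all of my str calls and concatenations, e.g. str(something) + post.text + str(something_else), so that it became something, post.text, something_else, and it now seems to print fine, except that I have less control over the formatting at this stage (e.g. spaces are inserted at each ,).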
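One possible way to get the formatting control back is sketched below, under the assumption that all pieces are joined as text first and encoded only once on the way out. The names build_line, post_text and the " - " separator are hypothetical and only for illustration; post_text stands in for post.text. The sketch is written to run on Python 3 and should behave the same on Python 2.7.

```python
def build_line(before, text, after):
    """Join the pieces as text, with explicit control over the separators."""
    # u"..." literals are text (unicode) on both Python 2.7 and Python 3.3+;
    # format() converts the non-string arguments without mixing bytes and text.
    return u"{0} - {1} - {2}".format(before, text, after)


post_text = u"It\u2019s here\u2026"  # hypothetical stand-in for post.text
line = build_line(1, post_text, 2)

# Encode once, at the output boundary, instead of wrapping pieces in str().
encoded_line = line.encode("utf-8")
print(repr(encoded_line))
```

Because the separator lives in the format string, the spacing is no longer dictated by the commas in the print statement.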

[Comments]:

Possible duplicate of Easy Q: UnicodeEncodeError: 'ascii' codec can't encode character

Tags: python python-2.7 unicode beautifulsoup urllib2

[Answer 1]:

Have you tried .decode() or .decode("utf-8")?

Also, I recommend using lxml with the html5lib parser:

http://lxml.de/html5parser.html

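[Comments]:

I tried these and added the results to the original post. I have only just learned the basics of Beautiful Soup and urllib2, which took me about two weeks; do I really need to learn two more programs? lxml looks hard to me, which is why I chose Beautiful Soup, as I found it easier to understand. Just to reiterate, I am only trying to get 'simple' English text, and it is balking at common elements such as a right apostrophe ' and ...

[Answer 2]:

In Python 2, unicode objects can only be printed if they can be converted to ASCII. If the text cannot be encoded in ASCII, you get that error. You probably want to encode it explicitly and then print the resulting str: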

print post.text.encode('utf-8')
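A minimal sketch of the difference between the two encodings, assuming the text contains the horizontal ellipsis U+2026 from the error message. The variable text is a hypothetical stand-in for post.text; the sketch is written for Python 3, where every str is already unicode, and the same calls behave the same way on a Python 2 unicode object.

```python
text = u"Wait\u2026"  # hypothetical stand-in for post.text

# ASCII has no representation for U+2026, so this encode fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 can represent every code point; U+2026 becomes three bytes.
utf8_bytes = text.encode("utf-8")
print(ascii_ok, repr(utf8_bytes))
```

The first byte of the encoded ellipsis is 0xe2, the same byte that the UnicodeDecodeError in the question complains about.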

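[Comments]:

+ '\n\n' + post.text.encode("utf-8") + '\n\n' gives UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 39: ordinal not in range(128)
Then I printed type(post) to see what I am working with, and it is <class 'bs4.element.Tag'>.
@user1063287: encode can't raise a UnicodeDecodeError. What's the traceback?
@user1063287: I guess what I'm trying to say is that I need more context on that. I know that post.text.encode('utf-8') by itself should work fine; it's just that something else is trying to decode it, and you haven't shown the code that is doing that. If you could edit your question to include more context on where it is used, that would be helpful.
@user1063287: Basically, Python 2 has this weird str and unicode thing going on. If you concatenate them, it implicitly encodes or decodes (I forget which) as ASCII so that they are the same type. Of course, when you are dealing with non-ASCII data you can't do that: you have to explicitly make sure everything is the same type. Python 3 fixes this by raising an error if you mix them, rather than resorting to behaviour that sometimes works and sometimes doesn't.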
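For reference, when a str and a unicode object are concatenated in Python 2, the str side is decoded to unicode using the ASCII codec. The sketch below performs that decode explicitly so it can run on Python 3 as well; the variable names are hypothetical, and encoded stands in for the result of post.text.encode('utf-8').

```python
encoded = u"\u2026".encode("utf-8")  # bytes, like post.text.encode('utf-8')

# Python 2 performs this ASCII decode implicitly when a str meets a unicode
# object in a concatenation; done explicitly here, it fails the same way.
try:
    encoded.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Fix: make both sides text before joining, decoding with the right codec.
joined = u"\n\n" + encoded.decode("utf-8") + u"\n\n"
print(error)
```

The printed message names byte 0xe2, which matches the traceback quoted in the comments above.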

[Answer 3]:

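    import urllib.request  # Python 3 standard library
    from bs4 import BeautifulSoup

    html = urllib.request.urlopen(THE_URL).read()  # THE_URL: placeholder for the page URL
    soup = BeautifulSoup(html)
    print("'" + str(soup.encode("ascii")) + "'")  # encode the document to ASCII bytes before printing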

Works for me ;-)

[Comments]: