python UnicodeEncodeError > How can I simply remove troubling unicode characters? Answers

【Question title】: python UnicodeEncodeError > How can I simply remove troubling unicode characters?
【Posted】: 2011-07-11 07:43:57
【Question description】:

This is what I did...

>>> soup = BeautifulSoup (html)
>>> soup
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 96953: ordinal not in range(128)
>>> 
>>> soup.find('div')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 11035: ordinal not in range(128)
>>> 
>>> soup.find('span')
<span id="navLogoPrimary" class="navSprite"><span>amazon.com</span></span>
>>>

How can I simply remove the troubling unicode characters from the html?
Or is there a cleaner solution?

【Question discussion】:

Tags: python parsing unicode html-parsing

【Solution 1】:

Try this: soup = BeautifulSoup(html.decode('utf-8', 'ignore'))

【Discussion】:

It doesn't work! Here is what happened:
>>> html.decode('utf-8', 'strip')
Traceback (most recent call last):
.....
LookupError: unknown error handler name 'strip'
>>> html.decode('utf-8')
Traceback (most recent call last):
.....
UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 98071: unexpected code byte

Very sorry, it should be 'ignore' rather than 'strip'. I also suggest reading the Unicode HOWTO: docs.python.org/howto/unicode.html
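
The difference between the error handlers mentioned in this exchange can be seen on a small input. This is a minimal sketch, not the poster's code: the `raw` bytes are a made-up sample, and it is written for Python 3, where byte strings and text are separate types (in the Python 2 session above, `html` was a `str` holding bytes). A lone 0xae byte is not valid UTF-8, which is why the strict decode fails; if the page is really Latin-1, decoding with that codec keeps the ® character instead of dropping it.

```python
# Hypothetical sample: 0xae is the registered sign in Latin-1,
# but it is not valid UTF-8 when it appears on its own.
raw = b"<span>amazon\xae</span>"

try:
    raw.decode("utf-8")  # strict decoding is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = raw.decode("utf-8", "ignore")    # silently drops the bad byte
replaced = raw.decode("utf-8", "replace")  # substitutes U+FFFD for it
latin1 = raw.decode("latin-1")             # keeps it, assuming Latin-1 input
```

Note that `'ignore'` hides the problem rather than solving it: any character that fails to decode is lost, so it is worth checking the page's declared encoding first.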

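【Solution 2】:

I had the same problem and spent hours on it. Note that the error occurs whenever the interpreter has to display the content: it tries to convert the content to ASCII, and that conversion is what fails. Take a look at the top answer here: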

UnicodeEncodeError with BeautifulSoup 3.1.0.1 and Python 2.5.2
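
Since the failure happens at display time, one way to sidestep it is to encode explicitly before printing and to choose what happens to characters that ASCII cannot represent. A minimal sketch under assumptions: the `markup` string is invented for illustration, and the code targets Python 3 (the thread itself is Python 2, where the same `encode` error handlers are available).

```python
markup = "<span>amazon\xae</span>"  # hypothetical text containing the registered sign

try:
    markup.encode("ascii")  # strict: fails on the non-ASCII character
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Three ways to get an ASCII-only byte string without an exception:
as_entities = markup.encode("ascii", "xmlcharrefreplace")  # numeric HTML entity
as_escapes = markup.encode("ascii", "backslashreplace")    # Python-style escape
stripped = markup.encode("ascii", "ignore")                # character removed
```

For HTML, `xmlcharrefreplace` is usually the least lossy choice, because a browser renders `&#174;` as the original character.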

【Discussion】:

【Solution 3】:

The error you are seeing is caused by repr(soup) trying to mix Unicode and bytestrings. Mixing Unicode and bytestrings frequently leads to errors.

Compare:

>>> u'1' + '©'
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

And:

>>> u'1' + u'©'
u'1\xa9'
>>> '1' + u'©'
u'1\xa9'
>>> '1' + '©'
'1\xc2\xa9'

Here is an example with classes:

>>> class A:
...     def __repr__(self):
...         return u'copyright ©'.encode('utf-8')
... 
>>> A()
copyright ©
>>> class B:
...     def __repr__(self):
...         return u'copyright ©'
... 
>>> B()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)
>>> class C:
...     def __repr__(self):
...         return repr(A()) + repr(B())
...
>>> C()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "<input>", line 3, in __repr__
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)

Something similar happens with BeautifulSoup:

>>> html = """<p>©"""
>>> soup = BeautifulSoup(html)
>>> repr(soup)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 3: ordinal not in range(128)

Workaround:

>>> unicode(soup)
u'<p>\xa9</p>'
>>> str(soup)
'<p>\xc2\xa9</p>'
>>> soup.encode('utf-8')
'<p>\xc2\xa9</p>'
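
The general rule behind this answer is to convert everything to one type at the boundary instead of relying on implicit conversion. A minimal sketch, assuming UTF-8 input: `to_text` is a hypothetical helper, not part of BeautifulSoup, and the code is written for Python 3, where mixing text and bytes raises a `TypeError` immediately instead of attempting an ASCII conversion as Python 2 does.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a bytestring."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

copyright_bytes = "\xa9".encode("utf-8")  # the two bytes 0xc2 0xa9

try:
    "1" + copyright_bytes  # mixing text and bytes directly
    mixing_failed = False
except TypeError:
    mixing_failed = True

# Normalizing every piece to text first makes the concatenation safe.
joined = "".join(to_text(piece) for piece in ["1", copyright_bytes])
```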

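【Discussion】:

【Solution 4】:

First of all, the "troubling" unicode characters may be letters in some language. But assuming you don't have to worry about non-English characters, you can use a Python library to convert the Unicode to ASCII. Take a look at the answer to this question: How do I convert a file's format from Unicode to ASCII using Python?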

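The accepted answer there seems like a good solution (one I was not aware of beforehand).

【Discussion】:

That solution doesn't work for me because html is not unicode, it is just a str [>>> unicodedata.normalize('NFKD', html).encode('ascii','ignore') Traceback (most recent call last): File "", line 1, in TypeError: normalize() argument 2 must be unicode, not str]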
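
The TypeError in the comment above arises because `unicodedata.normalize` expects text, not a byte string, so the bytes have to be decoded first. A minimal sketch, assuming the input is UTF-8 (the page in the question may well use another encoding); `to_ascii` is a hypothetical helper name, and the code is written for Python 3.

```python
import unicodedata

def to_ascii(raw, encoding="utf-8"):
    """Decode raw if it is bytes, then drop everything that is not ASCII."""
    if isinstance(raw, bytes):
        # Bytes that are invalid in the assumed encoding are discarded.
        text = raw.decode(encoding, "ignore")
    else:
        text = raw
    # NFKD splits accented letters into a base letter plus combining marks,
    # so the base letter survives the ASCII step below.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

accented = to_ascii("caf\xe9".encode("utf-8"))  # accent removed, letter kept
invalid = to_ascii(b"amazon\xae")               # stray byte dropped
```

Characters with no ASCII base form, such as © or ®, are removed entirely, which matches the "just remove them" request but loses information.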