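The error you are seeing happens because repr(soup) tries to mix Unicode strings and bytestrings. In Python 2, mixing the two often leads to errors, because the interpreter converts between them implicitly using the ASCII codec.
Compare: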
>>> u'1' + '©'
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
And:
>>> u'1' + u'©'
u'1\xa9'
>>> '1' + u'©'
u'1\xa9'
>>> '1' + '©'
'1\xc2\xa9'
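The failing case above can be reproduced by spelling out the conversion that Python 2 does implicitly. This is a minimal sketch, not part of the original session: the variable names are made up for illustration, and the byte string is written with escapes so the result does not depend on the terminal encoding.

```python
# -*- coding: utf-8 -*-
# Runs on Python 2 and Python 3.
copyright_bytes = b'\xc2\xa9'  # UTF-8 bytes for the copyright sign

# Python 2 performs this ASCII decode implicitly when evaluating u'1' + '©'.
try:
    copyright_bytes.decode('ascii')
    implicit_error = None
except UnicodeDecodeError as exc:
    implicit_error = exc

# Decoding explicitly with the actual encoding avoids the error.
combined = u'1' + copyright_bytes.decode('utf-8')
print(repr(combined))
```

The point is that the bytes are decoded once, with a known encoding, before they touch any Unicode string.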
Here is an example with classes:
>>> class A:
...     def __repr__(self):
...         return u'copyright ©'.encode('utf-8')
...
>>> A()
copyright ©
>>> class B:
...     def __repr__(self):
...         return u'copyright ©'
...
>>> B()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)
>>> class C:
...     def __repr__(self):
...         return repr(A()) + repr(B())
...
>>> C()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "<input>", line 3, in __repr__
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 10: ordinal not in range(128)
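One possible way to keep a `__repr__` like the one in class B from failing is to escape non-ASCII characters before returning. The sketch below is only an illustration: the class name `EscapedRepr` and its `text` attribute are hypothetical, and the version check is there only so the snippet runs on both Python 2 and Python 3.

```python
import sys

class EscapedRepr(object):  # hypothetical class, not from the session above
    text = u'copyright \xa9'

    def __repr__(self):
        # Escape non-ASCII characters so the result is plain ASCII.
        escaped = self.text.encode('ascii', 'backslashreplace')
        # Python 2 expects a byte string from __repr__; Python 3 expects str.
        if sys.version_info[0] == 2:
            return escaped
        return escaped.decode('ascii')

print(repr(EscapedRepr()))  # copyright \xa9
```

Because the returned value contains only ASCII, concatenating it with other repr() results, as class C does, no longer triggers an implicit encode.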
Something similar happens with BeautifulSoup:
>>> html = """<p>©"""
>>> soup = BeautifulSoup(html)
>>> repr(soup)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 3: ordinal not in range(128)
Workarounds:
>>> unicode(soup)
u'<p>\xa9</p>'
>>> str(soup)
'<p>\xc2\xa9</p>'
>>> soup.encode('utf-8')
'<p>\xc2\xa9</p>'
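The three results above are the same text in two forms: a Unicode string and its UTF-8 bytes. A small check that needs no BeautifulSoup install, using the literal values copied from the session (the variable names are made up for illustration):

```python
# Literal values copied from the session above; runs on Python 2 and 3.
as_unicode = u'<p>\xa9</p>'   # what unicode(soup) returned
as_utf8 = b'<p>\xc2\xa9</p>'  # what str(soup) and soup.encode('utf-8') returned

# Encoding the Unicode form yields the byte form, and decoding reverses it.
assert as_unicode.encode('utf-8') == as_utf8
assert as_utf8.decode('utf-8') == as_unicode
```

Which form to use depends on where the text is going: keep Unicode inside the program, and encode only when writing to a file, socket, or terminal.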