【Question title】: Parsing errors with BeautifulSoup4 & Python 3.3
【Posted】: 2013-02-15 03:25:44
【Question】:

Running this code:

from bs4 import BeautifulSoup
soup = BeautifulSoup (open("my.html"))
print(soup.prettify())

produces this error:

Traceback (most recent call last):
  File "soup.py", line 5, in <module>
    print(soup.prettify())
  File "C:\Python33\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u25ba' in position
9001: character maps to <undefined>

Then I tried:

print(soup.encode('UTF-8').prettify())

But this fails, because soup.encode('UTF-8') returns a bytes object, and bytes has no prettify method:

Traceback (most recent call last):
  File "soup.py", line 11, in <module>
    print(soup.encode('UTF-8').prettify())
AttributeError: 'bytes' object has no attribute 'prettify'

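I don't know how to resolve this. Any input would be appreciated.

【Comments】:

  • Try decoding the string from bytes first: bytes.decode(my.html)
  • I couldn't get this to work with Beautiful Soup (AttributeError: 'str' object has no attribute...)

Tags: python parsing encoding python-3.x beautifulsoup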


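【Solution 1】:

Your (Windows) console is using the cp437 encoding, and the soup contains a character that this encoding cannot represent. By default, Python raises an exception in that case, but you can change that behavior by choosing a different error handler for standard output.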

import sys,io
from bs4 import BeautifulSoup
sys.stdout = io.TextIOWrapper(sys.stdout.buffer,'cp437','backslashreplace')
soup = BeautifulSoup (open("my.html"))
print(soup.prettify())
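
To see what the 'backslashreplace' error handler does in isolation, here is a minimal sketch that wraps an in-memory byte stream instead of the real console. The helper name make_lenient_writer and the sample text are illustrative assumptions, not part of the original answer; only io.TextIOWrapper and io.BytesIO come from the standard library.

```python
import io


def make_lenient_writer(binary_stream, encoding="cp437"):
    """Hypothetical helper: wrap a binary stream so that characters the
    encoding cannot represent are written as backslash escapes instead
    of raising UnicodeEncodeError."""
    return io.TextIOWrapper(binary_stream, encoding=encoding,
                            errors="backslashreplace")


# An in-memory stream stands in for sys.stdout.buffer here.
buffer = io.BytesIO()
writer = make_lenient_writer(buffer)

# U+25BA is the character from the traceback; cp437 cannot encode it.
writer.write("play \u25ba button")
writer.flush()

output = buffer.getvalue()
print(output)
```

The unsupported character ends up in the output as the literal text \u25ba, while everything else is encoded normally. On Python 3.7 and later, sys.stdout.reconfigure(errors='backslashreplace') should achieve the same effect on the real console without replacing sys.stdout.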

Alternatively, write the soup to a file and open it with an editor that supports the file's encoding:

# On Windows, utf-8-sig will allow the file to be read by Notepad.
with open('out.txt','w',encoding='utf-8-sig') as f:
   f.write(soup.prettify())
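
As a rough check of what utf-8-sig does, the sketch below writes a sample string to a temporary file and reads it back. The string html_text is a stand-in for soup.prettify(), and the temporary path is an assumption made so the example runs anywhere; neither comes from the original answer.

```python
import os
import tempfile

# Stand-in for soup.prettify(); contains the character from the traceback.
html_text = "<p>\u25ba play</p>"

# Write to a throwaway directory rather than the current one.
path = os.path.join(tempfile.mkdtemp(), "out.txt")

with open(path, "w", encoding="utf-8-sig") as f:
    f.write(html_text)

# Read the raw bytes: utf-8-sig prepends the UTF-8 byte order mark.
with open(path, "rb") as f:
    raw = f.read()
has_bom = raw.startswith(b"\xef\xbb\xbf")

# Reading with utf-8-sig strips the mark again.
with open(path, encoding="utf-8-sig") as f:
    round_trip = f.read()

print(has_bom, round_trip == html_text)
```

The byte order mark at the start of the file is what lets editors such as Notepad detect the file as UTF-8.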
