[Posted]: 2020-04-14 02:46:28
[Question]:
I am trying to parse a very long HTML file with lxml through BeautifulSoup. I know the file's character encoding is UTF-8 with BOM, but whenever I run contents = f.read() I get the following error: 'charmap' codec can't decode byte 0x8d in position 33222: character maps to <undefined>
Here is the first (and problematic) part of my code:
from bs4 import BeautifulSoup
with open("doc.html", "r") as f:
    contents = f.read()
soup = BeautifulSoup(contents, 'lxml')
print(soup.h2)
print(soup.head)
print(soup.li)
Here is the error output:
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-1-4805460879e0> in <module>
3 with open("doc.html", "r") as f:
4
----> 5 contents = f.read()
6
7 soup = BeautifulSoup(contents, 'lxml')
~\Anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 33222: character maps to <undefined>
[Discussion]:
- Which IDE do you use to run your code, and how do you run it? With python code.py? Are you on Windows or Unix? Can you try with open("doc.html", "r", encoding="UTF-8") as f:, and if that fails, use LATIN-1?
- That did work, thank you. :)
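The suggestion in the comment can be sketched as follows. The traceback shows Python falling back to the Windows default codec (cp1252), in which byte 0x8d has no mapping, so passing an explicit encoding avoids the error. This is a minimal sketch rather than the asker's final code: the sample string, the temporary file, and the helper name read_html are assumptions made for illustration. It also uses "utf-8-sig" instead of "UTF-8", a standard Python codec that additionally strips the BOM the question mentions.

```python
import os
import tempfile

# Hypothetical sample file: UTF-8 with a BOM, containing "č" (bytes C4 8D),
# so it includes the byte 0x8d that cp1252 cannot decode.
sample = "<h2>Počítač</h2>"
path = os.path.join(tempfile.mkdtemp(), "doc.html")
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbf" + sample.encode("utf-8"))


def read_html(path, encoding="utf-8-sig"):
    """Read a text file with an explicit encoding (hypothetical helper)."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Reproduce the failure from the question by forcing the Windows default codec.
try:
    read_html(path, encoding="cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

# "utf-8-sig" decodes the text and drops the leading BOM.
contents = read_html(path)

# Plain "utf-8" also decodes, but keeps the BOM as a leading "\ufeff".
with_bom = read_html(path, encoding="utf-8")

# "latin-1" never raises, but it misreads multi-byte characters.
as_latin1 = read_html(path, encoding="latin-1")
```

The resulting contents string can then be passed to BeautifulSoup(contents, 'lxml') as in the question. Latin-1 is best treated as a last resort: it maps every byte to some character, so it hides the error without decoding the text correctly.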
Tags: python html encoding beautifulsoup lxml