Convert from windows-1252 to utf-8 in python

【Question title】: Convert from windows-1252 to utf-8 in python
【Posted】: 2021-04-09 16:38:09
【Question description】:

I want to convert from windows-1252 to utf-8 in Python, and I wrote this code:

def encode(input_file, output_file):
        f = open(input_file, "r")
        data = f.read()
        f.close()

        # Convert from Windows-1252 to UTF-8
        encoded = data.encode('Windows-1252').decode('utf-8')
        with safe_open_w(output_file) as f:
            f.write(encoded)

But I get this error:

encoded = data.encode('Windows-1252').decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 5653: invalid continuation byte

I am trying to convert an HTML file that has this meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

【Comments】:

If you are reading and writing text files, just pass the encoding as an argument when opening them: f = open(input_file, "r", encoding='Windows-1252') and with safe_open_w(output_file, encoding='utf8') as f:

Tags: python encoding utf-8 decoding windows-1252

【Solution 1】:

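You are converting in the wrong direction. You want to decode from cp1252 and then encode to UTF-8. The latter is not really necessary, though; Python already does it for you.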

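When you decode something, the input should be bytes and the result is a Python string. Writing a string to a file already converts it implicitly, and you can in fact do the same on the reading side by specifying an encoding.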
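
To make the direction concrete, here is a minimal sketch at the bytes level. The sample text is an assumption chosen for illustration; only the 0xe8 byte and the error wording mirror the question.

```python
# Bytes as they would sit on disk in a Windows-1252 file
# (the sample text is an assumption, not taken from the asker's file).
raw = b"cr\xe8me br\xfbl\xe9e"

# decode: bytes -> str, interpreting the bytes as cp1252
text = raw.decode("cp1252")

# encode: str -> bytes, this time as UTF-8
utf8_bytes = text.encode("utf-8")

# Treating the cp1252 bytes as UTF-8 fails, because 0xe8 starts a
# multi-byte UTF-8 sequence and the next byte is not a continuation byte.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = str(exc)

print(text)
print(utf8_bytes)
print(error)
```

In the question, `open(input_file, "r")` had already decoded the file into a string, so the extra `.encode('Windows-1252').decode('utf-8')` round trip ended up reinterpreting cp1252 bytes as UTF-8, which is the failing step in this sketch.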

Also, reading the entire file into memory is inelegant and wasteful.

with open(input_file, 'r', encoding='cp1252') as inp,\
        open(output_file, 'w', encoding='utf-8') as outp:
    for line in inp:
        outp.write(line)
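
If the file has very long lines (minified HTML, for example), line-by-line iteration can still pull a lot into memory. A possible variant, sketched here under the assumption that fixed-size chunks are acceptable, reads a bounded number of characters at a time. The function name `convert_cp1252_to_utf8` and the `chunk_size` default are made up for this example.

```python
def convert_cp1252_to_utf8(input_file, output_file, chunk_size=64 * 1024):
    """Re-encode a text file from Windows-1252 to UTF-8 in fixed-size chunks."""
    # newline="" disables newline translation, so the original line
    # endings (\n or \r\n) are copied through unchanged.
    with open(input_file, "r", encoding="cp1252", newline="") as inp, \
            open(output_file, "w", encoding="utf-8", newline="") as outp:
        while True:
            chunk = inp.read(chunk_size)  # at most chunk_size characters
            if not chunk:
                break
            outp.write(chunk)
```

Python's cp1252 codec leaves a handful of byte values undefined (0x81, 0x8d, 0x8f, 0x90, 0x9d), so a file containing them will still raise UnicodeDecodeError; that would be a hint that the input is not really Windows-1252.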

【Discussion】:

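Thanks for the suggestion. I tried it, but it still doesn't work. My problem is related to this one: stackoverflow.com/questions/65534264/…
I'm guessing your real file is double- or triple-encoded. Without seeing the actual bytes, we can't really tell. Maybe see meta.stackoverflow.com/questions/379403/…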
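
For readers wondering what "double-encoded" looks like in practice, the sketch below reproduces it on a made-up string and shows one way to look at a file's raw bytes. The sample text and the `peek_bytes` helper are assumptions for illustration; nothing here is taken from the asker's actual file.

```python
original = "cr\u00e8me"  # "crème"

# Double encoding: UTF-8 bytes mistakenly decoded as cp1252 (mojibake) ...
mojibake = original.encode("utf-8").decode("cp1252")

# ... and then saved as UTF-8 again, which bakes the mistake into the file.
double_encoded_bytes = mojibake.encode("utf-8")

# Undoing one layer: go back to the bytes, then decode them as UTF-8.
repaired = mojibake.encode("cp1252").decode("utf-8")


def peek_bytes(path, count=200):
    """Return the first `count` raw bytes of a file (hypothetical helper)."""
    with open(path, "rb") as f:
        return f.read(count)


print(repr(mojibake))
print(double_encoded_bytes)
print(repr(repaired))
```

Printing the result of `peek_bytes` on the real file and posting that output is the kind of evidence the comment asks for: a single 0xe8 byte for "è" suggests plain cp1252, while sequences like 0xc3 0x83 0xc2 0xa8 suggest the double encoding shown here.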