【Posted】: 2021-03-12 10:17:31
【Question】:
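I'm new to Python, and to coding in general. Any help is greatly appreciated.
I have more than 3,000 text files in a single directory, in a mix of encodings. I need to convert them all to a single encoding (e.g. UTF-8) for further NLP work. When I checked the file types from the shell, I found the following encodings: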
Algol 68 source text, ISO-8859 text, with very long lines
Algol 68 source text, Little-endian UTF-16 Unicode text, with very long lines
Algol 68 source text, Non-ISO extended-ASCII text, with very long lines
Algol 68 source text, Non-ISO extended-ASCII text, with very long lines, with LF, NEL line terminators
ASCII text
ASCII text, with very long lines
data
diff output text, ASCII text
ISO-8859 text, with very long lines
ISO-8859 text, with very long lines, with LF, NEL line terminators
Little-endian UTF-16 Unicode text, with very long lines
Non-ISO extended-ASCII text
Non-ISO extended-ASCII text, with very long lines
Non-ISO extended-ASCII text, with very long lines, with LF, NEL line terminators
UTF-8 Unicode (with BOM) text, with CRLF line terminators
UTF-8 Unicode (with BOM) text, with very long lines, with CRLF line terminators
UTF-8 Unicode text, with very long lines, with CRLF line terminators
Any ideas on how to convert text files with the encodings listed above to UTF-8?
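For reference, here is a minimal standard-library sketch of the kind of conversion being asked about. It is not a definitive solution, and it rests on several assumptions: the UTF-16 files start with a byte-order mark, anything without a BOM that is not valid UTF-8 is Windows-1252 (with Latin-1 as a last resort), the files use a `.txt` extension, and the converted copies go to a separate output directory. The names `decode_bytes`, `convert_directory`, and `FALLBACK_ENCODINGS` are made up for illustration.

```python
import codecs
from pathlib import Path

# Assumption: files with no BOM that are not valid UTF-8 are Windows-1252.
# Latin-1 is the last resort because it can decode any byte sequence.
FALLBACK_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_bytes(raw: bytes) -> str:
    """Decode raw file contents to str using a BOM check, then fallbacks."""
    if raw.startswith(codecs.BOM_UTF8):
        # "utf-8-sig" strips the leading BOM while decoding.
        return raw.decode("utf-8-sig")
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order.
        # UTF-32 is not handled in this sketch.
        return raw.decode("utf-16")
    for encoding in FALLBACK_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode input")  # not reached: latin-1 never fails


def convert_directory(src_dir: str, dst_dir: str, pattern: str = "*.txt") -> int:
    """Write UTF-8 copies of matching files into dst_dir; return the file count."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(src.glob(pattern)):
        text = decode_bytes(path.read_bytes())
        # Writing bytes keeps the original line endings untouched.
        (dst / path.name).write_bytes(text.encode("utf-8"))
        count += 1
    return count
```

A call such as `convert_directory("corpus", "corpus_utf8")` (both directory names are placeholders) would leave the originals in place. Because the Latin-1 fallback accepts any bytes, files that the shell reports as `data` would be converted without error even if they are binary, so those are worth inspecting by hand. A detection library could replace the fixed fallback list if the guess of Windows-1252 turns out to be wrong for some files.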
【Comments】: