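【Question title】: Converting all text files with multiple encodings in a directory into UTF-8 encoded text files
【Posted】: 2021-03-12 10:17:31
【Question】:

I am new to Python and to encodings in general. Any help is greatly appreciated.

I have more than 3000 text files in a single directory, in a mix of encodings. I need to convert them to a single encoding (e.g. utf8) for further NLP work. When I checked the file types from the shell, I found the following encodings: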

Algol 68 source text, ISO-8859 text, with very long lines
Algol 68 source text, Little-endian UTF-16 Unicode text, with very long lines
Algol 68 source text, Non-ISO extended-ASCII text, with very long lines
Algol 68 source text, Non-ISO extended-ASCII text, with very long lines, with LF, NEL line terminators
ASCII text
ASCII text, with very long lines
data
diff output text, ASCII text
ISO-8859 text, with very long lines
ISO-8859 text, with very long lines, with LF, NEL line terminators
Little-endian UTF-16 Unicode text, with very long lines
Non-ISO extended-ASCII text
Non-ISO extended-ASCII text, with very long lines
Non-ISO extended-ASCII text, with very long lines, with LF, NEL line terminators
UTF-8 Unicode (with BOM) text, with CRLF line terminators
UTF-8 Unicode (with BOM) text, with very long lines, with CRLF line terminators
UTF-8 Unicode text, with very long lines, with CRLF line terminators

Any ideas on how to convert text files with the encodings above into utf-8 encoded text files?

【Comments】:

    Tags: python encoding utf-8


    【Answer 1】:

    I ran into the same problem and solved it in two steps.

    The code is as follows:

    import os, sys, codecs
    import chardet  # third-party package: pip install chardet
    

    First, use the chardet package to detect the encoding of each file.

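    path = "texts"  # placeholder: set this to the directory that holds the text files

    for text in os.listdir(path):
        txtPATH = os.path.join(path, text)
        if not os.path.isfile(txtPATH):
            continue  # skip sub-directories

        # Read the raw bytes so chardet can guess the encoding.
        with open(txtPATH, 'rb') as f:
            data = f.read()
        f_charInfo = chardet.detect(data)
        # 'encoding' is None when chardet cannot guess; str() turns that into 'None'.
        coding = str(f_charInfo['encoding'])
        print(coding)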
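
    chardet's result is a statistical guess, but some of the files listed in the question (UTF-8 with BOM, little-endian UTF-16) may start with a byte order mark that identifies the encoding exactly. As a minimal sketch, assuming those files really do carry a BOM, the raw bytes could be checked first. `sniff_bom` is a hypothetical helper name, not part of chardet or of the original answer.

```python
import codecs

# BOM prefixes mapped to a Python codec that decodes the data and strips the BOM.
# UTF-32 is tested before UTF-16 because BOM_UTF32_LE begins with BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data):
    """Return a codec name if `data` (bytes) starts with a known BOM, else None."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return codec
    return None
```

    Inside the loop, the result could be combined with chardet as `sniff_bom(data) or f_charInfo['encoding']`, so that chardet's guess is only used when no BOM is present.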
    

    Second, if the detected encoding is not utf-8, rewrite the file in the directory as utf-8.

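        # Skip files that are already UTF-8 (plain ASCII is valid UTF-8 too)
        # and files whose encoding chardet could not guess ('None').
        if coding.lower() not in ('utf-8', 'ascii', 'none'):
            print(txtPATH)
            print(coding)

            # Decode the file with the detected encoding...
            with codecs.open(txtPATH, "r", coding) as sourceFile:
                contents = sourceFile.read()

            # ...then overwrite it with the same text encoded as UTF-8.
            with codecs.open(txtPATH, "w", "utf-8") as targetFile:
                targetFile.write(contents)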
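
    The rewrite above stops the whole loop on the first file that fails to decode, and an interrupted write can leave a truncated file. One possible safeguard, sketched here with only the standard library, is to decode first, write to a temporary file in the same directory, and swap it in only on success. `convert_to_utf8` is a hypothetical helper, not part of the original answer.

```python
import os
import tempfile


def convert_to_utf8(file_path, source_encoding):
    """Re-encode one file as UTF-8. Return True if converted, False if left untouched."""
    with open(file_path, "rb") as src:
        raw = src.read()
    try:
        text = raw.decode(source_encoding)
    except (UnicodeDecodeError, LookupError):
        # Bytes do not fit the guessed encoding, or the codec name is unknown:
        # leave the original file exactly as it is.
        return False

    # Write next to the original, then swap, so the original is only
    # replaced once the UTF-8 copy has been written completely.
    folder = os.path.dirname(os.path.abspath(file_path))
    fd, tmp_path = tempfile.mkstemp(dir=folder, suffix=".tmp")
    with os.fdopen(fd, "wb") as dst:
        dst.write(text.encode("utf-8"))
    os.replace(tmp_path, file_path)
    return True
```

    A call such as `convert_to_utf8(txtPATH, coding)` could take the place of the two `codecs.open` blocks. Note that the converted file inherits the restrictive permissions of the `tempfile.mkstemp` file rather than those of the original, which may or may not matter for your setup.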
    

    Hope this helps! Thanks.

    【Comments】:
