【Question title】: Merging unicode csv files python 2.7
【Posted】: 2017-06-07 17:20:31
【Question description】:

I have this code snippet:

import csv, sys, os
rootdir = sys.argv[1]
for root,subFolders, files in os.walk(rootdir):
    outfileName = rootdir + "\\root-dir.csv" # hardcoded path
    #for subdir in subFolders:
    for file in files:
        filePath = os.path.join(root, file)
        with open(filePath) as csvin:
            readfile = csv.reader(csvin, delimiter=',')
            with open(outfileName, 'a') as csvout:
                writefile = csv.writer(csvout, delimiter=',', lineterminator='\n')
                for row in readfile:
                    row.extend([file])
                    writefile.writerow(row)
                csvout.close()
            csvin.close()
print("Ready!")

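It works for ASCII files, but not for Unicode ones. Here is a sample of the autorun log files: https://cloud.mail.ru/public/6Gqc/MKjKaqs8B. I need to merge several of these files into one. How can I change this code to do that? It needs to work on Python 2.7.

Thank you in advance!

【Question discussion】:

    Tags: python python-2.7 csv unicode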


    【Solution 1】:

    The Python 2 csv module documentation has a good example of reading from and writing to unicode CSVs.

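    import csv, codecs, cStringIO

    class UTF8Recoder:
        """
        Iterator that reads an encoded stream and reencodes the input to UTF-8
        """
        def __init__(self, f, encoding):
            self.reader = codecs.getreader(encoding)(f)

        def __iter__(self):
            return self

        def next(self):
            return self.reader.next().encode("utf-8")

    class UnicodeReader: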
        """
        A CSV reader which will iterate over lines in the CSV file "f",
        which is encoded in the given encoding.
        """
    
        def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
            f = UTF8Recoder(f, encoding)
            self.reader = csv.reader(f, dialect=dialect, **kwds)
    
        def next(self):
            row = self.reader.next()
            return [unicode(s, "utf-8") for s in row]
    
        def __iter__(self):
            return self
    
    class UnicodeWriter:
        """
        A CSV writer which will write rows to CSV file "f",
        which is encoded in the given encoding.
        """
    
        def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
            # Redirect output to a queue
            self.queue = cStringIO.StringIO()
            self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
            self.stream = f
            self.encoder = codecs.getincrementalencoder(encoding)()
    
        def writerow(self, row):
            self.writer.writerow([s.encode("utf-8") for s in row])
            # Fetch UTF-8 output from the queue ...
            data = self.queue.getvalue()
            data = data.decode("utf-8")
            # ... and reencode it into the target encoding
            data = self.encoder.encode(data)
            # write to the target stream
            self.stream.write(data)
            # empty queue
            self.queue.truncate(0)
    
        def writerows(self, rows):
            for row in rows:
                self.writerow(row)
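
As a rough illustration of how the merge loop from the question could be made encoding-aware, here is a minimal sketch. It does not use the recipe classes; instead it relies on `io.open`, which decodes the input for us, and it is written to run on both Python 2.7 and Python 3. The names `read_rows`, `merge_csv_tree` and `in_encoding` are hypothetical, the output is assumed to always be UTF-8, and skipping non-`.csv` files and the output file itself is an assumption about what the asker wants.

```python
import csv
import io
import os
import sys

PY2 = sys.version_info[0] == 2


def read_rows(path, encoding):
    """Yield each CSV row of `path` as a list of text strings."""
    with io.open(path, "r", encoding=encoding, newline="") as f:
        if PY2:
            # Python 2's csv module only handles byte strings,
            # so feed it UTF-8 and decode the cells afterwards.
            lines = (line.encode("utf-8") for line in f)
            for row in csv.reader(lines):
                yield [cell.decode("utf-8") for cell in row]
        else:
            for row in csv.reader(f):
                yield row


def merge_csv_tree(rootdir, out_path, in_encoding="utf-8"):
    """Append every row of every .csv under rootdir to out_path (UTF-8)."""
    out_abs = os.path.abspath(out_path)
    if PY2:
        out = open(out_path, "wb")
    else:
        out = io.open(out_path, "w", encoding="utf-8", newline="")
    with out:
        writer = csv.writer(out, lineterminator="\n")
        for root, _dirs, files in os.walk(rootdir):
            for name in sorted(files):
                path = os.path.join(root, name)
                # Assumption: skip the output file and anything that is not a .csv
                if os.path.abspath(path) == out_abs:
                    continue
                if not name.lower().endswith(".csv"):
                    continue
                for row in read_rows(path, in_encoding):
                    if PY2:
                        row = [cell.encode("utf-8") for cell in row]
                    row.append(name)  # file name is appended as-is
                    writer.writerow(row)
```

A call might look like `merge_csv_tree(sys.argv[1], os.path.join(sys.argv[1], "root-dir.csv"), in_encoding="utf-16")`. The output file is opened once, outside the loop, rather than being reopened in append mode for every input file as in the question.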
    

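    【Discussion】:

    • I tried using it, but it does not read the data correctly. When I tried to open an original file, it threw an error: 'utf8' codec can't decode byte 0xff in position 0. When I removed the first 2 bytes from the file, it threw an error: line contains NULL byte
    • @Oleg It sounds like your data file is UTF-16, not UTF-8.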
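
The two errors in the comment above fit that diagnosis: a UTF-16 little-endian file starts with the byte order mark `FF FE`, and it stores each ASCII character with an extra zero byte, which is what the csv module reports as a NULL byte. A minimal sketch for checking a file's first bytes before choosing the `encoding` argument could look like this. The function name `guess_encoding` and the fallback of `"utf-8"` are hypothetical, and a file without a BOM cannot be identified this way.

```python
import codecs

# Longer BOMs are checked first, because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(path, default="utf-8"):
    """Return a codec name based on the file's byte order mark, if any."""
    with open(path, "rb") as f:
        head = f.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default
```

The returned names are codecs that consume the BOM while decoding, so there is no need to strip the first bytes by hand.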