Merging unicode csv files python 2.7 (answers)

【Question title】: Merging unicode csv files python 2.7
【Posted】: 2017-06-07 17:20:31
【Question description】:

I have this code snippet:

import csv, sys, os
rootdir = sys.argv[1]
for root,subFolders, files in os.walk(rootdir):
    outfileName = rootdir + "\\root-dir.csv" # hardcoded path
    #for subdir in subFolders:
    for file in files:
        filePath = os.path.join(root, file)
        with open(filePath) as csvin:
            readfile = csv.reader(csvin, delimiter=',')
            with open(outfileName, 'a') as csvout:
                writefile = csv.writer(csvout, delimiter=',', lineterminator='\n')
                for row in readfile:
                    row.extend([file])
                    writefile.writerow(row)
                csvout.close()
            csvin.close()
print("Ready!")

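It works for ASCII files but not for Unicode ones. Here is a sample of the autorun log files: https://cloud.mail.ru/public/6Gqc/MKjKaqs8B. I need to merge several of these files into a single file. How can I change this code to do that? It needs to work on Python 2.7.

Thank you in advance!

【Question comments】:

Tags: python python-2.7 csv unicode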

【Solution 1】:

The Python 2 documentation for the csv module has a good example of reading from and writing to unicode CSVs.

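import codecs
import csv
import cStringIO

# UTF8Recoder is the helper defined in the same documentation example;
# UnicodeReader below depends on it.
class UTF8Recoder:
    """
    Iterator that reads an encoded stream and reencodes the input to UTF-8
    """
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader: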
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
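
The same idea (decode on read, hand the csv module UTF-8 on Python 2, write UTF-8 out) can be applied to the merge loop from the question. The following is only a rough sketch, not a tested drop-in replacement: `read_rows`, `merge_csv_files` and the `encoding` parameter are hypothetical names, and the default of UTF-16 for the input files is an assumption. It is written to run on both Python 2.7 and Python 3, and it skips the output file in case it sits inside the directory being walked.

```python
import csv
import io
import os
import sys

PY2 = sys.version_info[0] == 2


def read_rows(path, encoding):
    """Yield each CSV row of `path` as a list of unicode strings."""
    with io.open(path, "r", encoding=encoding, newline="") as f:
        if PY2:
            # Python 2's csv module only handles byte strings, so feed it UTF-8
            # and decode the parsed cells afterwards.
            lines = (line.encode("utf-8") for line in f)
            for row in csv.reader(lines):
                yield [cell.decode("utf-8") for cell in row]
        else:
            for row in csv.reader(f):
                yield row


def merge_csv_files(rootdir, out_path, encoding="utf-16"):
    """Append all rows found under `rootdir` to `out_path` as UTF-8,
    adding the source file name as the last column (hypothetical helper)."""
    out_abs = os.path.abspath(out_path)
    if PY2:
        out = open(out_path, "ab")
    else:
        out = io.open(out_path, "a", encoding="utf-8", newline="")
    with out:
        writer = csv.writer(out, delimiter=",", lineterminator="\n")
        for root, _dirs, files in os.walk(rootdir):
            for name in sorted(files):
                path = os.path.join(root, name)
                if os.path.abspath(path) == out_abs:
                    continue  # never read the merged output back in
                # On Python 2, os.walk returns byte-string names for a str root.
                label = name
                if isinstance(label, bytes):
                    label = label.decode(sys.getfilesystemencoding() or "utf-8")
                for row in read_rows(path, encoding):
                    row.append(label)
                    if PY2:
                        row = [cell.encode("utf-8") for cell in row]
                    writer.writerow(row)
```

A call such as `merge_csv_files(sys.argv[1], os.path.join(sys.argv[1], "root-dir.csv"))` would mirror the original script. If the input files turn out to use a different encoding, pass it explicitly through `encoding`.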

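【Comments】:

I tried using it, but it did not read the data correctly. When I tried to open one of the original files, it raised an error: 'utf8' codec can't decode byte 0xff in position 0. When I removed the first 2 bytes of the file, it raised another error: line contains NULL byte
@Oleg It sounds like your data files are UTF-16, not UTF-8.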
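
A leading byte 0xff followed by NULL-byte errors is consistent with a UTF-16 little-endian byte order mark (FF FE). One way to check this before choosing an encoding is to look at the first bytes of the file. The sketch below is illustrative only: `guess_encoding` is a hypothetical helper, it recognizes only files that start with a BOM, and it falls back to an assumed default otherwise. It uses only the BOM constants from the standard `codecs` module and runs on Python 2.7 and Python 3.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(path, default="utf-8"):
    """Return a codec name based on the file's BOM, or `default` if there is none."""
    with open(path, "rb") as f:
        head = f.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default
```

The returned names ("utf-16", "utf-32", "utf-8-sig") are codecs that consume the BOM themselves, so the result can be passed as the `encoding` argument of `UnicodeReader` or `io.open` without stripping bytes from the file by hand.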