CSV parsing in Python: answers

【Question title】: CSV parsing in Python
【Posted】: 2014-04-22 19:49:44
【Question】:

I want to parse a csv file in the following format:

Test Environment INFO for 1 line.
Test,TestName1,
TestAttribute1-1,TestAttribute1-2,TestAttribute1-3
TestAttributeValue1-1,TestAttributeValue1-2,TestAttributeValue1-3

Test,TestName2,
TestAttribute2-1,TestAttribute2-2,TestAttribute2-3
TestAttributeValue2-1,TestAttributeValue2-2,TestAttributeValue2-3

Test,TestName3,
TestAttribute3-1,TestAttribute3-2,TestAttribute3-3
TestAttributeValue3-1,TestAttributeValue3-2,TestAttributeValue3-3

Test,TestName4,
TestAttribute4-1,TestAttribute4-2,TestAttribute4-3
TestAttributeValue4-1-1,TestAttributeValue4-1-2,TestAttributeValue4-1-3
TestAttributeValue4-2-1,TestAttributeValue4-2-2,TestAttributeValue4-2-3
TestAttributeValue4-3-1,TestAttributeValue4-3-2,TestAttributeValue4-3-3

and I would like to convert it to a tab-separated format like this:

TestName1
TestAttribute1-1 TestAttributeValue1-1
TestAttribute1-2 TestAttributeValue1-2
TestAttribute1-3 TestAttributeValue1-3

TestName2
TestAttribute2-1 TestAttributeValue2-1
TestAttribute2-2 TestAttributeValue2-2
TestAttribute2-3 TestAttributeValue2-3


TestName3
TestAttribute3-1 TestAttributeValue3-1
TestAttribute3-2 TestAttributeValue3-2
TestAttribute3-3 TestAttributeValue3-3

TestName4
TestAttribute4-1 TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2 TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3 TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3

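The number of TestAttributes varies from test to test. Some tests have only 3 values, others have 7, and so on. Also, as in the TestName4 example, some tests are executed more than once, so each execution has its own TestAttributeValue line. (In the example, TestName4 was executed 3 times, so there are 3 value lines.)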

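I'm new to Python and don't know much yet, but I'd like to parse this csv file with Python. I looked at Python's 'csv' library and I'm not sure whether it is enough for my case or whether I should write my own string parser. Could you help me?

Best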

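【Comments】:

Have you actually tried the csv module? Did it work? If not, what didn't work?
Using csv.reader with the delimiter parameter set to "," lets you retrieve the contents of the file as lists of strings. From there you need to reformat the whole structure.
@LutzHorn Actually I haven't been able to look at the csv module in detail yet; I hope to have time in a few hours. But as far as I understand, in my case it would only help with splitting the text at the ",". So what is the csv module good for here? I could check for "," myself by writing a simple text parser. I'm curious whether the csv module is more useful than just finding "," and splitting the values for my case. I don't know whether I'm looking for magic :)
CSV could also be called DSV: delimiter-separated values. The delimiter can also be a space. You should 1) find a way to split the input into blocks, and 2) parse those blocks as CSV.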
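
The two steps suggested in the last comment could be sketched roughly as follows. This is only a minimal sketch, assuming blocks are separated by blank lines as in the sample from the question; split_blocks and parse_blocks are hypothetical helper names, not part of the csv module.

```python
import csv


def split_blocks(lines):
    """Yield lists of consecutive non-blank lines; blank lines separate blocks."""
    block = []
    for line in lines:
        if line.strip():
            block.append(line)
        elif block:
            yield block
            block = []
    if block:
        yield block


def parse_blocks(text):
    """Step 1: split the text into blocks. Step 2: parse each block as CSV."""
    return [list(csv.reader(block)) for block in split_blocks(text.splitlines())]


# Shortened, made-up sample in the same shape as the question's data
sample = (
    "Test,TestName1,\n"
    "A1,A2\n"
    "V1,V2\n"
    "\n"
    "Test,TestName2,\n"
    "B1,B2\n"
    "W1,W2\n"
)
for rows in parse_blocks(sample):
    print(rows)
```

Each block comes back as a list of rows, so the header row and the attribute/value rows of one test stay together and can be reformatted independently of the other tests.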

Tags: python parsing csv

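【Solution 1】:

The following does what you want and reads at most one section at a time (which saves memory for large files). Replace in_path and out_path with your input and output file paths, respectively: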

import csv

in_path = 'input.csv'    # replace with your input file path
out_path = 'output.tsv'  # replace with your output file path

def print_section(section, f_out):
    if len(section) > 0:
        # find the length of the longest row in the section
        max_len = max(len(row) for row in section)
        # transpose: output line i holds field i of every row ('' if a row is too short)
        for i in range(max_len):
            f_out.write('\t'.join(row[i] if len(row) > i else '' for row in section) + '\n')
        f_out.write('\n')

# csv.reader is not a context manager, so open the files first and wrap the input
with open(in_path, newline='') as f_in, open(out_path, 'w') as f_out:
    reader = csv.reader(f_in)
    next(reader)  # skip the leading info line
    section = []
    for line in reader:
        # test for new "Test" section
        if len(line) == 3 and line[0] == 'Test' and line[2] == '':
            # write previous section data
            print_section(section, f_out)
            # reset section
            section = []
            # write new section header
            f_out.write(line[1] + '\n')
        elif line:
            # add non-empty line to section (blank separator lines are skipped)
            section.append(line)
    # print the last section
    print_section(section, f_out)

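Note that you need to change 'Test' in the line[0] == 'Test' check to whatever word marks a header line in your file.

The basic idea here is that we read each section into a list of lists and then write it back out transposed, using a list comprehension (which also adds blank elements where the columns are uneven).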
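
The transposition idea described above can be looked at in isolation. The following is a minimal sketch, not the answer's exact code; transpose_padded is a hypothetical helper name and the section data is made up for illustration.

```python
def transpose_padded(rows, fill=''):
    """Transpose a list of rows, padding rows that are too short with `fill`."""
    width = max((len(row) for row in rows), default=0)
    return [
        [row[i] if i < len(row) else fill for row in rows]
        for i in range(width)
    ]


section = [
    ['Attr1', 'Attr2', 'Attr3'],
    ['Val1-1', 'Val1-2', 'Val1-3'],
    ['Val2-1', 'Val2-2'],  # deliberately one field short
]
for out_row in transpose_padded(section):
    print('\t'.join(out_row))
```

The short third row produces an empty last field on the Attr3 line instead of silently losing Attr3 and Val1-3.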

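【Comments】:

a) Use the csv module when dealing with delimited files, b) to transpose a matrix, use zip(*iterable)
@SteinarLima a) The module is used now. However, it does not reduce the complexity in this case. b) zip(*iterable) silently drops data when the columns are uneven. In my experience, very few users want their data to disappear that way.
b) izip_longest from itertools can be used if you don't want that behavior.
@SteinarLima Thanks! I forgot to check itertools. I will probably update the code above after work today.
The csv module is superior to split(',') in many ways, most importantly because it handles quoting. The line 1,"me, you and him",2 should be split into 3 parts, not 4.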
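
Both points from these comments can be checked with a short sketch. It assumes Python 3, where the function the comment calls izip_longest is named itertools.zip_longest; the variable names and the uneven sample rows are made up for illustration.

```python
import csv
from itertools import zip_longest

line = '1,"me, you and him",2'

naive = line.split(',')            # also splits inside the quoted field
parsed = next(csv.reader([line]))  # honours the quotes

rows = [['a', 'b', 'c'], ['1', '2']]  # second row is one field short
truncated = list(zip(*rows))                     # stops at the shortest row
padded = list(zip_longest(*rows, fillvalue=''))  # pads the short row instead

print(len(naive), naive)
print(len(parsed), parsed)
print(truncated)
print(padded)
```

With split the quoted field is cut in two, giving 4 parts, while csv.reader returns 3. Likewise zip drops the 'c' column entirely, while zip_longest keeps it with an empty partner value.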

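【Solution 2】:

I would use a solution based on the itertools.groupby function and the csv module. Take a close look at the itertools documentation. You can use it more often than you might think!

I use the blank lines to separate the datasets. This approach uses lazy evaluation and keeps only one dataset in memory at a time: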

import csv
from itertools import groupby

# In Python 3 the csv module expects text-mode files opened with newline=''
with open('my_data.csv', newline='') as ifile, \
        open('my_out_data.csv', 'w', newline='') as ofile:
    # Use the csv module to handle reading and writing of delimited files.
    reader = csv.reader(ifile)
    writer = csv.writer(ofile, delimiter='\t')
    # Skip info line
    next(reader)
    # Group consecutive rows by whether they are non-empty, then keep
    # only the non-empty groups (one group per dataset)
    for group in (v for k, v in groupby(reader, lambda x: bool(len(x))) if k):
        test_data = list(group)
        # Write header
        writer.writerow([test_data[0][1]])
        # Write transposed data (zip stops at the shortest row)
        writer.writerows(zip(*test_data[1:]))
        # Write blank line
        writer.writerow([])

Output, assuming the provided data is stored in my_data.csv:

TestName1
TestAttribute1-1    TestAttributeValue1-1
TestAttribute1-2    TestAttributeValue1-2
TestAttribute1-3    TestAttributeValue1-3

TestName2
TestAttribute2-1    TestAttributeValue2-1
TestAttribute2-2    TestAttributeValue2-2
TestAttribute2-3    TestAttributeValue2-3

TestName3
TestAttribute3-1    TestAttributeValue3-1
TestAttribute3-2    TestAttributeValue3-2
TestAttribute3-3    TestAttributeValue3-3

TestName4
TestAttribute4-1    TestAttributeValue4-1-1 TestAttributeValue4-2-1 TestAttributeValue4-3-1
TestAttribute4-2    TestAttributeValue4-1-2 TestAttributeValue4-2-2 TestAttributeValue4-3-2
TestAttribute4-3    TestAttributeValue4-1-3 TestAttributeValue4-2-3 TestAttributeValue4-3-3
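
The grouping step this solution relies on can be tried on its own. Below is a minimal sketch using in-memory rows instead of a file; the row data is made up, and key=bool is used here as a shorter equivalent of the lambda, since an empty list is falsy.

```python
from itertools import groupby

# Rows as csv.reader would yield them; blank lines come back as empty lists
rows = [
    ['Test', 'TestName1', ''],
    ['A1', 'A2'],
    ['V1', 'V2'],
    [],
    [],
    ['Test', 'TestName2', ''],
    ['B1', 'B2'],
    ['W1', 'W2'],
]

# groupby yields (key, run) pairs for consecutive rows with the same key;
# keeping only the runs whose key is True discards the blank separators.
groups = [list(run) for non_empty, run in groupby(rows, key=bool) if non_empty]

for group in groups:
    print(group[0][1], len(group) - 1)
```

Because groupby only groups consecutive items, several blank lines in a row collapse into a single discarded run, so extra blank lines between datasets do no harm.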

【Comments】: