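【Question title】: How to split a CSV file on blank rows
【Posted】: 2019-04-23 20:05:53
【Question description】:

I have a CSV file in which a new subject starts after two blank rows. I want to split this file into two separate files. How can I do that?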

................
................                
Biology I               
BGS Shivamogga I PUC    Exam Results            
Student Exam    # Questions Correct Answers Score %
ADARSHGOUDA M MUDIGOUDAR    Biology I - Chapter 1   35  23  65.70%
ADARSHGOUDA M MUDIGOUDAR    Biology I - Chapter 1   35  29  82.90%
ADARSHGOUDA M MUDIGOUDAR    Biology I - Chapter 1   35  32  91.40%
.
.
.
.

................
................                
Chemistry I             
BGS Shivamogga I PUC    Exam Results            
Student Exam    # Questions Correct Answers Score %
AISHWARYA P Chemistry I - Chapter 1 29  20  69.00%
MAHARUDRASWAMY M S  Chemistry I - Chapter 1 29  14  48.30%
NIKHIL B    Chemistry I - Chapter 1 29  20  69.00%

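I have tried splitting the DataFrame using dropna and skiprows, but I don't want to hard-code the number of rows. I want to split on the two blank rows that precede each section.

【Comments】:

  • Are you trying to do this in plain Python? What you mention sounds like you are using pandas and its read_csv...?
  • You say "split this file into two different files", but you also talk about "splitting the DataFrame". Which is it? Please create a minimal reproducible example and describe the desired result in detail.

Tags: python python-3.x pandas csv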


【Solution 1】:

I would do something along these lines:

with open('input.txt','r') as input_file:
    data_str = input_file.read()
    data_array = data_str.split('\n\n') # Split on all instances of double new lines
    for i, smaller_data in enumerate(data_array):   
        with open(f'new_file_{i}.txt','w') as new_data_file:
            new_data_file.write(smaller_data)
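
If the sections are separated by two or more blank lines rather than one, a regular expression can make that explicit and also tolerate \r\n line endings. The following is a minimal sketch and not part of the original answer: the names SECTION_BREAK, split_sections and write_sections are hypothetical, blank lines are assumed to be truly empty (no spaces), and the file is still treated as plain text, so quoted CSV fields that themselves contain blank lines would be split incorrectly.

```python
import re

# Three or more consecutive line breaks mean two or more blank lines.
# Each line break may be "\n" or "\r\n".
SECTION_BREAK = re.compile(r'(?:\r?\n){3,}')

def split_sections(text):
    """Return the non-empty chunks of text separated by 2+ blank lines."""
    chunks = SECTION_BREAK.split(text)
    return [chunk.strip('\r\n') for chunk in chunks if chunk.strip()]

def write_sections(input_name, prefix='new_file_'):
    """Write each section of input_name to its own numbered text file."""
    # newline='' keeps the original line endings untouched on read and write
    with open(input_name, newline='') as input_file:
        sections = split_sections(input_file.read())
    for i, section in enumerate(sections):
        with open(f'{prefix}{i}.txt', 'w', newline='') as new_data_file:
            new_data_file.write(section + '\n')
```

Calling write_sections('input.txt') would produce new_file_0.txt, new_file_1.txt, and so on, one per section.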

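【Comments】:

  • This assumes the file uses only \n as the line separator. CSV column values can also contain newlines, and a proper CSV file would actually use CRLF (\r\n) line separators. Using the csv module takes care of both details for us.
【Solution 2】:

I would just use the csv module, passing rows from a csv.reader() object to a csv.writer() object and keeping a count of consecutive blank rows along the way. Each time more than one blank row has been seen, replace the writer object with one for a new file.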

You can detect blank rows with the any() function, because a blank row contains only empty strings or no values at all:

isblank = not any(row)
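
To illustrate what that test sees, here is a small sketch using an in-memory sample; the sample text and the variable names are made up for the illustration. A truly empty line comes back from csv.reader as an empty list, and a line holding only delimiters comes back as a list of empty strings, so both count as blank.

```python
import csv
import io

# Assumed sample input: a data row, an empty line, a line holding only
# delimiters, and another data row.
sample = io.StringIO('a,b,c\r\n\r\n,,\r\nd,e,f\r\n', newline='')

rows = list(csv.reader(sample))
# rows is [['a', 'b', 'c'], [], ['', '', ''], ['d', 'e', 'f']]

blank_flags = [not any(row) for row in rows]
# blank_flags is [False, True, True, False]
```

Note that a cell containing only spaces is a non-empty string, so a row such as [' '] would not be treated as blank by this test.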

Assuming numbered files in the same directory are sufficient, this should work:

import csv
from pathlib import Path

def gen_outputfiles(outputdir, basefilename):
    """Generate open files ready for CSV writing, in outputdir using basefilename

    Numbers are inserted between the basefilename stem and suffix; e.g.
    foobar.csv becomes foobar001.csv, foobar002.csv, etc.

    """
    outputbase = Path(basefilename)
    outputstem, outputsuffix = outputbase.stem, outputbase.suffix
    counter = 0
    while True:
        counter += 1
        # Build the numbered path first, then open it for CSV writing
        outputpath = outputdir / f'{outputstem}{counter:03d}{outputsuffix}'
        yield outputpath.open(mode='w', newline='')

def split_csv_on_doubleblanks(inputfilename, basefilename=None, **kwargs):
    """Copy CSV rows from inputfilename to numbered files based on basefilename

    A new numbered target file is created after 2 or more blank rows have been
    read from the input CSV file.

    """
    inputpath = Path(inputfilename)
    outputfiles = gen_outputfiles(inputpath.parent, basefilename or inputpath.name)

    with inputpath.open(newline='') as inputfile:
        reader = csv.reader(inputfile, **kwargs)
        outputfile = next(outputfiles)
        writer = csv.writer(outputfile, **kwargs)
        blanks = 0
        try:
            for row in reader:
                isblank = not any(row)
                if not isblank and blanks > 1:
                    # skipped more than one blank row before finding a non-blank
                    # row. Open a new output file
                    outputfile.close()
                    outputfile = next(outputfiles)
                    writer = csv.writer(outputfile, **kwargs)
                blanks = blanks + 1 if isblank else 0
                writer.writerow(row)
                writer.writerow(row)
        finally:
            if not outputfile.closed:
                outputfile.close()

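Note that I copy the blank rows too, so your files will end up with multiple trailing blank rows. This can be remedied by replacing the blanks counter with a list of blank rows, which is written to the writer object at the point where you would reset the counter, but only if the list holds a single element. That way single blank rows are preserved.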
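
The list-based variation described above could be sketched roughly as follows. This is an illustration rather than the answerer's code: the function name split_rows_on_doubleblanks is hypothetical, and it groups rows in memory instead of writing files so that the buffering logic is easy to follow.

```python
def split_rows_on_doubleblanks(rows):
    """Group rows into sections, splitting on runs of 2+ blank rows.

    A single blank row is kept inside its section. Runs of two or more
    blank rows are dropped and start a new section. Blank rows at the
    very end of the input are dropped as well.
    """
    sections = [[]]
    pending_blanks = []  # blank rows seen since the last non-blank row
    for row in rows:
        if not any(row):
            pending_blanks.append(row)
            continue
        if len(pending_blanks) > 1:
            # two or more blanks: start a new section, discard the blanks
            if sections[-1]:
                sections.append([])
        else:
            # zero or one blank row: keep it in the current section
            sections[-1].extend(pending_blanks)
        pending_blanks = []
        sections[-1].append(row)
    return [section for section in sections if section]
```

The rows argument can be any iterable of rows, including a csv.reader() object; each returned section can then be written out with csv.writer().writerows().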

【Comments】: