【Question Title】: How to split a data file into multiple parts, with the comments included in each split file?
【Posted】: 2016-06-18 07:13:14
【Question】:

I have a data file like this:

# coating file for detector A/R
# column 1 is the angle of incidence (degrees)
# column 2 is the wavelength (microns)
# column 3 is the transmission probability
# column 4 is the reflection probability
      14.2000     0.531000    0.0618000     0.938200
      14.2000     0.532000    0.0790500     0.920950
      14.2000     0.533000    0.0998900     0.900110
# it has lots of other lines
# datafile can be obtained from pastebin

The input data file is available at: http://pastebin.com/NaNbEm3E

I would like to create 20 files from this input, so that each file also contains the comment lines.

That is:

#out1.txt
#comments
   first part of one-twentieth data

# out2.txt
# given comments
   second part of one-twentieth data

# and so on upto out20.txt

How can we do this in Python?

My initial attempt was this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author    : Bhishan Poudel
# Date      : May 23, 2016


# Imports
from __future__ import print_function
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# read in the comment lines from the file
infile = 'filecopy_multiple.txt'
outfile = 'comments.txt'
comments = []
with open(infile, 'r') as fi, open(outfile, 'w') as fo:
    for line in fi:
        if line.startswith('#'):
            comments.append(line)
            print(line, end='')
            fo.write(line)


#==============================================================================
# read in the data (comment lines are skipped)
#==============================================================================
colnames = ['angle', 'wave', 'trans', 'refl']
print('\nreading file : ', infile)
df = pd.read_csv(infile, sep=r'\s+', header=None,
                 comment='#', names=colnames, usecols=(0, 1, 2, 3))
print('length of df : ', len(df))


# write 20 files (integer division so the group keys are valid integers)
nfiles = 20
nrows = len(df) // nfiles
groups = df.groupby(np.arange(len(df.index)) // nrows)
for frameno, frame in groups:
    frame.to_csv("output_%s.csv" % frameno, index=None, header=None, sep='\t')

So far this gives me the 20 split files. I just want to copy the comment lines into each of them. The question is: how do I do that?

There should be a simpler way than creating another 20 output files containing only the comments and then appending the 20 split files to them.

Some useful links:
How to split a dataframe column into multiple columns
How to split a DataFrame column in python
Split a large pandas dataframe
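For reference, the whole task can also be sketched in a single pass without pandas; this is only a sketch (the function name `split_with_comments` and the file names are my own, not from the question), using `numpy.array_split` so the data rows need not divide evenly by 20:

```python
import numpy as np

def split_with_comments(infile, nfiles=20, prefix='out'):
    """Split a data file into nfiles parts, prepending the comment lines to each."""
    with open(infile) as f:
        lines = f.readlines()
    # separate the '#' header/comment lines from the data rows
    comments = [ln for ln in lines if ln.lstrip().startswith('#')]
    data = [ln for ln in lines if not ln.lstrip().startswith('#') and ln.strip()]
    # np.array_split tolerates len(data) not being divisible by nfiles
    for i, chunk in enumerate(np.array_split(data, nfiles), start=1):
        with open('{}{}.txt'.format(prefix, i), 'w') as fo:
            fo.writelines(comments)
            fo.writelines(chunk)
```

Each output file then begins with the full comment block followed by its share of the data rows.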

【Question Comments】:

  • It's not quite clear why you need pandas/DataFrames in this case... Do you want to keep the existing file format, or save the split files as plain CSV or HDF5 files?
  • @MaxU I want to save the split files as plain CSV files, so that each of the 20 output files has the same comment header as the input file.
  • Does your original CSV file fit in RAM, or do you have to read it line by line?
  • @MaxU My original CSV file fits in RAM; it is not a very large file.

Tags: python file-io


【Solution 1】:

Update: optimized code

fn = r'D:\download\input.txt'

with open(fn, 'r') as f:
    data = f.readlines()

comments_lines = 0
for line in data:
    if line.strip().startswith('#'):
        comments_lines += 1
    else:
        break

nfiles = 20
chunk_size = (len(data)-comments_lines)//nfiles

for i in range(nfiles):
    with open('d:/temp/output_{:02d}.txt'.format(i), 'w') as f:
        f.write(''.join(data[:comments_lines] + data[comments_lines+i*chunk_size:comments_lines+(i+1)*chunk_size]))
        if i == nfiles - 1 and len(data) > comments_lines+(i+1)*chunk_size:
            f.write(''.join(data[comments_lines+(i+1)*chunk_size:]))
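Either splitter can be sanity-checked after the fact; this is a sketch (the helper name `check_split` and the glob pattern are assumptions) that verifies every output file begins with the comment block and that no data rows were dropped:

```python
import glob

def check_split(comment_count, pattern='output_*.txt'):
    """Verify each output file starts with the comment block; return total data rows."""
    total = 0
    for fn in sorted(glob.glob(pattern)):
        with open(fn) as f:
            lines = f.readlines()
        # every file must begin with the full comment block
        assert all(ln.lstrip().startswith('#') for ln in lines[:comment_count]), fn
        total += len(lines) - comment_count
    return total  # should equal the number of data rows in the input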

Original answer:

comments = []
data = []

with open('input.txt', 'r') as f:
    data = f.readlines()

i = 0
for line in data:
        if line.strip().startswith('#'):
            comments.append(line)
            i += 1
        else:
            break

data[:] = data[i:]

chunk = len(data) // 20
i = 0
for x in range(0, len(data), chunk):
    with open('output_{:02d}.txt'.format(i), 'w') as f:
        f.write(''.join(comments + data[x:x+chunk]))
        i += 1

【Discussion】:

  • Traceback (most recent call last): File "split_file_with_cmets.py", line 25, in <module> data = [line] + f.readlines() ValueError: Mixing iteration and read methods would lose data
  • @MaxU_ I am on macOS 10.9, and this code shows the same error on both python2 and python3; I only removed the D:\download\ and d:/temp/ names. python3 shows the ValueError again.
  • @BhishanPoudel, I have updated my answer - please check
  • @MaxU_ Now it works for me on both python2 and python3. Thanks a lot!
  • @BhishanPoudel, I have optimized the code a bit so it loops far fewer times now - it should run faster
【Solution 2】:

This should do it:

# Store comments in this to use for all files
comments = []

# Create a new sub list for each of the 20 files
data = []
for _ in range(20):
    data.append([])

# Track line number
index = 0

# open input file
with open('input.txt', 'r') as fi:
    # fetch all lines at once so I can count them.
    lines = fi.readlines()

    # Loop to gather initial comments
    line = lines[index]
    while line.lstrip().startswith('#'):
        comments.append(line)
        index += 1
        line = lines[index]

    # Calculate how many lines of data
    numdata = len(lines) - len(comments)

    for i in range(index, len(lines)):
        # Integer division so filenum is a valid list index (0..19) on Python 3
        filenum = (i - index) * 20 // numdata
        # Append line to the appropriate sub-list
        data[filenum].append(lines[i])

for i in range(1, len(data) + 1):
    # Open output file
    with open('output{}.txt'.format(i), 'w') as fo:
        # Write comments
        for c in comments:
            fo.write(c)
        # Write data
        for line in data[i-1]:
            fo.write(line)
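The row-to-file mapping used above depends on integer division; isolated as a helper (a hypothetical name, for illustration only), it distributes rows evenly across the 20 files:

```python
def file_for_row(row, total_rows, nfiles=20):
    """Map a zero-based data-row index to its output file number (0..nfiles-1)."""
    return row * nfiles // total_rows
```

With `/` instead of `//` this returns a float on Python 3 and fails as a list index, which is the subtle bug to watch for when porting.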

【Discussion】:

  • @piRSquared_ Thanks a lot.