Python: reading and writing a file with a complex, repeating format

【Question title】: Python: Read and write the file of complex and repeating format
【Posted】: 2013-12-13 20:26:32
【Question】:

First of all, sorry for my poor English. I have a file with a repeating format, such as:

      326                                         Iteration:       0 #Bonds:       10
    1    6    7   14   54   70   77    0    0    0    0    0    1  0.693  0.632  0.847  0.750  0.644  0.000  0.000  0.000  0.000  0.000  3.566  0.000  0.028
    2    6    3    6   15   55    0    0    0    0    0    0    1  0.925  0.920  0.909  0.892  0.000  0.000  0.000  0.000  0.000  0.000  3.645  0.000 -0.040
    3    6    2    8   10   52    0    0    0    0    0    0    1  0.925  0.910  0.920  0.898  0.000  0.000  0.000  0.000  0.000  0.000  3.653  0.000  0.000
...
  324    8  323    0    0    0    0    0    0    0    0    0  100  0.871  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.871  3.000 -0.493
  325    2  326    0    0    0    0    0    0    0    0    0  101  0.930  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.930  0.000  0.334
  326    8  325    0    0    0    0    0    0    0    0    0  101  0.930  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.930  3.000 -0.611
   637.916060425841        306.094529423257        1250.10511927236
  6.782126993565285E-006
      326 (repeating from here)                   Iteration:     100 #Bonds:       10
    1    6    7   14   54   64   70   77    0    0    0    0    1  0.885  0.580  0.819  0.335  0.784  0.709  0.000  0.000  0.000  0.000  4.111  0.000  0.025
    2    6    3    6   15   55    0    0    0    0    0    0    1  0.812  0.992  0.869  0.966  0.000  0.000  0.000  0.000  0.000  0.000  3.639  0.000 -0.034
    3    6    2    8   10   52    0    0    0    0    0    0    1  0.812  0.966  0.989  0.926  0.000  0.000  0.000  0.000  0.000  0.000  3.692  0.000  0.004

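As you can see, the first line is a header, lines 2~327 are the data I want to analyze, and lines 328 and 329 contain some numbers I do not want to use. The next "frame" starts at line 330 and has exactly the same format. This "frame" repeats more than 200,000 times.
I want to use columns 1~13 of the data in lines 2~327 of each frame. I also want to use the first number in the header line.

I want to analyze columns 3~12 of lines 2~327 in every repeated "frame" and print, for each row of that target matrix, the number of zero and nonzero values. I also want to print columns 1, 2 and 13. So the expected output file becomes: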

326
  1
1    6    5    5    1
2    6    4    6    1
...
325  2    1    9  101
326  8    1    9  101
326 (Next frame starts from here)
  2
1    6    5    5    1
2    6    4    6    1
...
326
  3
1    6    5    5    1
2    6    4    6    1
...

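Line 1: the first number of the header line.
Line 2: the frame number.
Lines 3~328: column 1 of the input, column 2 of the input, the number of nonzero values in input columns 3~12, the number of zeros in input columns 3~12, and column 13 of the input.
From the 4th line on: the same format repeats, as above.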

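So the result file has 2 header lines and 326 lines of analyzed data, 328 lines per frame in total. The next frame repeats the same format. Writing the result in this layout (5 characters per field) is preferred, because the file will be used for other purposes.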
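
The per-row transformation described above can be sketched as follows. This is only an illustration in Python 3 syntax, not code from the thread: the helper name `summarize_row` is hypothetical, and the `%5d` field width is an assumption based on the "5 characters per field" remark.

```python
def summarize_row(line):
    """Turn one input data line into the 5-field output row (hypothetical helper)."""
    fields = line.split()                  # split on any run of whitespace
    neighbours = fields[2:12]              # input columns 3-12
    zeros = sum(1 for x in neighbours if int(x) == 0)
    nonzeros = len(neighbours) - zeros
    values = (int(fields[0]), int(fields[1]), nonzeros, zeros, int(fields[12]))
    # Assumed layout: every value right-aligned in a 5-character field
    return "".join("%5d" % v for v in values)


# First data line of the sample input shown earlier
sample = ("    1    6    7   14   54   70   77    0    0    0    0    0    1"
          "  0.693  0.632  0.847  0.750  0.644  0.000  0.000  0.000  0.000  0.000"
          "  3.566  0.000  0.028")
print(summarize_row(sample))   # prints "    1    6    5    5    1"
```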

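My approach is to create 13 arrays for the 13 columns and fill them with a double for loop over the frames and the lines of each frame. But I don't know how to handle the output.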

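Below is my trial code (unfinished, it only reads the input), but it has many problems. linecache reads the whole line, not just the first number of the line. Every frame has 326+3=329 lines, but my code does not seem to work frame by frame. I welcome any help with analyzing this data. Thank you very much in advance.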

# Read the file
filename = raw_input("Enter the file name \n")
file = open(filename, 'r')

# Read the number of atom from header
import linecache
nnn = linecache.getline(filename, 1)
natoms = int(nnn)
singleframe = natoms + 3

# get number of frames
nlines = 0
for i1 in file:
    nlines = nlines +1
file.close()

nframes = nlines / singleframe

print 'no of lines are: ', nlines
print 'no of frames are: ', nframes
print 'no of atoms are:', natoms

# Create 1d string array
nrange = range(nlines)
data_lines = [None]*(nlines)

# Store whole input file into string array
file = open(filename, 'r')
i1=0
for i1 in nrange:
    data_lines[i1] = file.readline()
file.close()


# Create 1d array to store atomic data
at_index = [None]*natoms
at_type = [None]*natoms
n1 = [None]*natoms
n2 = [None]*natoms
n3 = [None]*natoms
n4 = [None]*natoms
n5 = [None]*natoms
n6 = [None]*natoms
n7 = [None]*natoms
n8 = [None]*natoms
n9 = [None]*natoms
n10 = [None]*natoms
molnr = [None]*natoms

nrange1= range(natoms)
nframe = range(nframes)

file = open('output_force','w')
print data_lines[9]
for j1 in nframe:
    start = j1*(natoms + 3) + 3
    for i1 in nrange1:
        line = data_lines[i1+start].split()  #Split each line based on spaces
        at_index[i1] = int(line[0])
        at_type[i1] = int(line[1])
        n1[i1]= int(line[2])
        n2[i1]= int(line[3])
        n3[i1]= int(line[4])
        n4[i1]= int(line[5])
        n5[i1]= int(line[6])
        n6[i1]= int(line[7])
        n7[i1]= int(line[8])
        n8[i1]= int(line[9])
        n9[i1]= int(line[10])
        n10[i1]= int(line[11])
        molnr[i1]= int(line[12])
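
Two details of the attempt above can be illustrated with a small sketch. `int()` cannot convert a whole header line such as `326 ... Iteration: 0 #Bonds: 10`, so only the first whitespace-separated token is parsed. And if each frame really is 1 header line, `natoms` data lines and 2 trailing lines, the data rows of frame `j` start at offset `j * (natoms + 3) + 1` in a zero-based list of lines. The function names and the tiny two-atom input below are made up for illustration.

```python
def parse_natoms(header_line):
    """Return the leading integer of a frame header line."""
    return int(header_line.split()[0])


def frame_rows(lines, frame_index, natoms):
    """Return the natoms data lines of one frame from a list of all lines."""
    frame_len = natoms + 3                  # 1 header + natoms rows + 2 trailing lines
    start = frame_index * frame_len + 1     # +1 skips this frame's header line
    return lines[start:start + natoms]


# Synthetic input with 2 atoms per frame (not real data)
lines = [
    "  2   Iteration: 0 #Bonds: 10",
    "1 6 7 0 0 0 0 0 0 0 0 0 1",
    "2 6 3 0 0 0 0 0 0 0 0 0 1",
    "637.9 306.0 1250.1",
    "6.78E-006",
    "  2   Iteration: 100 #Bonds: 10",
    "1 6 7 8 0 0 0 0 0 0 0 0 1",
    "2 6 3 9 0 0 0 0 0 0 0 0 1",
    "637.9 306.0 1250.1",
    "6.78E-006",
]
natoms = parse_natoms(lines[0])
print(frame_rows(lines, 1, natoms))
```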

【Question comments】:

Tags: python format data-analysis

【Solution 1】:

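Since you are working with a csv file, you should take a look at the csv module. I wrote some code that should do the trick.

This code assumes "good data". If your data set may contain errors (such as fewer than 13 columns, or fewer than 326 data rows), some changes should be made.

(Changed to work with Python 2.6.6)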

import csv
with open('mydata.csv') as in_file:
    with open('outfile.csv', 'wb') as out_file:
        csv_reader = csv.reader(in_file, delimiter=' ', skipinitialspace=True)
        csv_writer = csv.writer(out_file, delimiter = '\t')

        # Iterate over all rows in the file
        for i, header in enumerate(csv_reader):
            # Get the header data
            num = header[0]
            csv_writer.writerow([num])

            # Write frame number, starting with 1 (hence the +1 part)
            csv_writer.writerow([i+1])

            # Iterate over all data rows
            for _ in xrange(326):

                # Call next(csv_reader) to get the next row
                # Put inside a try ... except to avoid StopIteration exception
                # if end of file is found before reaching 326 lines
                try:
                    row = next(csv_reader)
                except StopIteration:
                    break
                # Count the zeros in columns 3-12 with a generator expression
                zeros = sum(1 for x in row[2:12] if x.strip() == '0')
                not_zeros = 10 - zeros
                # Write the data to output file
                out = [row[0].strip(), row[1].strip(),not_zeros, zeros, row[12].strip()]
                csv_writer.writerow(out)
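            # The for loop's else branch runs only if the loop was not
            # interrupted by break, i.e. all 326 data rows were read
            else:
                # Skip the two unused lines at the end of the frame
                next(csv_reader)
                next(csv_reader)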

For the first three data rows, this produces:

326
1
1   6   5   5   1
2   6   4   6   1
3   6   4   6   1
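
For input that may be malformed, the same frame loop can also be written with plain `str.split()` and explicit checks. The sketch below is a variant under stated assumptions rather than the answer's code. It targets Python 3, reads the atom count from each header instead of hard-coding 326, writes 5-character fields, and raises `ValueError` on a short or missing row. The name `summarize_frames` and its parameters are hypothetical.

```python
import io


def summarize_frames(in_file, out_file, trailer_lines=2, ncols=13):
    """Stream frames from in_file, write the summary to out_file, return the frame count."""
    lines = iter(in_file)
    frame_no = 0
    for header in lines:
        if not header.strip():
            continue                          # tolerate blank lines between frames
        frame_no += 1
        natoms = int(header.split()[0])       # first number of the header line
        out_file.write("%5d\n%5d\n" % (natoms, frame_no))
        for _ in range(natoms):
            row = next(lines, "").split()     # "" at end of file gives an empty row
            if len(row) < ncols:
                raise ValueError("frame %d: short or missing data row" % frame_no)
            zeros = sum(1 for x in row[2:12] if int(x) == 0)
            out_file.write("%5d%5d%5d%5d%5d\n" % (
                int(row[0]), int(row[1]), 10 - zeros, zeros, int(row[12])))
        for _ in range(trailer_lines):
            next(lines, "")                   # skip the unused lines after the data rows
    return frame_no


# Synthetic one-frame input with 2 atoms (not real data)
src = io.StringIO(
    "  2  Iteration: 0 #Bonds: 10\n"
    "  1  6  7 14  0  0  0  0  0  0  0  0  1  0.693\n"
    "  2  6  3  0  0  0  0  0  0  0  0  0  1  0.925\n"
    " 637.9 306.0 1250.1\n"
    " 6.78E-006\n"
)
dst = io.StringIO()
print(summarize_frames(src, dst))
print(dst.getvalue())
```

Because the function only needs an iterable of lines and an object with `write()`, it can be called with real file objects from `open()` and never holds more than one line in memory, which matters for an input with more than 200,000 frames.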

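【Comments】:

Thanks. I didn't even know there was a csv module. This is great. Thank you very much. The input file is not a csv file; it is produced by a Fortran program, so it has a uniform format. No need to worry about errors. Thanks.
The second line gives a syntax error at the comma. How can I get rid of it? This is a very basic question, but please bear with me, I have never used a module like this before.
Which version of Python are you using? (Type import sys;sys.version in the console to show it.)
I just typed python in my unix console, and it prints Python 2.6.6
OK, using multiple expressions in a with statement is supported in 2.7 and later. I'll change the example for you. (You should consider upgrading to Python 2.7 if possible.)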
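
The two `with` forms mentioned in the comments can be compared side by side. This is a small sketch, not code from the thread; the file names are placeholders, and the snippet first writes its own tiny input file so that it runs standalone.

```python
import os
import tempfile

tmpdir = tempfile.mkdtemp()
in_path = os.path.join(tmpdir, "in.txt")      # placeholder file names
out_path = os.path.join(tmpdir, "out.txt")
with open(in_path, "w") as f:
    f.write("326\n")

# Nested form: accepted by Python 2.6 as well as later versions
with open(in_path) as in_file:
    with open(out_path, "w") as out_file:
        out_file.write(in_file.read())

# Single-statement form: needs Python 2.7 (or 3.1) and later;
# on Python 2.6 the comma is reported as a syntax error
with open(in_path) as in_file, open(out_path, "a") as out_file:
    out_file.write(in_file.read())

with open(out_path) as f:
    print(f.read())
```

Both forms close the two files when the block exits, so the choice only depends on which Python versions have to be supported.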