Copy columns from multiple text files in Python: answers

【Question title】: Copy columns from multiple text files in Python
【Posted】: 2013-01-13 20:12:17
【Question description】:

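I have a large number of text files containing data arranged in a fixed number of rows and columns, with the columns separated by spaces (like .csv, but with spaces as the delimiter). I want to extract a given column from each file and write it to a new text file.

So far I have tried: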

results_combined = open('ResultsCombined.txt', 'w')

def combine_results():
    for num in range(2, 10):
        # all the text files have similar filename styles
        f = open("result_0." + str(num) + "_.txt", 'r')
        lines = f.readlines()   # read in the data
        no_lines = len(lines)   # get the number of lines

        for i in range(0, no_lines):
            column = lines[i].strip().split(" ")
            # one value per output line, so the columns end up stacked
            results_combined.write(column[5] + " " + '\n')

        f.close()

if __name__ == "__main__":
    combine_results()

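This produces a text file containing the data I want from the separate files, but as a single column (i.e. I have managed to "stack" the columns on top of each other instead of placing them side by side as separate columns). I feel I am missing something obvious.

In another attempt, I managed to write all the individual files into a single file, but without selecting the column I want.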

import glob

files = [open(f) for f in glob.glob("result_*.txt")]
fout = open("ResultsCombined.txt", 'w')

for row in range(0, 488):
    for f in files:
        fout.write(f.readline().strip())
        fout.write(' ')
    fout.write('\n')

fout.close()

What I basically want is to copy column 5 from each file (it is always the same column) and write them all to a single file.

【Question discussion】:

Tags: python csv file-io

【Solution 1】:

If you do not know the maximum number of rows in a file, and the files fit into memory, the following solution will work:

import glob

files = [open(f) for f in glob.glob("*.txt")]

# Given a file, read the 6th column (index 5) of each line
def readcol5(f):
    return [line.strip().split(' ')[5] for line in f]

filecols = [readcol5(f) for f in files]
maxrows = len(max(filecols, key=len))

# Given a list, pad it with empty strings up to maxrows elements.
def extendmin(arr):
    diff = maxrows - len(arr)
    arr.extend([''] * diff)
    return arr

filecols = [extendmin(col) for col in filecols]

# Transpose: one output row per input row, one field per file
lines = zip(*filecols)
lines = [','.join(x) for x in lines]
lines = '\n'.join(lines)

fout = open('output.csv', 'w')
fout.write(lines)
fout.close()

for f in files:
    f.close()
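
The padding step can also be left to the standard library. The following is a minimal sketch, not part of the original answer: the function names `read_column` and `combine_columns` and their parameters are made up for illustration, and it assumes every non-blank line has at least `index + 1` whitespace-separated fields. `itertools.zip_longest` fills the shorter columns with empty strings, which plays the role of `extendmin`.

```python
from itertools import zip_longest


def read_column(path, index=5):
    """Return one whitespace-separated column of a file as a list of strings."""
    with open(path) as f:
        # Skip blank lines so a trailing empty line does not raise IndexError.
        return [line.split()[index] for line in f if line.strip()]


def combine_columns(paths, out_path, index=5, sep=","):
    """Write the chosen column of every input file side by side."""
    columns = [read_column(p, index) for p in paths]
    with open(out_path, "w") as out:
        # zip_longest pads shorter columns with '' instead of truncating.
        for row in zip_longest(*columns, fillvalue=""):
            out.write(sep.join(row) + "\n")
```

It could be called as `combine_columns(sorted(glob.glob("result_*.txt")), "output.csv")` after `import glob`. Sorting the paths only makes the column order predictable, since `glob.glob` does not guarantee an order.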

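【Discussion】:

Thanks! I really like this solution: the number of files and rows can vary depending on the variables I use in the simulations I am running, so this way I don't have to check what the highest row count is each time.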

【Solution 2】:

Or this option (following your second approach):

import glob

files = [open(f) for f in glob.glob("result_*.txt")]  
fout = open ("ResultsCombined.txt", 'w')

for row in range(0,488):
   for f in files:
       fout.write(f.readline().strip().split(' ')[5])
       fout.write(' ')
   fout.write('\n')

fout.close()

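... This uses a fixed number of rows per file, but it works for a very large number of rows because it does not store intermediate values in memory. For a moderate number of rows, I would expect the first answer's solution to run faster.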
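
If the row count is not known in advance, the same streaming idea can be written without the hard-coded 488. This is a rough sketch under a few assumptions: `stream_columns` and its parameters are illustrative names, every line is assumed to have at least `index + 1` whitespace-separated fields, and the number of input files is assumed to stay below the operating system's open-file limit. Iterating over the file objects with `zip` reads one line from each file per step and stops at the shortest file.

```python
from contextlib import ExitStack


def stream_columns(paths, out_path, index=5, sep=" "):
    """Combine one column from each file without loading whole files."""
    with ExitStack() as stack:
        # ExitStack closes every opened file when the block exits.
        files = [stack.enter_context(open(p)) for p in paths]
        out = stack.enter_context(open(out_path, "w"))
        # zip pulls one line from every file per step and stops as soon
        # as the shortest file is exhausted.
        for lines in zip(*files):
            out.write(sep.join(line.split()[index] for line in lines) + "\n")
```

Only one line per input file is held in memory at a time, so this keeps the memory behaviour described above.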

【Discussion】:

Thanks, I like how compact this approach is, and I may sometimes have to deal with a large number of rows.

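【Solution 3】:

Why not read all the entries of column 5 from each file into a list and, once all the files have been read, write them all to the output file?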

data = [
    [], # entries from first file
    [], # entries from second file
    ...
]

# number_of_rows and outfile are assumed to be defined elsewhere
for i in range(number_of_rows):
    outputline = []
    for vals in data:
        outputline.append(vals[i])
    outfile.write(" ".join(outputline) + "\n")  # end each output row
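
Since the question describes the files as .csv-like with spaces as the delimiter, one way to fill in the outline above is with the `csv` module. This is only a sketch: `read_column_csv` and `write_side_by_side` are hypothetical names, fields are assumed to be separated by exactly one space, and all files are assumed to have the same number of rows (the shortest column is used as a safeguard).

```python
import csv


def read_column_csv(path, index=5):
    """Read one column of a space-delimited file with the csv module."""
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=" ")
        # csv.reader yields [] for blank lines; skip those.
        return [row[index] for row in reader if row]


def write_side_by_side(paths, out_path, index=5):
    # data[k] holds the column entries taken from the k-th file.
    data = [read_column_csv(p, index) for p in paths]
    # Use the shortest column so vals[i] never runs past the end.
    number_of_rows = min((len(vals) for vals in data), default=0)
    with open(out_path, "w", newline="") as outfile:
        writer = csv.writer(outfile, delimiter=" ", lineterminator="\n")
        for i in range(number_of_rows):
            writer.writerow([vals[i] for vals in data])
```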

【Discussion】:

Thanks, I think this points me in the right direction!