os.walk-ing through a directory structure to read many CSV headers and write them to an output CSV

【Question title】: os.walk-ing through a directory structure to read many CSV headers and write them to an output CSV
【Posted】: 2018-09-21 22:57:52
【Question】:

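I have a folder that contains 60 folders, each of which contains about 60 CSVs (and 1 or 2 non-CSVs).

I need to compare the header rows of all of these CSVs, so I am trying to walk the directory and write to an output CSV (1) the file path of the file in question and (2) its header row, in the subsequent cells of the same row of the output CSV.

Then go to the next file and write the same information on the next row of the output CSV.

I am lost in the part where I write the header rows to the CSV, so lost that I have not even managed to generate an error message.

Can anyone advise on what to do next?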

import os
import sys
import csv

csvfile = '/Users/username/Documents/output.csv'

def main(args):

    # Open a CSV for writing outputs to
    with open(csvfile, 'w') as out:
        writer = csv.writer(out, lineterminator='\n')

        # Walk through the directory specified in cmd line
        for root, dirs, files in os.walk(args):
            for item in files:
                # Check if the item is a CSV
                if item.endswith('.csv'):
                    # If yes, read the first row
                    with open(item, newline='') as f:
                        reader = csv.reader(f)
                        row1 = next(reader)
                        # Write the first cell as the file name
                        f.write(os.path.realpath(item))
                        f.write(f.readline())
                        f.write('\n')
                        # Write this row to a new line in the csvfile var
                            # Go to next file

                # If not a CSV, go to next file
                else:
                    continue

                # Write each file to the CSV
                # writer.writerow([item])

if __name__ == '__main__':
    main(sys.argv[1])

【Comments on the question】:

Tags: python csv

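【Solution 1】:

IIUC you need a new CSV file with 2 columns: file_path and headers. If the headers you need are just the list of column names in each CSV, it is easier to collect these values in a pandas DataFrame first and then write the DataFrame to CSV.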

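import os
import sys

import pandas as pd

csvfile = '/Users/username/Documents/output.csv'  # output path from the question
args = sys.argv[1]  # directory to walk, as in the question

res = []
for root, dirs, files in os.walk(args):
    for item in files:
        # Check if the item is a CSV
        if item.endswith('.csv'):
            # Join with root: the bare file name only resolves in the current directory
            path = os.path.join(root, item)
            # nrows=0 parses the header row without loading the data
            df = pd.read_csv(path, nrows=0)
            row = {}
            row['file_path'] = os.path.realpath(path)
            # Store a plain list rather than a pandas Index object
            row['headers'] = list(df.columns)
            res.append(row)
res_df = pd.DataFrame(res)
res_df.to_csv(csvfile)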

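【Comments】:

Thank you, but I'm actually looking for 1 column with file_path followed by a variable number of columns. Basically, I want each column header to appear in a different column of output.csv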
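
To illustrate the layout requested in the comment above, here is a minimal sketch that uses the standard csv module instead of pandas. The helper names build_row and rows_to_csv_text and the sample paths and headers are made up for illustration; they do not come from either answer.

```python
import csv
import io


def build_row(file_path, headers):
    # Hypothetical helper: the path goes in the first cell and each
    # header gets its own cell after it.
    return [file_path, *headers]


def rows_to_csv_text(rows):
    # csv.writer accepts rows of different lengths, so files with
    # different numbers of columns need no padding.
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up paths and headers, for illustration only
sample = [
    build_row('a/one.csv', ['id', 'name']),
    build_row('b/two.csv', ['id', 'name', 'date', 'amount']),
]
text = rows_to_csv_text(sample)
print(text)
```

Writing to an in-memory buffer keeps the sketch free of side effects; with a real file, the same writerows call would presumably be made on a csv.writer wrapping a file opened with newline=''.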

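【Solution 2】:

You seem to be confused about which file you are reading and which you are writing. Confusion is normal when you try to do everything in one big function. The whole point of functions is to break things down so they are easier to follow, understand and debug.

Here is some code. It may not run as-is, but you can easily print out what each function returns, and once you know that is correct, you can feed it to the next function. Each function is small, with very few variables, so there is not much that can go wrong.

Most importantly, the variables in each function are local to it, which means they cannot interfere with what is happening anywhere else, or even mislead you into thinking they might (which makes a huge difference).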

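import csv
import os
import sys

def collect_csv_data(directory):
    # Walk the tree and pair each CSV's full path with its header row
    results = []
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith('.csv'):
                filepath = os.path.join(root, file)
                results.append((filepath, extract_headers(filepath)))
    return results

def extract_headers(filepath):
    # Return the first row of the CSV, or an empty list if the file is empty
    with open(filepath, newline='') as f:
        reader = csv.reader(f)
        headers = next(reader, [])
    return headers

def write_results(results, filepath):
    # One output row per input file: the path, then one header per cell
    with open(filepath, 'w', newline='') as f:
        writer = csv.writer(f)
        for path, headers in results:
            writer.writerow([path] + headers)

if __name__ == '__main__':
    directory = sys.argv[1]
    results = collect_csv_data(directory)
    write_results(results, 'results.csv')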
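
The advice above about checking each function's output before passing it on can be tried on a small throwaway directory. The following is only a sketch under assumptions: the helper names read_header, find_csv_paths and make_sample_tree, and the sample folder and file names, are hypothetical and not part of the answer.

```python
import csv
import os
import tempfile


def read_header(filepath):
    # First row of the CSV, or [] for an empty file
    with open(filepath, newline='') as f:
        return next(csv.reader(f), [])


def find_csv_paths(directory):
    # Full paths of every .csv file under directory, sorted for stable output
    paths = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            if name.endswith('.csv'):
                paths.append(os.path.join(root, name))
    return sorted(paths)


def make_sample_tree(base):
    # Two sub-folders with one CSV each, plus a non-CSV that should be skipped
    for folder, header in [('a', 'id,name'), ('b', 'id,date,amount')]:
        os.makedirs(os.path.join(base, folder))
        with open(os.path.join(base, folder, 'data.csv'), 'w') as f:
            f.write(header + '\n')
    with open(os.path.join(base, 'a', 'notes.txt'), 'w') as f:
        f.write('not a csv\n')


with tempfile.TemporaryDirectory() as base:
    make_sample_tree(base)
    # Step 1: inspect the list of files before reading any of them
    found = find_csv_paths(base)
    relative = [os.path.relpath(p, base) for p in found]
    print(relative)
    # Step 2: once the list looks right, inspect the headers
    headers = [read_header(p) for p in found]
    print(headers)
```

Printing after each step makes it clear which stage is at fault if the final output file looks wrong.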

【Comments】: